Hey Jo,

Thanks a lot! That must have been the problem; using the same files on a different machine, I got a bit further.

Now it still fails while parsing sfAndTotalCounts, but with a java.lang.ArrayIndexOutOfBoundsException (and no stack trace this time). Should I assume there is a line with only one column in the file? Perhaps I broke the TSV format with my modifications. Is there any quoting or escape character I should know about when reading and writing these TSV files?
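
In case it's useful, here's the quick check I'm planning to run: list every line that splits into fewer than two tab-separated fields (assuming the file really is plain tab-separated with no quoting):

  awk -F'\t' 'NF < 2 { print NR ": " $0 }' sfAndTotalCounts

That should print the line number and content of any offending row, if my one-column suspicion is right.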

By the way, how much RAM is needed to create the model from the files, in your experience? My modified files are about 60% of the size of the originals.

Best,
Alex

On 16-5-2014 14:22, Joachim Daiber wrote:
Hey Alex,

They should be UTF-8. I get the same error as you when not all of my bash language variables are set. Check with "locale" whether they are all set, and export LC_ALL etc. with a UTF-8 locale if they are not.
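
For example (assuming an en_US.UTF-8 locale is installed on the machine; "locale -a" shows what's available):

  locale                      # every LANG/LC_* entry should name a UTF-8 locale
  export LANG=en_US.UTF-8
  export LC_ALL=en_US.UTF-8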

Best,
Jo


On Fri, May 16, 2014 at 12:55 PM, Alex Olieman <[email protected]> wrote:

    Hi,

    Lately I've been tinkering with the raw data to see if I can create a
    model with a filtered set of disambiguation candidates, one that also
    spots more lowercase surface forms. When I train the model on the
    modified data, however, I run into character encoding issues.

    For example:
    > INFO 2014-05-16 03:16:29,310 main [WikipediaToDBpediaClosure] - Done.
    > INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Creating SurfaceFormSource...
    > INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Reading annotated and total counts...
    > Exception in thread "main" java.nio.charset.UnmappableCharacterException: Input length = 1
    >         at java.nio.charset.CoderResult.throwException(Unknown Source)
    >         at sun.nio.cs.StreamDecoder.implRead(Unknown Source)

    I had expected these files to be encoded in UTF-8, but it looks like
    this isn't the case. The chardet library tells me it is ISO-8859-2,
    a.k.a. Latin-2, instead. Can someone tell me in which character
    encoding the raw data (pig output) files should be for
    db.CreateSpotlightModel to read them correctly? If this really should
    be one of the "Western" character sets, I would expect it to be
    Latin-1 instead.
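
    If the files do need converting on my end, I assume something like
    this would do it:

      iconv -f ISO-8859-2 -t UTF-8 sfAndTotalCounts > sfAndTotalCounts.utf8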

    Best,
    Alex


    


