Hey Jo,

Thanks a lot! That must have been the problem; using the same files on a different machine, I got a bit further.

Now it still fails while parsing sfAndTotalCounts, but with a java.lang.ArrayIndexOutOfBoundsException (and no stack trace this time). Should I assume there is a line with only one column in the file? Perhaps I broke the TSV format with my modifications. Is there any quoting or escape character I should know about when reading and writing these TSV files?
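
In case it's useful, here's the quick check I'm planning to run: list every line that splits into fewer than two tab-separated fields (assuming the file really is plain tab-separated with no quoting):

  awk -F'\t' 'NF < 2 { print NR ": " $0 }' sfAndTotalCounts

That should print the line number and content of any offending row, if my one-column suspicion is right.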

By the way, how much RAM is needed to create the model from the files, in your experience? My modified files are about 60% of the size of the originals.

Best,
Alex

On 16-5-2014 14:22, Joachim Daiber wrote:
Hey Alex,

They should be UTF-8. I get the same error as you when not all of my bash language variables are set. Check with "locale" whether they are all set, and export LC_ALL etc. with a UTF-8 locale if they are not.
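
For example (assuming an en_US.UTF-8 locale is installed on the machine; "locale -a" shows what's available):

  locale                      # every LANG/LC_* entry should name a UTF-8 locale
  export LANG=en_US.UTF-8
  export LC_ALL=en_US.UTF-8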

Best,
Jo


On Fri, May 16, 2014 at 12:55 PM, Alex Olieman <[email protected]> wrote:

    Hi,

    Lately I've been tinkering with the raw data to see if I can create a
    model with a filtered set of disambiguation candidates, one that also
    spots more lowercase surface forms. When I train the model on the
    modified data, however, I run into character encoding issues.

    For example:
    > INFO 2014-05-16 03:16:29,310 main [WikipediaToDBpediaClosure] - Done.
    > INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Creating SurfaceFormSource...
    > INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Reading annotated and total counts...
    > Exception in thread "main" java.nio.charset.UnmappableCharacterException: Input length = 1
    >         at java.nio.charset.CoderResult.throwException(Unknown Source)
    >         at sun.nio.cs.StreamDecoder.implRead(Unknown Source)

    I had expected these files to be encoded in UTF-8, but it looks like
    this isn't the case. The chardet library tells me it is ISO-8859-2,
    a.k.a. Latin-2, instead. Can someone tell me in which character
    encoding the raw data (pig output) files should be for
    db.CreateSpotlightModel to read them correctly? If this really should
    be one of the "Western" character sets, I would expect it to be
    Latin-1 instead.
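
    If the files do need converting on my end, I assume something like
    this would do it:

      iconv -f ISO-8859-2 -t UTF-8 sfAndTotalCounts > sfAndTotalCounts.utf8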

    Best,
    Alex


    


