A short update for future readers:
Most of my issues were punishment for modifying the raw data on Windows.

 * The charset issue can be avoided by running with "java
   -Dfile.encoding=UTF-8 -cp ..."
 * For the array index issue: it could be tabs, but also: line endings
   must not include carriage returns! (at least when creating the
   model on Linux; Windows has no problem with them) See the sketch
   below this list.
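
A minimal sketch of both fixes (the file and jar names here are
placeholders, not the real invocation):

    # strip carriage returns so the file has Unix line endings
    sed -i 's/\r$//' sfAndTotalCounts

    # force UTF-8 regardless of the platform's default charset;
    # dbpedia-spotlight.jar stands in for the real classpath
    java -Dfile.encoding=UTF-8 -cp dbpedia-spotlight.jar db.CreateSpotlightModel ...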

Now it's just a matter of finding enough memory to run this ;-)


On 16-5-2014 15:02, Joachim Daiber wrote:
Check if there are lines with too many tab characters in them; you might have to remove them manually. I think I fixed that upstream (by ignoring such lines), but you might still run into the problem. For creating the full English model, you need around 20-25GB of memory.
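
A quick way to drop the offending lines, assuming the file should have
a fixed number of tab-separated columns (adjust NF == 3 to whatever
the expected column count actually is):

    # keep only lines with exactly three tab-separated fields
    awk -F'\t' 'NF == 3' sfAndTotalCounts > sfAndTotalCounts.clean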

Jo


On Fri, May 16, 2014 at 2:48 PM, Alex Olieman <[email protected]> wrote:

    Hey Jo,

    Thanks a lot! This must be the case; when using the same files on
    a different machine, I got a bit further.

    Now it still fails while parsing sfAndTotalCounts, but with a
    java.lang.ArrayIndexOutOfBoundsException (without a stack trace).
    Should I assume there is a line with only one column in the file?
    Perhaps I messed up the TSV format with my modifications. Is there
    any quoting or escape character I should know about when reading
    and writing the TSV files?
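
    Here's how I'd check for such lines (assuming at least two columns
    are expected; adjust the threshold as needed):

        # print the line number and content of any line with fewer
        # than two tab-separated fields
        awk -F'\t' 'NF < 2 { print NR ": " $0 }' sfAndTotalCounts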

    Btw, how much RAM is needed to create the model from files, in
    your experience? My modified files are about 60% of the size of
    the originals.

    Best,
    Alex


    On 16-5-2014 14:22, Joachim Daiber wrote:
    Hey Alex,

    They should be UTF-8. I get the same error as you when not all of
    my bash language variables are set. Check with "locale" whether
    they are all set, and export LC_ALL etc. with a UTF-8 value if
    they are not.
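
    Something like this (en_US.UTF-8 is just one choice; use whichever
    UTF-8 locale is installed on your machine):

        # inspect the current locale settings
        locale
        # force a UTF-8 locale for the whole environment
        export LC_ALL=en_US.UTF-8
        export LANG=en_US.UTF-8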

    Best,
    Jo


    On Fri, May 16, 2014 at 12:55 PM, Alex Olieman <[email protected]> wrote:

        Hi,

        Lately I've been tinkering with the raw data to see if I can
        create a model with a filtered set of disambiguation
        candidates, one that spots more lowercase surface forms. When
        I train the model on the modified data, however, I'm faced
        with character encoding issues.

        For example:
        >  INFO 2014-05-16 03:16:29,310 main [WikipediaToDBpediaClosure] - Done.
        >  INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Creating SurfaceFormSource...
        >  INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Reading annotated and total counts...
        > Exception in thread "main" java.nio.charset.UnmappableCharacterException: Input length = 1
        >         at java.nio.charset.CoderResult.throwException(Unknown Source)
        >         at sun.nio.cs.StreamDecoder.implRead(Unknown Source)

        I had expected these files to be encoded in UTF-8, but it
        looks like this isn't the case. The chardet library tells me
        it is ISO-8859-2, a.k.a. Latin-2, instead. Can someone tell me
        in which character encoding the raw data (pig output) files
        should be for db.CreateSpotlightModel to read them correctly?
        If this really should be one of the "Western" character sets,
        I would expect it to be Latin-1 instead.
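
        If they really are in Latin-2, I suppose I could re-encode
        them before training (ISO-8859-2 here because that's what
        chardet reported; substitute whatever encoding applies):

            # convert one of the counts files from Latin-2 to UTF-8
            iconv -f ISO-8859-2 -t UTF-8 sfAndTotalCounts > sfAndTotalCounts.utf8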

        Best,
        Alex


        





