Re: [Dbp-spotlight-users] Character encoding Raw Data

Pablo N. Mendes Fri, 16 May 2014 08:31:49 -0700

This is the kind of thing where the disk-backed models could be useful, no?
It would take longer, but at indexing time it is OK to wait. A second step
would just translate to the Mem models.
 On May 16, 2014 7:12 AM, "Alex Olieman" <[email protected]> wrote:


>  A short update for future readers:
> Most of my issues were punishment for modifying the raw data with Windows.
>
>    - The charset issue can be avoided by using "java ...
>    -Dfile.encoding=UTF-8 -cp ..."
>    - For the array index issue: could be tabs, but also: line endings
>    should not include carriage returns! (at least when creating the model on
>    linux; Win has no problem with this)
>
> Now it's just a matter of finding enough memory to run this ;-)
>
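A minimal sketch of the cleanup described in the update above, assuming Python is used to repair the files (the thread does not say which tool was used). The names `clean_tsv_lines`, `rewrite_file`, and the `expected_columns` parameter are hypothetical; the correct column count depends on the file being cleaned, and this does not reproduce Spotlight's own parser. It strips carriage returns and drops lines whose number of tab-separated fields differs from the expected count.

```python
from typing import Iterable, Iterator


def clean_tsv_lines(lines: Iterable[str], expected_columns: int) -> Iterator[str]:
    """Yield lines without CR/LF, skipping empty lines and lines
    whose tab-separated field count is not expected_columns.

    expected_columns is an assumption supplied by the caller.
    """
    for line in lines:
        stripped = line.rstrip("\r\n")  # removes "\n", "\r\n", and stray "\r"
        if not stripped:
            continue
        if len(stripped.split("\t")) != expected_columns:
            continue
        yield stripped


def rewrite_file(src: str, dst: str, expected_columns: int) -> int:
    """Write cleaned lines to dst as UTF-8 with LF endings; return lines kept."""
    kept = 0
    # newline="" keeps the original line endings visible when reading;
    # newline="\n" prevents translation to CRLF when writing on Windows.
    with open(src, encoding="utf-8", newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="\n") as fout:
        for line in clean_tsv_lines(fin, expected_columns):
            fout.write(line + "\n")
            kept += 1
    return kept
```

Passing the encoding and newline mode explicitly means the result does not depend on the platform defaults that caused the trouble in the first place.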
> On 16-5-2014 15:02, Joachim Daiber wrote:
>
> Check if there are lines with too many tab characters in them. You might
> have to remove them manually. I think I fixed that upstream (by ignoring the
> line if this is the case), but you might still run into that problem. For
> creating the full English model, you need around 20-25GB of memory.
>
>  Jo
>
>
> On Fri, May 16, 2014 at 2:48 PM, Alex Olieman <[email protected]> wrote:
>
>>  Hey Jo,
>>
>> Thanks a lot! This must be the case; when using the same files on a
>> different machine, I got a bit further.
>>
>> Now it still fails on parsing sfAndTotalCounts, but on a
>> java.lang.ArrayIndexOutOfBoundsException (without traceback). Should I
>> assume there is a line with only one column in the file? Perhaps I messed
>> up the TSV format with my modifications. Is there any quoting or escape
>> character I should know about when reading & writing the TSV files?
>>
>> Btw, how much RAM is needed to create the model from files in your
>> experience? My modified files are about 60% the size of the originals.
>>
>> Best,
>> Alex
>>
>>
>> On 16-5-2014 14:22, Joachim Daiber wrote:
>>
>> Hey Alex,
>>
>>  They should be UTF-8. I get the same error as you if not all of my bash
>> locale variables are set. Check with "locale" whether all of them are set,
>> and if not, export LC_ALL etc. with a UTF-8 value.
>>
>>  Best,
>> Jo
>>
>>
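The locale check suggested above can also be scripted. The following is a rough sketch in Python, not part of Spotlight; the function names are hypothetical. It assumes the usual POSIX precedence for the character-type locale (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset) and only inspects the variable names, so it is no substitute for running `locale` itself.

```python
import os
from typing import Mapping, Optional

# Checked in POSIX precedence order for the character-type category.
LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def effective_ctype_locale(env: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty value among LC_ALL, LC_CTYPE, LANG."""
    for name in LOCALE_VARS:
        value = env.get(name)
        if value:
            return value
    return None


def looks_utf8(locale_name: Optional[str]) -> bool:
    """True if the locale name has a UTF-8 codeset, e.g. en_US.UTF-8."""
    if not locale_name:
        return False
    # Locale names look like language_TERRITORY.codeset@modifier.
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    current = effective_ctype_locale(os.environ)
    print(f"effective character locale: {current!r}, UTF-8: {looks_utf8(current)}")
```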
>> On Fri, May 16, 2014 at 12:55 PM, Alex Olieman <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Lately I've been tinkering with the raw data to see if I can create a
>>> model with a filtered set of disambiguation candidates, and that spots
>>> more lowercase surface forms. When I train the model on the modified
>>> data, however, I'm faced with character encoding issues.
>>>
>>> For example:
>>> >  INFO 2014-05-16 03:16:29,310 main [WikipediaToDBpediaClosure] - Done.
>>> >  INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Creating
>>> > SurfaceFormSource...
>>> >  INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Reading
>>> > annotated and total counts...
>>> > Exception in thread "main"
>>> > java.nio.charset.UnmappableCharacterException: Input length = 1
>>> >         at java.nio.charset.CoderResult.throwException(Unknown Source)
>>> >         at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
>>>
>>> I had expected these files to be encoded in utf-8, but it looks like
>>> this isn't the case. The chardet library tells me it is ISO-8859-2
>>> a.k.a. Latin-2 instead. Can someone tell me in which character encoding
>>> the raw data (pig output) files should be for db.CreateSpotlightModel to
>>> read them correctly? If this really should be one of the "Western"
>>> character sets, I would expect it to be Latin-1 instead.
>>>
>>> Best,
>>> Alex
>>>
>>>
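Encoding detectors such as chardet make a statistical guess, so a strict decode is a more direct way to answer the question asked above. The sketch below, in Python, reports the first line of a file that is not valid UTF-8; the function name `first_invalid_utf8_line` is hypothetical. If it returns None, the file itself is valid UTF-8 and the problem lies in how the reader is configured.

```python
from typing import Optional, Tuple


def first_invalid_utf8_line(path: str) -> Optional[Tuple[int, int, bytes]]:
    """Return (line_number, byte_offset_in_line, offending_bytes) for the
    first line that fails strict UTF-8 decoding, or None if all lines decode.

    Splitting on b"\\n" in binary mode is safe because the byte 0x0A never
    occurs inside a multi-byte UTF-8 sequence.
    """
    with open(path, "rb") as f:
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")  # strict error handling is the default
            except UnicodeDecodeError as err:
                return line_number, err.start, raw[err.start:err.end]
    return None
```

The returned line number and offset make it possible to inspect the offending record directly instead of re-encoding the whole file on a guess.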
>>>
>>> ------------------------------------------------------------------------------
>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
>>> Instantly run your Selenium tests across 300+ browser/OS combos.
>>> Get unparalleled scalability from the best Selenium testing platform
>>> available
>>> Simple to use. Nothing to install. Get started now for free."
>>> http://p.sf.net/sfu/SauceLabs
>>> _______________________________________________
>>> Dbp-spotlight-users mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
>>>
>>
>>
>>
>
>
>

