Hi,

Lately I've been tinkering with the raw data to see whether I can create a model with a filtered set of disambiguation candidates, one that spots more lowercase surface forms. When I train the model on the modified data, however, I run into character encoding issues.
For example:

> INFO 2014-05-16 03:16:29,310 main [WikipediaToDBpediaClosure] - Done.
> INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Creating SurfaceFormSource...
> INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Reading annotated and total counts...
> Exception in thread "main" java.nio.charset.UnmappableCharacterException: Input length = 1
>     at java.nio.charset.CoderResult.throwException(Unknown Source)
>     at sun.nio.cs.StreamDecoder.implRead(Unknown Source)

I had expected these files to be encoded in UTF-8, but that doesn't seem to be the case: the chardet library tells me they are ISO-8859-2 (a.k.a. Latin-2) instead.

Can someone tell me which character encoding the raw data (pig output) files should be in for db.CreateSpotlightModel to read them correctly? If it really should be one of the "Western" character sets, I would have expected Latin-1 rather than Latin-2.

Best,
Alex
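P.S. For what it's worth, this is roughly the check I ran with chardet, plus a re-encoding step. It's a minimal sketch: the file name is just a placeholder for one of the pig output files, and whether re-encoding to UTF-8 is actually the right fix is exactly what I'm asking above.

    import chardet

    # Placeholder path: point this at one of the raw data (pig output) files.
    path = "sfAndTotalCounts"

    # Read the raw bytes and let chardet guess the encoding.
    with open(path, "rb") as f:
        raw = f.read()
    guess = chardet.detect(raw)
    print(guess)  # on my files: {'encoding': 'ISO-8859-2', 'confidence': ...}

    # If the guess is not UTF-8/ASCII, decode with the guessed encoding
    # and write a UTF-8 copy alongside the original.
    if guess["encoding"] and guess["encoding"].lower() not in ("utf-8", "ascii"):
        text = raw.decode(guess["encoding"])
        with open(path + ".utf8", "w", encoding="utf-8") as out:
            out.write(text)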
