Hi,

Lately I've been tinkering with the raw data to see whether I can create a model with a filtered set of disambiguation candidates, one that spots more lowercase surface forms. When I train the model on the modified data, however, I run into character encoding issues.
For example:

> INFO 2014-05-16 03:16:29,310 main [WikipediaToDBpediaClosure] - Done.
> INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Creating SurfaceFormSource...
> INFO 2014-05-16 03:16:29,388 main [SurfaceFormSource$] - Reading annotated and total counts...
> Exception in thread "main" java.nio.charset.UnmappableCharacterException: Input length = 1
>     at java.nio.charset.CoderResult.throwException(Unknown Source)
>     at sun.nio.cs.StreamDecoder.implRead(Unknown Source)

I had expected these files to be encoded in UTF-8, but that doesn't seem to be the case: the chardet library tells me they are ISO-8859-2 (a.k.a. Latin-2) instead.

Can someone tell me which character encoding the raw data (pig output) files should be in for db.CreateSpotlightModel to read them correctly? If it really should be one of the "Western" character sets, I would have expected Latin-1 rather than Latin-2.

Best,
Alex
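P.S. For what it's worth, this is roughly the check I ran with chardet, plus a re-encoding step. It's a minimal sketch: the file name is just a placeholder for one of the pig output files, and whether re-encoding to UTF-8 is actually the right fix is exactly what I'm asking above.

    import chardet

    # Placeholder path: point this at one of the raw data (pig output) files.
    path = "sfAndTotalCounts"

    # Read the raw bytes and let chardet guess the encoding.
    with open(path, "rb") as f:
        raw = f.read()
    guess = chardet.detect(raw)
    print(guess)  # on my files: {'encoding': 'ISO-8859-2', 'confidence': ...}

    # If the guess is not UTF-8/ASCII, decode with the guessed encoding
    # and write a UTF-8 copy alongside the original.
    if guess["encoding"] and guess["encoding"].lower() not in ("utf-8", "ascii"):
        text = raw.decode(guess["encoding"])
        with open(path + ".utf8", "w", encoding="utf-8") as out:
            out.write(text)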
