Hi Andrea, Rupert,

Rupert, maybe you can help.

Summary: DBpedia backslash escaping has (most likely) been correct since 3.8. Stanbol / Jena can read the DBpedia 3.8 files fine if they are uncompressed first. It looks like Stanbol has a problem with bz2.
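
A minimal sketch of the pbzip2 point made in the quoted reply below. It assumes that the "slightly different format" means a file made of several complete bzip2 streams written back to back, which is how I read the COMPRESS-146 / COMPRESS-162 reports; I have not checked this against the DBpedia files themselves. The helper names and the sample triples are made up for illustration, and the sketch says nothing about what Stanbol or tdbloader2 actually do internally.

```python
import bz2


def make_multistream(chunks):
    """Compress each chunk as its own bzip2 stream and concatenate the results.

    Assumption: this imitates the layout of a pbzip2 file (several complete
    streams back to back); it does not call pbzip2 itself.
    """
    return b"".join(bz2.compress(chunk) for chunk in chunks)


def read_first_stream_only(data):
    """Decode like a reader that stops at the first end-of-stream marker."""
    decompressor = bz2.BZ2Decompressor()
    text = decompressor.decompress(data)
    # Bytes after the first stream are not decoded; they land in unused_data.
    return text, decompressor.unused_data


# Two made-up N-Triples lines, one per stream (hypothetical sample data).
chunks = [
    b'<http://example.org/a> <http://example.org/label> "eins"@de .\n',
    b'<http://example.org/b> <http://example.org/label> "zwei"@de .\n',
]
data = make_multistream(chunks)

# bz2.decompress handles concatenated streams (Python 3.3 and later).
full = bz2.decompress(data)

# A single BZ2Decompressor stops after the first stream.
first, rest = read_first_stream_only(data)

print(len(full), len(first), len(rest))
```

A reader of the second kind only sees the first chunk and silently drops the rest, so whether a given tool copes with such files depends on whether its bzip2 decoder continues past the first stream.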
https://issues.apache.org/jira/browse/STANBOL-804
http://markmail.org/message/67ivlyoxfqad6xoe

Cheers,
JC

On 21 March 2013 10:20, Andrea Di Menna <ninn...@gmail.com> wrote:
> Hi Jona,
>
> I compressed the nt file with bzip2
>
> andread@build04:~/tools/apache-jena-2.7.4/bin$ bzip2 --version
> bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
>
> Copyright (C) 1996-2010 by Julian Seward.
>
> This program is free software; you can redistribute it and/or modify
> it under the terms set out in the LICENSE file, which is included
> in the bzip2-1.0.6 source distribution.
>
> This program is distributed in the hope that it will be useful,
> but WITHOUT ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> LICENSE file for more details.
>
> Also, I now tried with the same file mentioned in the JIRA bug [1], using
> both Jena 2.7.4 and 2.10.0 tdbloader2, and got the following:
>
> 1) Same exception as below when running on bz2 file
> 2) No exception with uncompressed nt file
>
> But I remember seeing the same exceptions as the ones in the JIRA issue when
> using Stanbol indexing tool (which is building a TDB from source RDF files,
> before building the Solr index).
> It is likely then that the Stanbol code is not acting as the tdbloader2 when
> processing RDF files.
>
> WDYT?
>
> Cheers
> Andrea
>
> [1] http://downloads.dbpedia.org/3.8/es/redirects_es.nt.bz2
>
>
> 2013/3/21 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>
>> On 20 March 2013 20:10, Andrea Di Menna <ninn...@gmail.com> wrote:
>> > Hi Jona,
>> >
>> > I have tried loading labels_en_uris_de.nt.bz2 from the DBpedia 3.8 release
>> > using both Jena 2.7.4 and 2.10.0, but both fail with the following error:
>> >
>> > andread@build04:~/tools/apache-jena-2.10.0/bin$ ./tdbloader2 --loc . /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2
>> > 19:48:02 -- TDB Bulk Loader Start
>> > 19:48:02 Data phase
>> > INFO Load: /media/HD2/data/dbpedia-3.8-archive/source_data/labels_en_uris_de.nt.bz2 -- 2013/03/20 19:48:03 CET
>> > Exception in thread "main" org.apache.jena.atlas.AtlasException: java.nio.charset.MalformedInputException: Input length = 1
>> >     at org.apache.jena.atlas.io.IO.exception(IO.java:154)
>> >     at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:79)
>> >     at org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:156)
>> >     at org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:139)
>> >     at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:251)
>> >     at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:244)
>> >     at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:169)
>> >     at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:108)
>> >     at org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:41)
>> >     at org.apache.jena.riot.RiotReader.createParser(RiotReader.java:130)
>> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:115)
>> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:93)
>> >     at org.apache.jena.riot.RiotReader.parse(RiotReader.java:66)
>> >     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:162)
>> >     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
>> >     at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
>> >     at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
>> >     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:80)
>> > Caused by: java.nio.charset.MalformedInputException: Input length = 1
>> >     at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
>> >     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
>> >     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
>> >     at java.io.InputStreamReader.read(InputStreamReader.java:184)
>> >     at java.io.Reader.read(Reader.java:140)
>> >     ... 17 more
>> >
>> > Anyway, I have now tried the following:
>> >
>> > 1) Download german labels
>> > 2) Run tdbloader2 on the bz2 nt file -> failure
>> > 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
>> > 4) Compress the nt file again -> failure
>> >
>> > Looks like Jena is having some problems with bz2 files then.
>>
>> Interesting.
>>
>> Since 3.8, we use parallel bzip2 [1] to compress the files (it's much
>> faster on multi-core machines). The files created by pbzip2 have a
>> slightly different format though. Legal for bzip2, but for example
>> older versions of Commons Compress cannot deal with it [2][3].
>>
>> > 2) Run tdbloader2 on the bz2 nt file -> failure
>> > 3) Uncompress the bz2 file and run tdbloader2 -> SUCCESS
>>
>> This very much looks like compression is the culprit, not DBpedia
>> encoding.
>>
>> > 4) Compress the nt file again -> failure
>>
>> This is a bit weird. How do you compress the file?
>>
>>
>> Cheers,
>> JC
>>
>> [1] http://compression.ca/pbzip2/
>> [2] https://issues.apache.org/jira/browse/COMPRESS-146
>> [3] https://issues.apache.org/jira/browse/COMPRESS-162
>>
>> > Would you mind giving it a try?
>> >
>> > But anyway please check this JIRA issue out
>> > https://issues.apache.org/jira/browse/STANBOL-804
>> >
>> > Cheers
>> > Andrea
>> >
>> >
>> > 2013/3/20 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>> >>
>> >> Hi Andrea,
>> >>
>> >> there used to be encoding problems, but I think they are all fixed
>> >> since the 3.8 release. I tried very hard to make TurtleEscaper do the
>> >> right thing - I checked the relevant standards etc. Could you give an
>> >> example where Jena complains about a DBpedia 3.8 file?
>> >>
>> >> Cheers,
>> >> JC
>> >>
>> >> On Wed, Mar 20, 2013 at 6:16 PM, Andrea Di Menna <ninn...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I have been using Stanbol [1] to process DBpedia data files and build a
>> >> > dbpedia Solr index.
>> >> > Stanbol is using Jena TDB in order to load DBpedia files into a triple
>> >> > store.
>> >> > Unfortunately, almost all the DBpedia N-Triples files must be pre-processed
>> >> > before being able to import them using Jena [2].
>> >> >
>> >> > The following sed command is launched:
>> >> >
>> >> > sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
>> >> >
>> >> > Basically the backslash is replaced with the unicode character escape
>> >> > sequence.
>> >> >
>> >> > Do you think this should/could be fixed in
>> >> > org.dbpedia.extraction.util.TurtleEscaper#escapeTurtle ?
>> >> >
>> >> > Cheers
>> >> > Andrea
>> >> >
>> >> > [1] http://stanbol.apache.org/
>> >> > [2] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
>> >> >
>> >> > ------------------------------------------------------------------------------
>> >> > Everyone hates slow websites. So do we.
>> >> > Make your web apps faster with AppDynamics
>> >> > Download AppDynamics Lite for free today:
>> >> > http://p.sf.net/sfu/appdyn_d2d_mar
>> >> > _______________________________________________
>> >> > Dbpedia-discussion mailing list
>> >> > Dbpedia-discussion@lists.sourceforge.net
>> >> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
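
The sed pre-processing step quoted in the original message can be read as two substitutions: a doubled backslash becomes two \u005c escapes, and a single backslash followed by anything other than u or " becomes \u005c plus that character. Below is a minimal Python sketch of the same rewrite for trying it out on single lines. The function name escape_backslashes and the sample strings are made up for illustration; this is not code from the Stanbol script or from TurtleEscaper.

```python
import re


def escape_backslashes(line):
    """Apply the two sed substitutions from the thread to one line of text.

    Hypothetical helper: it mirrors
        s/\\\\/\\u005c\\u005c/g ; s/\\\([^u"]\)/\\u005c\1/g
    and expects the line without its trailing newline, as sed sees it.
    """
    # Step 1: two literal backslashes -> \u005c\u005c
    line = line.replace("\\\\", "\\u005c\\u005c")
    # Step 2: a backslash followed by one character other than 'u' or '"'
    # -> \u005c followed by that character. A function is used as the
    # replacement so that no backslash in it is treated as a regex escape.
    return re.sub(r'\\([^u"\n])', lambda m: "\\u005c" + m.group(1), line)


# Made-up sample literals, written as raw strings so the backslashes are literal.
samples = [r'"C:\\temp"', r'"tab\there"', r'"quote\"kept"', r'"caf\u00e9"']
for sample in samples:
    print(sample, "->", escape_backslashes(sample))
```

Note that step 2 also rewrites ordinary escapes such as \t or \n into a literal backslash plus letter, so the rewrite changes the meaning of such literals; it is a workaround for the loader, not a lossless re-encoding.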