These errors come from the SimpleWikiParser in the DBpedia Extraction
Framework (DEF), which DBpedia Spotlight reuses. We've always had them, and
we have lived with them so far. If they are causing serious problems for
you, let us know. And if you want to take a crack at fixing them, we'd
definitely welcome the contribution.
Cheers,
Pablo
On Thu, Nov 29, 2012 at 1:45 PM, Rafa Haro <[email protected]> wrote:
> Hi Pablo,
>
> I'm getting similar errors while parsing some Wikipedia articles. For
> instance:
>
> INFO 2012-11-28 10:11:22,555 main [FileOccurrenceSource$] - saved
> 11200000 occurrences
> nov 28, 2012 10:11:43 AM
> org.dbpedia.extraction.sources.WikipediaDumpParser readPage
> Advertencia: Error processing page
> title=S/mileage;ns=0/Main/;language:wiki=es,locale=es
> org.dbpedia.extraction.wikiparser.impl.simple.TooManyErrorsException: Too
> many errors at '|align="center"| 11 || {{nihongo|[[Suki yo, Junjou
> Hankouki.]] (好きよ、純情反抗期。) || ' (line: 120)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:111)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTableCell(SimpleWikiParser.scala:575)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTableRow(SimpleWikiParser.scala:557)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTable(SimpleWikiParser.scala:536)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:268)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTableCell(SimpleWikiParser.scala:575)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTableRow(SimpleWikiParser.scala:557)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTable(SimpleWikiParser.scala:536)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:268)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
> at
> org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.apply(SimpleWikiParser.scala:69)
> at
> org.dbpedia.spotlight.io.AllOccurrenceSource$AllOccurrenceSource$$anonfun$foreach$1.apply(AllOccurrenceSource.scala:82)
> at
> org.dbpedia.spotlight.io.AllOccurrenceSource$AllOccurrenceSource$$anonfun$foreach$1.apply(AllOccurrenceSource.scala:80)
> at
> org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:253)
> at
> org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:179)
> at
> org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:137)
> at
> org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:108)
> at
> org.dbpedia.extraction.sources.XMLReaderSource.foreach(XMLSource.scala:57)
> at
> org.dbpedia.spotlight.io.AllOccurrenceSource$AllOccurrenceSource.foreach(AllOccurrenceSource.scala:80)
> at
> org.dbpedia.spotlight.filter.Filter$FilteredOccs.foreach(Filter.scala:58)
> at
> org.dbpedia.spotlight.filter.Filter$FilteredOccs.foreach(Filter.scala:58)
> at
> org.dbpedia.spotlight.filter.Filter$FilteredOccs.foreach(Filter.scala:58)
> at
> org.dbpedia.spotlight.io.FileOccurrenceSource$.writeToFile(FileOccurrenceSource.scala:57)
> at
> org.dbpedia.spotlight.lucene.index.ExtractOccsFromWikipedia$.main(ExtractOccsFromWikipedia.scala:82)
> at
> org.dbpedia.spotlight.lucene.index.ExtractOccsFromWikipedia.main(ExtractOccsFromWikipedia.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at scala_maven_executions.MainHelper.runMain(MainHelper.java:164)
> at
> scala_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
>
> I suppose this is a completely different problem, but would it be possible
> to fix it too? Any clues?
>
> Regards
> On 29/11/12 13:25, Pablo N. Mendes wrote:
>
>
> Hi Rafa,
> I don't think so. The warning you got was very specific: "Illegal
> character in path at index 40" which is where the "\u5BCC" occurs. See:
>
> warning in NqParser.next on line 2364225 # <BAD URI: Illegal character
> in path at index 40:
> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
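For readers following along: the `\u5BCC`-style sequences in the URI above are escapes for non-ASCII characters, and a literal backslash is not a legal character in a URI path, which is consistent with the "Illegal character in path" warning. Below is a minimal Python sketch of decoding such escapes before any further processing. `decode_nt_escapes` is a hypothetical helper written for illustration; it is not part of NxParser, Any23, or DBpedia Spotlight, and the example URI is a shortened form of the one in the warning.

```python
import re

# Matches \uXXXX (4 hex digits) and \UXXXXXXXX (8 hex digits) escapes,
# the forms used for non-ASCII characters in N-Triples/N-Quads files.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_nt_escapes(text: str) -> str:
    """Replace \\uXXXX / \\UXXXXXXXX escapes with the characters they denote."""
    def _sub(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE.sub(_sub, text)


# Shortened version of the URI from the warning (assumption: illustration only).
escaped = r"http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F"
decoded = decode_nt_escapes(escaped)
print(decoded)
```

The decoded string no longer contains a backslash, but it does contain non-ASCII characters, so whether a given URI class accepts it still depends on that class's rules.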
>
> Great. We'd love to have you send us a pull request with the fixes for
> this. Max has produced a pretty detailed guide on how to contribute that
> takes all the roadblocks away from your path:
> https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Contributing
>
> Cheers,
> Pablo
>
>
>
> On Thu, Nov 29, 2012 at 1:21 PM, Rafa Haro <[email protected]> wrote:
>
>> Hi Pablo,
>>
>> Thanks for your response. Of course I don't mind changing it. Anyway, is it
>> possible that the issue was caused by using the Spanish namespaces
>> (http://es.dbpedia.org/resource/ and http://es.dbpedia.org/ontology/)
>> instead of the default ones?
>>
>> Thanks. Regards
>>
>> On 29/11/12 12:04, Pablo N. Mendes wrote:
>>
>>
>> Hi Rafa,
>> It looks like NxParser (or our code based on it) is failing to parse the
>> Unicode characters in your URIs. We now have Any23 in our dependencies, and
>> we have thought about dropping NxParser for good. But I am not sure Any23
>> will handle Unicode either, see:
>>
>> http://code.google.com/p/any23/source/browse/trunk/any23-core/src/main/java/org/deri/any23/parser/NQuadsParser.java?r=1305
>>
>> But Any23 is now incubating at Apache and has a growing community, so if
>> it doesn't work right off the bat, we could try to get help there to fix
>> their side of things.
>>
>> Would you like to give this a shot? It would be a matter of changing
>> the getTypesMap method to use Any23's iteration, rather than NxParser's.
>> See:
>>
>> https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/core/src/main/scala/org/dbpedia/spotlight/util/TypesLoader.scala#L82
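As a rough illustration of what such a loading loop has to cope with (comment lines such as `# started ...` and the occasional line that fails to parse), here is a minimal Python sketch. It is not the TypesLoader code and uses neither NxParser nor Any23; `load_types_map`, the simplified line pattern, and the `Example` resource URI are all assumptions made for illustration.

```python
import re
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified pattern for "<subject> <predicate> <object> ." lines with IRI terms only.
# Real N-Triples/N-Quads files also allow literals, blank nodes and a graph term.
_TRIPLE = re.compile(r"^<([^>]*)>\s*<([^>]*)>\s*<([^>]*)>\s*\.\s*$")


def load_types_map(lines):
    """Return (subject -> list of type IRIs, number of skipped lines)."""
    types = defaultdict(list)
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment such as "# started ..."
        match = _TRIPLE.match(line)
        if match is None:
            skipped += 1  # count malformed lines instead of aborting the load
            continue
        subject, predicate, obj = match.groups()
        if predicate == RDF_TYPE:
            types[subject].append(obj)
    return dict(types), skipped


# Hypothetical sample input: one comment, two type statements, one bad line.
sample = [
    "# started 2012-06-04T14:02:57Z",
    "<http://es.dbpedia.org/resource/Example> <" + RDF_TYPE + "> <http://dbpedia.org/ontology/Film> .",
    "<http://es.dbpedia.org/resource/Example> <" + RDF_TYPE + "> <http://schema.org/Movie> .",
    "this line is not a triple",
]
types_map, skipped = load_types_map(sample)
print(types_map, skipped)
```

Counting skipped lines rather than only logging them makes it easier to tell a handful of bad URIs apart from a file that failed to load at all.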
>>
>> Cheers,
>> Pablo
>>
>>
>> On Wed, Nov 28, 2012 at 6:46 PM, Rafa Haro <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have finally generated the indexes for Spanish. Checking them with Luke,
>>> I realized that my index *index-withSF-withTypes* doesn't contain the Type
>>> field. The AddTypesToIndex launcher apparently ran without any errors, just
>>> these warnings:
>>>
>>> [INFO] launcher 'AddTypesToIndex' selected =>
>>> org.dbpedia.spotlight.lucene.index.AddTypesToIndex
>>> INFO 2012-11-28 12:40:22,470 main [IndexingConfiguration] - Loading
>>> configuration file ../conf/indexing.properties
>>> INFO 2012-11-28 12:40:22,932 main [MergedOccurrencesContextSearcher] -
>>> Using index at:
>>> org.apache.lucene.store.MMapDirectory@/usr/local/spotlight/dbpedia_data/data/output/index-withSFlockFactory=org.apache.lucene.store.NativeFSLockFactory@7a06cf15
>>> INFO 2012-11-28 12:40:24,114 main [IndexEnricher] - Analyzer class:
>>> class org.apache.lucene.analysis.es.SpanishAnalyzer
>>> INFO 2012-11-28 12:40:24,219 main [TypesLoader$] - Loading types map...
>>> warning on line 1 # started 2012-06-04T14:02:57Z : cannot parse 0th
>>> element: # started 2012-06-04T14:02:57Z
>>> warning in NqParser.next on line 2364225 # <BAD URI: Illegal character
>>> in path at index 40:
>>> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <http://dbpedia.org/ontology/Film> .
>>> : cannot parse 0th element: # <BAD URI: Illegal character in path at index
>>> 40:
>>> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <http://dbpedia.org/ontology/Film> .
>>> warning in NqParser.next on line 2364226 # <BAD URI: Illegal character
>>> in path at index 40:
>>> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <http://schema.org/Movie> . : cannot parse
>>> 0th element: # <BAD URI: Illegal character in path at index 40:
>>> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <http://schema.org/Movie> .
>>> warning in NqParser.next on line 2364227 # <BAD URI: Illegal character
>>> in path at index 40:
>>> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <http://dbpedia.org/ontology/Work> .
>>> : cannot parse 0th element: # <BAD URI: Illegal character in path at index
>>> 40:
>>> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <http://dbpedia.org/ontology/Work> .
>>> warning in NqParser.next on line 2364228 # <BAD URI: Illegal character
>>> in path at index 40:
>>> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <http://schema.org/CreativeWork> . :
>>> cannot parse 0th element: # <BAD URI: Illegal character in path at index
>>> 40:
>>> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <http://schema.org/CreativeWork> .
>>> warning in NqParser.next on line 2364229 # <BAD URI: Illegal character
>>> in path at index 40:
>>> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <http://www.w3.org/2002/07/owl#Thing> .
>>> : cannot parse 0th element: # <BAD URI: Illegal character in path at
>>> index 40:
>>> http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <http://www.w3.org/2002/07/owl#Thing> .
>>> warning in NqParser.next on line 2725295 # completed
>>> 2012-06-04T14:31:53Z : cannot parse 0th element: # completed
>>> 2012-06-04T14:31:53Z
>>> INFO 2012-11-28 12:41:07,523 main [TypesLoader$] - Done. Loaded 2202361
>>> types.
>>> INFO 2012-11-28 12:41:07,530 main [IndexEnricher] - Adding types to
>>> index
>>> org.apache.lucene.store.MMapDirectory@/usr/local/spotlight/dbpedia_data/data/output/index-withSF-withTypeslockFactory=org.apache.lucene.store.NativeFSLockFactory@458d6f3d
>>> ...
>>> INFO 2012-11-28 12:41:07,612 main [IndexEnricher] - processed 0
>>> documents.
>>> INFO 2012-11-28 12:41:09,190 main [IndexEnricher] - processed 1000
>>> documents.
>>> ..........................
>>> ............................
>>>
>>> The process continues until it has processed 870019 documents, but the
>>> field still doesn't exist afterwards.
>>>
>>> Does anyone know what might be happening?
>>>
>>> Thanks in advance
>>>
>>> This message should be regarded as confidential. If you have received this
>>> email in error please notify the sender and destroy it immediately.
>>> Statements of intent shall only become binding when confirmed in hard copy
>>> by an authorised signatory.
>>>
>>> Zaizi Ltd is registered in England and Wales with the registration number
>>> 6440931. The Registered Office is 222 Westbourne Studios, 242 Acklam Road,
>>> London W10 5JJ, UK.
>>>
>>>
>>>
>>> _______________________________________________
>>> Dbp-spotlight-users mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
>>>
>>>
>>
>>
>> --
>>
>> Pablo N. Mendes
>> http://pablomendes.com
>>
>>
>
>
> --
>
> Pablo N. Mendes
> http://pablomendes.com
>
>
--
Pablo N. Mendes
http://pablomendes.com