Hi Pablo,
I'm getting similar errors while parsing some Wikipedia articles. For
instance:
INFO 2012-11-28 10:11:22,555 main [FileOccurrenceSource$] - saved 11200000 occurrences
nov 28, 2012 10:11:43 AM org.dbpedia.extraction.sources.WikipediaDumpParser readPage
Advertencia: Error processing page title=S/mileage;ns=0/Main/;language:wiki=es,locale=es
org.dbpedia.extraction.wikiparser.impl.simple.TooManyErrorsException: Too many errors at '|align="center"| 11 || {{nihongo|[[Suki yo, Junjou Hankouki.]] (??????????) || ' (line: 120)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:111)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTableCell(SimpleWikiParser.scala:575)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTableRow(SimpleWikiParser.scala:557)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTable(SimpleWikiParser.scala:536)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:268)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseProperty(SimpleWikiParser.scala:468)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTemplate(SimpleWikiParser.scala:446)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:264)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTableCell(SimpleWikiParser.scala:575)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTableRow(SimpleWikiParser.scala:557)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseTable(SimpleWikiParser.scala:536)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.createNode(SimpleWikiParser.scala:268)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.parseUntil(SimpleWikiParser.scala:194)
    at org.dbpedia.extraction.wikiparser.impl.simple.SimpleWikiParser.apply(SimpleWikiParser.scala:69)
    at org.dbpedia.spotlight.io.AllOccurrenceSource$AllOccurrenceSource$$anonfun$foreach$1.apply(AllOccurrenceSource.scala:82)
    at org.dbpedia.spotlight.io.AllOccurrenceSource$AllOccurrenceSource$$anonfun$foreach$1.apply(AllOccurrenceSource.scala:80)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:253)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:179)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:137)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:108)
    at org.dbpedia.extraction.sources.XMLReaderSource.foreach(XMLSource.scala:57)
    at org.dbpedia.spotlight.io.AllOccurrenceSource$AllOccurrenceSource.foreach(AllOccurrenceSource.scala:80)
    at org.dbpedia.spotlight.filter.Filter$FilteredOccs.foreach(Filter.scala:58)
    at org.dbpedia.spotlight.filter.Filter$FilteredOccs.foreach(Filter.scala:58)
    at org.dbpedia.spotlight.filter.Filter$FilteredOccs.foreach(Filter.scala:58)
    at org.dbpedia.spotlight.io.FileOccurrenceSource$.writeToFile(FileOccurrenceSource.scala:57)
    at org.dbpedia.spotlight.lucene.index.ExtractOccsFromWikipedia$.main(ExtractOccsFromWikipedia.scala:82)
    at org.dbpedia.spotlight.lucene.index.ExtractOccsFromWikipedia.main(ExtractOccsFromWikipedia.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at scala_maven_executions.MainHelper.runMain(MainHelper.java:164)
    at scala_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
I suppose this is a completely different problem, but would it be
possible to fix it too? Any clues?
Regards
On 29/11/12 13:25, Pablo N. Mendes wrote:
Hi Rafa,
I don't think so. The warning you got was very specific: "Illegal
character in path at index 40", which is where the "\u5BCC" occurs. See:
warning in NqParser.next on line 2364225 # <BAD URI: Illegal character
in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
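To make the failure mode concrete: the subject in the dump still carries N-Triples-style `\uXXXX` escapes, and a literal backslash is not a legal character in a URI path, which matches the shape of the warning above. The Python sketch below shows one possible pre-processing step: decode the escapes into real characters, then percent-encode the result as UTF-8. It is only an illustration; the helper names `decode_ntriples_escapes` and `to_ascii_uri` are made up here and are not part of NxParser, Any23, or DBpedia Spotlight.

```python
import re
from urllib.parse import quote

# N-Triples escapes: backslash-u with 4 hex digits, backslash-U with 8 hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_ntriples_escapes(text: str) -> str:
    """Replace each escape sequence with the character it denotes (hypothetical helper)."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def to_ascii_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8, leaving URI delimiters untouched."""
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%")


# The subject from the warning, exactly as it appears in the dump.
raw = ("http://es.dbpedia.org/resource/Tomie__"
       r"\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E"
       r"\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1")

iri = decode_ntriples_escapes(raw)   # real Unicode characters, no backslashes
uri = to_ascii_uri(iri)              # pure-ASCII, percent-encoded form

print(iri)
print(uri)
```

Whether the index should store the decoded IRI or the percent-encoded URI depends on how the rest of the pipeline represents resources, so treat the choice between `iri` and `uri` as open.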
Great. We'd love to have you send us a pull request with the fixes for
this. Max has written a pretty detailed guide on how to contribute
that should clear the roadblocks from your path:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Contributing
Cheers,
Pablo
On Thu, Nov 29, 2012 at 1:21 PM, Rafa Haro <[email protected]> wrote:
Hi Pablo,
Thanks for your response. Of course, I don't mind changing it.
Still, could the issue be caused by using the "Spanish" namespaces
(http://es.dbpedia.org/resource/ and http://es.dbpedia.org/ontology/)
instead of the default ones?
Thanks. Regards
On 29/11/12 12:04, Pablo N. Mendes wrote:
Hi Rafa,
It looks like NxParser (or our code based on it) is failing to
parse the Unicode characters in your URIs. We now have Any23 in
our dependencies, and we have thought about dropping NxParser for good.
But I am not sure Any23 handles Unicode either, see:
http://code.google.com/p/any23/source/browse/trunk/any23-core/src/main/java/org/deri/any23/parser/NQuadsParser.java?r=1305
But Any23 is now in the Apache Incubator and has a growing community,
so if it doesn't work right off the bat, we could try to get help
there to fix their side of things.
Would you like to give this a shot? It would be a matter of
changing the getTypesMap method to use Any23's iteration, rather
than NxParser's. See:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/core/src/main/scala/org/dbpedia/spotlight/util/TypesLoader.scala#L82
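As a rough picture of what a replacement loader has to do, here is a small Python sketch that builds a resource-to-types map from N-Triples lines. It skips the `# started ...` and `# completed ...` comment lines that show up in the warnings and decodes `\uXXXX` escapes before storing the subject. This is not the Any23 or NxParser API and not the actual `getTypesMap` code; `load_types_map` is a hypothetical helper, the regex only handles triples made of three `<...>` terms, and the sample subject is shortened for readability.

```python
import re
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Matches one triple whose three terms are all <...> IRIs (no literals, no blank nodes).
_TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def _unescape(text):
    """Decode N-Triples unicode escapes into real characters."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def load_types_map(lines):
    """Return ({resource IRI: [type IRI, ...]}, number of unparseable lines)."""
    types = defaultdict(list)
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and dump comments are not errors
        match = _TRIPLE.match(line)
        if match is None or match.group(2) != RDF_TYPE:
            skipped += 1
            continue
        subject, _, obj = (_unescape(term) for term in match.groups())
        types[subject].append(obj)
    return dict(types), skipped


# Illustrative input modelled on the dump lines quoted in this thread.
sample = [
    "# started 2012-06-04T14:02:57Z",
    r"<http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F__1> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Film> .",
    r"<http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F__1> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Movie> .",
    "# completed 2012-06-04T14:31:53Z",
]

types_map, skipped = load_types_map(sample)
print(types_map, skipped)
```

Counting skipped lines instead of only logging them makes it easy to compare the number of loaded types against the number of triples in the dump.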
Cheers,
Pablo
On Wed, Nov 28, 2012 at 6:46 PM, Rafa Haro <[email protected]> wrote:
Hi,
I have finally generated the indexes for Spanish. Checking
them with Luke, I noticed that my index
/index-withSF-withTypes/ doesn't contain the Type field.
The AddTypesToIndex launcher apparently ran without any
errors, just these warnings:
[INFO] launcher 'AddTypesToIndex' selected => org.dbpedia.spotlight.lucene.index.AddTypesToIndex
INFO 2012-11-28 12:40:22,470 main [IndexingConfiguration] - Loading configuration file ../conf/indexing.properties
INFO 2012-11-28 12:40:22,932 main [MergedOccurrencesContextSearcher] - Using index at: org.apache.lucene.store.MMapDirectory@/usr/local/spotlight/dbpedia_data/data/output/index-withSFlockFactory=org.apache.lucene.store.NativeFSLockFactory@7a06cf15
INFO 2012-11-28 12:40:24,114 main [IndexEnricher] - Analyzer class: class org.apache.lucene.analysis.es.SpanishAnalyzer
INFO 2012-11-28 12:40:24,219 main [TypesLoader$] - Loading types map...
warning on line 1 # started 2012-06-04T14:02:57Z : cannot parse 0th element: # started 2012-06-04T14:02:57Z
warning in NqParser.next on line 2364225 # <BAD URI: Illegal character in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Film> . : cannot parse 0th element: # <BAD URI: Illegal character in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Film> .
warning in NqParser.next on line 2364226 # <BAD URI: Illegal character in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Movie> . : cannot parse 0th element: # <BAD URI: Illegal character in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Movie> .
warning in NqParser.next on line 2364227 # <BAD URI: Illegal character in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Work> . : cannot parse 0th element: # <BAD URI: Illegal character in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Work> .
warning in NqParser.next on line 2364228 # <BAD URI: Illegal character in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/CreativeWork> . : cannot parse 0th element: # <BAD URI: Illegal character in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/CreativeWork> .
warning in NqParser.next on line 2364229 # <BAD URI: Illegal character in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> . : cannot parse 0th element: # <BAD URI: Illegal character in path at index 40:
http://es.dbpedia.org/resource/Tomie__\u5BCC\u6C5F\u3000\u6700\u7D42\u7AE0\uFF5E\u7981\u65AD\u306E\u679C\u5B9F\uFF5E__1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
warning in NqParser.next on line 2725295 # completed 2012-06-04T14:31:53Z : cannot parse 0th element: # completed 2012-06-04T14:31:53Z
INFO 2012-11-28 12:41:07,523 main [TypesLoader$] - Done. Loaded 2202361 types.
INFO 2012-11-28 12:41:07,530 main [IndexEnricher] - Adding types to index org.apache.lucene.store.MMapDirectory@/usr/local/spotlight/dbpedia_data/data/output/index-withSF-withTypeslockFactory=org.apache.lucene.store.NativeFSLockFactory@458d6f3d...
INFO 2012-11-28 12:41:07,612 main [IndexEnricher] - processed 0 documents.
INFO 2012-11-28 12:41:09,190 main [IndexEnricher] - processed 1000 documents.
..........................
............................
The process continues until it has processed 870019 documents, but
the field still doesn't exist afterwards.
Does anyone know what might be happening?
Thanks in advance
This message should be regarded as confidential. If you have received
this email in error please notify the sender and destroy it immediately.
Statements of intent shall only become binding when confirmed in hard copy by
an authorised signatory.
Zaizi Ltd is registered in England and Wales with the registration
number 6440931. The Registered Office is 222 Westbourne Studios, 242 Acklam
Road, London W10 5JJ, UK.
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
--
Pablo N. Mendes
http://pablomendes.com