I noticed that the "labels_en" file has duplicate rows, something
that wasn't the case in the previous release.
I found 157 of these, but here's a particularly annoying one:
$ bzcat ~/dbpedia_3.5.1/labels_en.nt.bz2 | grep '/resource/SS>'
<http://dbpedia.org/resource/SS> <http://www.w3.org/2000/01/rdf-schema#label> "SS"@en .
<http://dbpedia.org/resource/SS> <http://www.w3.org/2000/01/rdf-schema#label> "SS"@en .
There are 120k lines between those two in the log file, so I've got no
idea what the etiology of this is.
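
To see how far apart the repeats sit, something like the following could list every duplicated line with the line numbers where it occurs. This is only a rough sketch: the function names are made up for illustration, it treats two rows as duplicates only when the whole line matches, and it holds every distinct line in memory, which assumes the labels file fits in RAM.

```python
import bz2
from collections import defaultdict


def find_duplicate_lines(lines):
    """Return {line: [1-based line numbers]} for lines that occur more than once."""
    positions = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if line:  # ignore blank lines
            positions[line].append(lineno)
    return {line: nums for line, nums in positions.items() if len(nums) > 1}


def report_duplicates(path):
    """Print each duplicated line with its positions and the gaps between them."""
    # bz2.open in text mode decompresses on the fly, like bzcat.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        dups = find_duplicate_lines(f)
    for line, nums in sorted(dups.items(), key=lambda kv: kv[1][0]):
        gaps = [b - a for a, b in zip(nums, nums[1:])]
        print(nums, gaps, line)
```

Calling report_duplicates() on the labels_en.nt.bz2 path would show whether the gaps cluster around particular distances, which might hint at where the repeats come from.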
I liked the old alphabetical order: it was very efficient to build
a clustered index on with a minimum of I/O. ;-) As it is I'll probably
crank up my memory limit, keep a hashtable of the resource URLs I've
seen, and expect the index build at the end to take a little more time...
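
For what it's worth, the hashtable approach could look roughly like the sketch below. It is not what I actually run: the function name is hypothetical, and it assumes each resource should end up with exactly one label, so a second triple for the same resource URL is dropped even if its label text differs.

```python
import bz2


def first_triple_per_resource(lines):
    """Yield only the first line seen for each subject (resource URL)."""
    seen = set()  # resource URLs already emitted
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        subject = parts[0]  # e.g. <http://dbpedia.org/resource/SS>
        if subject in seen:
            continue  # duplicate resource, drop it
        seen.add(subject)
        yield line


def dedupe_file(src_path, dst_path):
    """Stream a .nt.bz2 file and write the de-duplicated triples as plain text."""
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(first_triple_per_resource(src))
```

The set grows with the number of distinct resources, which is why the memory limit has to go up, and since the output stays in the file's original (unsorted) order, the index build at the end still has to do the sorting.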