I noticed that the "labels_en" file has duplicate rows,  something 
that wasn't the case in the previous release.

    I found 157 of these but here's a particularly annoying one:

$ bzcat ~/dbpedia_3.5.1/labels_en.nt.bz2 | grep '/resource/SS>'
<http://dbpedia.org/resource/SS> <http://www.w3.org/2000/01/rdf-schema#label> "SS"@en .
<http://dbpedia.org/resource/SS> <http://www.w3.org/2000/01/rdf-schema#label> "SS"@en .

    There are 120k lines between those two rows in the file,  so I've got no 
idea what the etiology of this is.
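
    For anyone who wants to check this themselves, here is a rough sketch of how the duplicates and the distance between them could be found. It is not the exact script I used; the function names are made up for illustration, and it assumes one triple per line and enough memory to hold every distinct line.

```python
import bz2
from collections import defaultdict


def duplicate_positions(lines):
    """Map each line seen more than once to the 1-based line numbers where it occurs."""
    positions = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and comment lines; they are not triples.
        if line and not line.startswith("#"):
            positions[line].append(number)
    return {line: where for line, where in positions.items() if len(where) > 1}


def scan_dump(path):
    """Hypothetical helper: stream a bzip2-compressed N-Triples file and report duplicates."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return duplicate_positions(handle)
```

    Calling scan_dump() on the labels_en.nt.bz2 path returns a dict; subtracting the first line number from the last for each entry shows how far apart the copies are.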

    I liked the old alphabetical order:  it was very efficient to build 
a clustered index on with a minimum of I/O.  ;-) As it is I'll probably 
crank up my memory limit,  keep a hashtable of the resource URLs I've 
seen,  and expect the index build at the end to take a little more time...
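
    The hashtable workaround would look roughly like the sketch below. This is only an illustration under my own assumptions: the function names are hypothetical, the subject URL is taken to be everything before the first space on a line, and only the first row per resource is kept, which is fine for exact duplicates but would also drop a second, different label for the same resource.

```python
import bz2


def first_row_per_resource(lines):
    """Yield only the first row seen for each subject URL, in input order."""
    seen = set()  # subject URLs already emitted; this grows with the number of resources
    for line in lines:
        subject = line.split(" ", 1)[0]
        # Anything that does not start with an IRI (blank line, comment) is skipped.
        if not subject.startswith("<"):
            continue
        if subject in seen:
            continue
        seen.add(subject)
        yield line


def dedupe_dump(src_path, dst_path):
    """Hypothetical helper: copy a .nt.bz2 dump, dropping repeated resources."""
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
         bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(first_row_per_resource(src))
```

    Since the output keeps the input order rather than alphabetical order, the index build afterwards still has to do the sorting.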
