Hi Anthony,
On 03/16/2013 04:54 AM, Anthony Lalande wrote:
Hi,
I have been working with a few select files from the v3.8 dump of
DBpedia, and have noticed duplicate entries in one of the files,
*images_en.nt*.
The entire file is 1.4 GiB, and contains 7 370 587 lines.
I came across one statement, which is present in this file 10 times:
<http://upload.wikimedia.org/wikipedia/commons/3/32/CentralMichiganChippewas.png>
<http://purl.org/dc/elements/1.1/rights>
<http://en.wikipedia.org/wiki/File:CentralMichiganChippewas.png> .
This one statement is present on lines:
2997045
3588625
5294480
5424560
5798660
5910525
6009955
6516790
6894525
7338075
Can someone tell me why this is? There may be other instances, but
I've only come across this one, and I wanted to check with the
community-at-large to see if this is known and/or intentional.
If you look at resources [1] and [2], for example, you will notice that
both of them refer to the same image, i.e. "CentralMichiganChippewas.png".
When the image extractor runs on each of those pages, it extracts the image
along with its rights, so the rights statement for that image is emitted
once per page that uses it. These two pages alone account for two of the
copies.
You can use the method described in [3] to remove the duplicates from
the file.
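
As a rough illustration of that kind of cleanup (this is not the method
from [3], only a sketch in Python), the function below keeps the first copy
of each line and drops later repeats. The function name and the file names
in the usage note are hypothetical. It assumes one statement per line, as in
an N-Triples dump, and it stores a 16-byte hash of each line instead of the
line itself to keep memory use down on a file of this size; a hash collision
is extremely unlikely but not impossible.

```python
import hashlib


def dedupe_lines(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first copy of each line.

    Returns a (lines_read, lines_written) tuple.
    """
    seen = set()  # 16-byte digests of the lines written so far
    lines_read = 0
    lines_written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            lines_read += 1
            # Ignore the line terminator so "\n" and "\r\n" variants match.
            key = hashlib.blake2b(line.rstrip(b"\r\n"), digest_size=16).digest()
            if key in seen:
                continue  # duplicate statement, skip it
            seen.add(key)
            dst.write(line)
            lines_written += 1
    return lines_read, lines_written
```

It could be called as dedupe_lines("images_en.nt", "images_en.dedup.nt").
If the order of the statements does not matter, "sort -u" on the command
line should give the same set of lines.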
Thanks,
- A
Hope that helps.
[1] http://dbpedia.org/page/2010_Central_Michigan_Chippewas_football_team
[2] http://dbpedia.org/page/2011_Central_Michigan_Chippewas_football_team
[3] http://www.unix.com/shell-programming-scripting/20364-remove-duplicate-lines-file.html
--
Kind Regards
Mohamed Morsey
Department of Computer Science
University of Leipzig
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion