Hi Anthony,

On 03/16/2013 04:54 AM, Anthony Lalande wrote:
Hi,

I have been working with a few select files from the v3.8 dump of DBpedia, and have noticed duplicate entries in one of the files, *images_en.nt*.

The entire file is 1.4 GiB, and contains 7 370 587 lines.

I came across one statement, which is present in this file 10 times:
<http://upload.wikimedia.org/wikipedia/commons/3/32/CentralMichiganChippewas.png>
  <http://purl.org/dc/elements/1.1/rights>
  <http://en.wikipedia.org/wiki/File:CentralMichiganChippewas.png> .


This one statement is present on lines:
  2997045
  3588625
  5294480
  5424560
  5798660
  5910525
  6009955
  6516790
  6894525
  7338075


Can someone tell me why this is? There may be other instances, but I've only come across this one, and I wanted to check with the community-at-large to see if this is known and/or intentional.

If you look at resources [1] and [2], for example, you will notice that both of them use the same image, "CentralMichiganChippewas.png".
When the image extractor processes a page, it extracts the image along with its rights statement.
So the rights statement for that image is extracted once for [1] and again for [2], and presumably once more for every other page that uses the same image, which would account for the repeated lines.
You can use the method described in [3] to remove the duplicates from the file.
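
If you would rather not use a shell one-liner, the same idea can be sketched in Python. This is only a rough sketch and not part of the DBpedia tooling: the function names unique_lines and dedupe_file and the output file name are made up for illustration. It keeps the first occurrence of each line and preserves the original order, and it stores a digest of each line instead of the line itself, which should keep memory use moderate for a file of this size.

```python
import hashlib
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-occurrence order."""
    seen = set()
    for line in lines:
        # Compare on a fixed-size digest of the line (without its trailing
        # newline) so the set does not hold every full statement in memory.
        key = hashlib.sha256(line.rstrip("\n").encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield line


def dedupe_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, skipping lines that were already written."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(unique_lines(src))
```

For example, dedupe_file("images_en.nt", "images_en.dedup.nt") would write a copy without the repeated statements (the second file name is just an assumption). Note that this compares lines as text, so two statements that differ only in whitespace would still be treated as different.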


Thanks,
- A

Hope that helps.


[1] http://dbpedia.org/page/2010_Central_Michigan_Chippewas_football_team
[2] http://dbpedia.org/page/2011_Central_Michigan_Chippewas_football_team
[3] http://www.unix.com/shell-programming-scripting/20364-remove-duplicate-lines-file.html

--
Kind Regards
Mohamed Morsey
Department of Computer Science
University of Leipzig

_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
