Hi Dario,
The dataset you are using is extracted by
org.dbpedia.extraction.mappings.PageLinksExtractor [1].
This extractor collects internal wiki links [2] from Wikipedia content
articles (that is, Wikipedia pages which belong to the Main namespace [3])
to other Wikipedia pages. Please note that the link targets are not
restricted to content articles: links to pages in the File or Category
namespaces are collected as well.
Each row (a triple <subject> <predicate> <object>) in the Pagelinks dataset
represents a directed link between two pages, e.g.
<http://dbpedia.org/resource/Albedo>
<http://dbpedia.org/ontology/wikiPageWikiLink>
<http://dbpedia.org/resource/Latin> .
means that an internal link to http://en.wikipedia.org/wiki/Latin was
found in http://en.wikipedia.org/wiki/Albedo.
You can check that this link exists here (first sentence) [6].
In a directed graph, this is modeled as an edge "Albedo -> Latin".
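
For illustration, here is a minimal Python sketch of reading such a row as a
graph edge. It is not part of the extraction framework: the function name
parse_edge and the regular expression are my own assumptions, and it only
handles rows shaped like the example above.

```python
import re

# Assumed row shape: <subject> <predicate> <object> .
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')
RESOURCE_PREFIX = "http://dbpedia.org/resource/"
LINK_PREDICATE = "http://dbpedia.org/ontology/wikiPageWikiLink"


def parse_edge(line):
    """Return (source, target) page titles, or None if the row is not a page link."""
    match = TRIPLE_RE.match(line.strip())
    if match is None:
        return None
    subject, predicate, obj = match.groups()
    if predicate != LINK_PREDICATE:
        return None
    if not (subject.startswith(RESOURCE_PREFIX) and obj.startswith(RESOURCE_PREFIX)):
        return None
    # Strip the common prefix so that nodes are identified by page title.
    return subject[len(RESOURCE_PREFIX):], obj[len(RESOURCE_PREFIX):]


line = ("<http://dbpedia.org/resource/Albedo> "
        "<http://dbpedia.org/ontology/wikiPageWikiLink> "
        "<http://dbpedia.org/resource/Latin> .")
print(parse_edge(line))  # ('Albedo', 'Latin')
```

The returned pair is the directed edge: the first element is the page that
contains the link, the second is the page it points to.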
The reason why you have 17M instances (I suppose you are counting the nodes
in your graph) is that the object of a triple can be outside the Main
namespace.
As far as I remember, the 4M articles are the wiki pages which belong to the
Main namespace and which are neither redirects [4] nor disambiguation pages [5].
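
To see the effect on your node count, here is a rough Python sketch that counts
all nodes of a link graph and, separately, only those that look like Main
namespace pages. The prefix list and the helper names are my assumptions, the
list is incomplete, and the edges are toy data. Redirects and disambiguation
pages cannot be recognized from the title alone, so this sketch does not
filter them.

```python
# Assumed, incomplete list of non-Main namespace prefixes (illustration only).
NON_MAIN_PREFIXES = ("File:", "Category:", "Template:", "Wikipedia:", "Help:", "Portal:")


def is_main_namespace(title):
    """Heuristic: a title without a known namespace prefix is treated as Main."""
    return not title.startswith(NON_MAIN_PREFIXES)


def node_counts(edges):
    """Return (number of all nodes, number of nodes that look like Main pages)."""
    nodes = set()
    for source, target in edges:
        nodes.add(source)
        nodes.add(target)
    main_nodes = {title for title in nodes if is_main_namespace(title)}
    return len(nodes), len(main_nodes)


# Toy edges; the File and Category titles are made up for this example.
edges = [
    ("Albedo", "Latin"),
    ("Albedo", "File:Example.png"),
    ("Latin", "Category:Example_category"),
]
total, main_only = node_counts(edges)
print(total, main_only)  # 4 2
```

Counting every subject and object gives the larger number; restricting the
count to Main namespace titles moves it closer to the number of articles.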
Hope this clarifies a bit :-)
Cheers
Andrea
[1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/PageLinksExtractor.scala
[2] https://en.wikipedia.org/wiki/Help:Link
[3] https://en.wikipedia.org/wiki/Wikipedia:Main_namespace
[4] https://en.wikipedia.org/wiki/Wikipedia:Redirect
[5] https://en.wikipedia.org/wiki/Wikipedia:Disambiguation
[6] https://en.wikipedia.org/wiki/Albedo
2013/12/2 Dario Garcia Gasulla <dar...@lsi.upc.edu>
> Hi,
>
> I'm Dario Garcia-Gasulla, an AI researcher at Barcelona Tech (UPC).
>
> I'm currently doing research on very large directed graphs and I am using
> one of your datasets for testing. Concretely, I am using the "Wikipedia
> Pagelinks" dataset as available in the DBpedia web site.
>
> Unfortunately the description of the dataset is not very detailed:
> Wikipedia Pagelinks *Dataset containing internal links between DBpedia
> instances. The dataset was created from the internal links between
> Wikipedia articles. The dataset might be useful for structural analysis,
> data mining or for ranking DBpedia instances using Page Rank or similar
> algorithms.*
>
> I wonder if you could give me more information on how the dataset was
> built and what composes it.
> I understand Wikipedia has 4M articles and 31M pages, while this dataset
> has 17M instances and 130M links (couldn't find the number of links of
> Wikipedia).
>
> What's the relation between both? Could someone briefly explain the nature
> of the Pagelinks dataset and the differences with the Wikipedia?
>
> Thank you for your time,
> Dario.
>
>
> _______________________________________________
> Dbpedia-discussion mailing list
> Dbpedia-discussion@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
>