Re: [Dbpedia-discussion] Pagelinks dataset
2013/12/4 Paul Houle <ontolo...@gmail.com>:

> I think I could get this data out of some API, but there are great HTML 5 parsing libraries now, so a link extractor for HTML can be built as quickly as an API client. There are two big advantages of looking at links in the HTML: (i) you can use the same software to analyze multiple sites, and (ii) the HTML output is often the most-tested output of a system. This matters particularly for Wikipedia markup, which has no formal specification: editors aren't concerned with whether the markup is clean, but they will fix problems that make the HTML look wrong.
>
> Another advantage of HTML is that you can work from a static dump file,

Where can you get such a dump from?

> or run a web crawler against the real Wikipedia

That seems not practical.

> or against a local copy of Wikipedia loaded from the database dump files.

Pretty slow, isn't it?

Cheers!
Andrea
Re: [Dbpedia-discussion] Pagelinks dataset
@Paul, unfortunately HTML Wikipedia dumps are not released anymore (the ones that exist are old static dumps, as you said). This is a problem for a project like DBpedia, as you can easily understand.

Moreover, I did not mean that it is impossible to crawl the Wikipedia instances or to load a dump into a private MediaWiki instance (the latter is what happens when abstracts are extracted); I am just saying that this is probably not practical for a project like DBpedia, which extracts data from multiple Wikipedias.

Cheers
Andrea

2013/12/5 Paul Houle <ontolo...@gmail.com>:

> @Andrea, there are old static dumps available, but I can say that running the web crawler is not at all difficult. I got a list of topics by looking at the ?s for the DBpedia descriptions and then wrote a very simple single-threaded crawler that took a few days to run on a micro instance in AWS. The main key to writing a successful web crawler is keeping it simple.
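A minimal sketch of the kind of "keep it simple" single-threaded crawler Paul describes might look like the following. Python is assumed; the user-agent string, delay, and output layout are illustrative guesses, not details of his actual crawler.

    import os
    import time
    import urllib.parse
    import urllib.request

    def crawl(titles, out_dir="pages", delay=1.0):
        """Fetch the rendered HTML of each article, one request at a time."""
        os.makedirs(out_dir, exist_ok=True)
        for title in titles:
            url = "https://en.wikipedia.org/wiki/" + urllib.parse.quote(title)
            req = urllib.request.Request(
                url, headers={"User-Agent": "simple-link-crawler/0.1 (research use)"})
            try:
                with urllib.request.urlopen(req) as resp:
                    html = resp.read()
            except Exception as exc:
                print("skipping", title, exc)   # keep going on individual failures
                continue
            name = urllib.parse.quote(title, safe="") + ".html"
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(html)
            time.sleep(delay)                   # single-threaded and polite

    # The topic list would come from the subjects of the DBpedia descriptions,
    # e.g. crawl(["Albedo", "Latin"])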
Re: [Dbpedia-discussion] Pagelinks dataset
The DBpedia way of extracting the citations would probably be to build something that treats citations the way infoboxes are treated. It's one way of doing things, and it has its own integrity, but it's not the way I do things. (DBpedia does it this way about as well as it can be done; why try to beat it?)

A few years back I wrote a very elaborate Wikipedia markup parser in .NET; it used a recursive descent parser and lots and lots of heuristics to deal with special cases. The purpose of it was to accurately parse author and licensing metadata from Wikimedia Commons when ingesting images into Ookaboo. I had to handle all those special cases because Wikipedia markup doesn't have a formal spec. I quickly ran into diminishing returns, where I had to work harder and harder to improve recall while getting deteriorating results.

I later wrote a very simple parser for Flickr which just parsed the HTML and took advantage of the cool URIs published by Flickr. Today I think of it as pretending that the Linked Data revolution has already arrived, because if you look at the link graph of Flickr, there is a subset of it which isn't very different from the link graph of Ookaboo. Anyway, I needed to pull some stuff out of Wikimedia Commons, and it took me 20 minutes to modify the Flickr parser to work for Commons and get at least 80% of the recall that the old parser got.

On Thu, Dec 5, 2013 at 10:29 AM, Andrea Di Menna <ninn...@gmail.com> wrote:

> @Paul, unfortunately HTML Wikipedia dumps are not released anymore (the ones that exist are old static dumps, as you said). This is a problem for a project like DBpedia, as you can easily understand.
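As an illustration of the HTML-first approach Paul describes for Flickr and Commons, a sketch along these lines could pull licensing links straight out of a rendered Commons file page. Python with BeautifulSoup is assumed, and the creativecommons.org heuristic is an illustrative stand-in, not the Ookaboo parser.

    import urllib.request
    from bs4 import BeautifulSoup

    def license_links(file_page_url):
        """Return outbound links on a Commons file page that point at a
        Creative Commons license deed -- the rendered HTML already carries them."""
        req = urllib.request.Request(
            file_page_url, headers={"User-Agent": "html-metadata-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            soup = BeautifulSoup(resp.read(), "html.parser")
        return sorted({a["href"] for a in soup.find_all("a", href=True)
                       if "creativecommons.org/licenses/" in a["href"]})

    # e.g. license_links("https://commons.wikimedia.org/wiki/File:Example.jpg")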
Re: [Dbpedia-discussion] Pagelinks dataset
I think I could get this data out of some API, but there are great HTML 5 parsing libraries now, so a link extractor for HTML can be built as quickly as an API client. There are two big advantages of looking at links in the HTML: (i) you can use the same software to analyze multiple sites, and (ii) the HTML output is often the most-tested output of a system. This matters particularly for Wikipedia markup, which has no formal specification: editors aren't concerned with whether the markup is clean, but they will fix problems that make the HTML look wrong.

Another advantage of HTML is that you can work from a static dump file, or run a web crawler against the real Wikipedia, or against a local copy of Wikipedia loaded from the database dump files.

On Tue, Dec 3, 2013 at 2:30 PM, Andrea Di Menna <ninn...@gmail.com> wrote:

> I guess Paul wanted to know which book is cited by a given Wikipedia page (e.g. page A cites book x). If I am not wrong, by asking for template transclusions you only get the first part of the triple (page A).
>
> Paul, your use case is interesting. At the moment we are not dealing with the {{cite}} template, nor {{cite book}} etc. We are looking into extensions which could support similar use cases anyway. Also please note that at the moment the framework does not handle references either (i.e. what is inside <ref></ref>) when using the SimpleWikiParser [1]. From a quick exploration I see this template is used mainly for references.
>
> What exactly do you mean when you talk about the Wikipedia HTML? Do you refer to HTML dumps of the whole Wikipedia?
>
> Cheers
> Andrea
>
> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/simple/SimpleWikiParser.scala#L172

--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254, paul.houle on Skype
ontol...@gmail.com
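A minimal sketch of such an HTML link extractor could be as small as the following. Python with BeautifulSoup is assumed here; Paul does not say which HTML 5 parsing library he uses, and the /wiki/ prefix filter is an illustrative assumption.

    from bs4 import BeautifulSoup

    def wiki_links(html):
        """Return the internal link targets found in a rendered Wikipedia page."""
        soup = BeautifulSoup(html, "html.parser")
        targets = set()
        for a in soup.find_all("a", href=True):
            href = a["href"]
            if href.startswith("/wiki/"):              # internal wiki links only
                targets.add(href[len("/wiki/"):].split("#")[0])
        return sorted(targets)

Because it works on the rendered page, the same function applies whether the HTML comes from a static dump, a crawl of the live site, or a local MediaWiki instance, and it also sees links that templates generate.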
Re: [Dbpedia-discussion] Pagelinks dataset
Hi Dario,

the dataset you are using is extracted by org.dbpedia.extraction.mappings.PageLinksExtractor [1]. This extractor collects internal wiki links [2] from Wikipedia content articles (that is, Wikipedia pages which belong to the Main namespace [3]) to other Wikipedia pages (note that the targets are not necessarily content articles, because links to pages in the File or Category namespaces are also collected).

Each row in the Pagelinks dataset (a triple of subject, predicate and object) represents a directed link between two pages, e.g.

<http://dbpedia.org/resource/Albedo> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Latin> .

means that an internal link to http://en.wikipedia.org/wiki/Latin was found in http://en.wikipedia.org/wiki/Albedo. You can check that this link exists in the first sentence of [6]. Basically this can be modeled in a directed graph as an edge Albedo -> Latin.

The reason why you see 17M instances (I suppose you are counting the nodes in your graph) is that the objects of the triples can be outside the Main namespace. As far as I remember, the 4M articles are the wiki pages which belong to the Main namespace and which are neither redirects [4] nor disambiguation pages [5].

Hope this clarifies things a bit :-)

Cheers
Andrea

[1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/PageLinksExtractor.scala
[2] https://en.wikipedia.org/wiki/Help:Link
[3] https://en.wikipedia.org/wiki/Wikipedia:Main_namespace
[4] https://en.wikipedia.org/wiki/Wikipedia:Redirect
[5] https://en.wikipedia.org/wiki/Wikipedia:Disambiguation
[6] https://en.wikipedia.org/wiki/Albedo

2013/12/2 Dario Garcia Gasulla <dar...@lsi.upc.edu>:

> Hi, I'm Dario Garcia-Gasulla, an AI researcher at Barcelona Tech (UPC). I'm currently doing research on very large directed graphs and I am using one of your datasets for testing. Concretely, I am using the Wikipedia Pagelinks dataset as available on the DBpedia web site. Unfortunately the description of the dataset is not very detailed:
>
> Wikipedia Pagelinks
> "Dataset containing internal links between DBpedia instances. The dataset was created from the internal links between Wikipedia articles. The dataset might be useful for structural analysis, data mining or for ranking DBpedia instances using Page Rank or similar algorithms."
>
> I wonder if you could give me more information on how the dataset was built and what composes it. I understand Wikipedia has 4M articles and 31M pages, while this dataset has 17M instances and 130M links (I couldn't find the number of links in Wikipedia). What's the relation between the two? Could someone briefly explain the nature of the Pagelinks dataset and the differences from Wikipedia?
>
> Thank you for your time,
> Dario.
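For the counting question, a rough sketch of reading the pagelinks dump as a directed graph follows. Python is assumed; the file name is a guess at the English pagelinks dump, and a real loader should use a proper N-Triples parser (e.g. rdflib) rather than this naive line splitting.

    import bz2

    def iter_edges(path="page_links_en.nt.bz2"):
        """Yield (subject, object) URI pairs from an N-Triples pagelinks dump."""
        with bz2.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.startswith("#"):
                    continue
                parts = line.split()          # "<s> <p> <o> ." on each line
                if len(parts) < 4:
                    continue
                subj, obj = parts[0].strip("<>"), parts[2].strip("<>")
                yield subj, obj               # edge: Albedo -> Latin, etc.

    nodes, edge_count = set(), 0
    for s, o in iter_edges():
        nodes.update((s, o))
        edge_count += 1
    print(len(nodes), "nodes,", edge_count, "edges")

Counting distinct subjects and objects this way is what yields the ~17M nodes Dario mentions, since objects outside the Main namespace are included.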
Re: [Dbpedia-discussion] Pagelinks dataset
Something I found out recently is that the page links don't capture links that are generated by macros; in particular, almost all of the links to pages like http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1 don't show up, because they are generated by the {{cite}} macro. These can easily be extracted from the Wikipedia HTML of course, which is what I did to pull off this project:

http://blog.databaseanimals.com/the-top-most-cited-books-in-wikipedia
http://blog.databaseanimals.com/true-semantic-advertising

On Tue, Dec 3, 2013 at 4:32 AM, Andrea Di Menna <ninn...@gmail.com> wrote:

> Hi Dario,
>
> the dataset you are using is extracted by org.dbpedia.extraction.mappings.PageLinksExtractor [1]. This extractor collects internal wiki links [2] from Wikipedia content articles (that is, Wikipedia pages which belong to the Main namespace [3]) to other Wikipedia pages.
--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254, paul.houle on Skype
ontol...@gmail.com
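A sketch of how those template-generated book links can be pulled from the rendered HTML is below. BeautifulSoup is an assumption, and this is an illustration of the idea rather than the code behind the linked posts.

    from bs4 import BeautifulSoup

    def book_source_links(html):
        """Return Special:BookSources targets (one per cited ISBN) in a page."""
        soup = BeautifulSoup(html, "html.parser")
        return sorted({a["href"] for a in soup.find_all("a", href=True)
                       if "/wiki/Special:BookSources/" in a["href"]})

Each Special:BookSources target carries the cited ISBN, which is exactly the part that wikitext-based page-link extraction misses.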
Re: [Dbpedia-discussion] Pagelinks dataset
On Tue, Dec 3, 2013 at 1:44 PM, Paul Houle <ontolo...@gmail.com> wrote:

> Something I found out recently is that the page links don't capture links that are generated by macros; in particular, almost all of the links to pages like http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1 don't show up, because they are generated by the {{cite}} macro. These can easily be extracted from the Wikipedia HTML of course,

That's good to know, but couldn't you get this directly from the Wikimedia API, without resorting to HTML parsing, by asking for template calls to http://en.wikipedia.org/wiki/Template:Cite ?

Tom
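For reference, a sketch of the API route Tom suggests is shown below, using the MediaWiki list=embeddedin query to find pages that transclude a citation template. Continuation handling is omitted, and, as Andrea points out earlier in the thread, this returns the citing pages but not which book each one cites.

    import json
    import urllib.parse
    import urllib.request

    def pages_transcluding(template="Template:Cite book", limit=50):
        """List pages that transclude the given template via list=embeddedin."""
        params = urllib.parse.urlencode({
            "action": "query",
            "list": "embeddedin",
            "eititle": template,
            "eilimit": limit,
            "format": "json",
        })
        req = urllib.request.Request(
            "https://en.wikipedia.org/w/api.php?" + params,
            headers={"User-Agent": "template-transclusion-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        return [page["title"] for page in data["query"]["embeddedin"]]

    # e.g. pages_transcluding("Template:Cite book", limit=10)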