Re: [Dbpedia-discussion] Pagelinks dataset
2013/12/4 Paul Houle <ontolo...@gmail.com>:

> I think I could get this data out of some API, but there are great HTML 5 parsing libraries now, so a link extractor for HTML can be built as quickly as an API client. There are two big advantages of looking at links in the HTML: (i) you can use the same software to analyze multiple sites, and (ii) the HTML output is often the most-tested output of a system. This matters particularly for Wikipedia markup, which has no formal specification: editors aren't concerned with whether the markup is clean, but they will fix problems that make the HTML look wrong.
>
> Another advantage of HTML is that you can work from a static dump file,

Where can you get such a dump from?

> or run a web crawler against the real Wikipedia

That seems not practical.

> or against a local copy of Wikipedia loaded from the database dump files.

Pretty slow, isn't it?

Cheers!
Andrea
Re: [Dbpedia-discussion] Pagelinks dataset
@Paul, unfortunately HTML Wikipedia dumps are not released anymore (the ones that exist are old static dumps, as you said). This is a problem for a project like DBpedia, as you can easily understand.

Moreover, I did not mean that it is impossible to crawl the Wikipedia instances or to load a dump into a private MediaWiki instance (the latter is what happens when abstracts are extracted); I am just saying that this is probably not practical for a project like DBpedia, which extracts data from multiple Wikipedias.

Cheers
Andrea

2013/12/5 Paul Houle <ontolo...@gmail.com>:

> @Andrea, there are old static dumps available, but I can say that running the web crawler is not at all difficult. I got a list of topics by looking at the ?s for the DBpedia descriptions and then wrote a very simple single-threaded crawler that took a few days to run on a micro instance in AWS. The main key to writing a successful web crawler is keeping it simple.
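A minimal sketch of the kind of "keep it simple" single-threaded crawler Paul describes might look like the following. Python is assumed; the user-agent string, delay, and output layout are illustrative guesses, not details of his actual crawler.

    import os
    import time
    import urllib.parse
    import urllib.request

    def crawl(titles, out_dir="pages", delay=1.0):
        """Fetch the rendered HTML of each article, one request at a time."""
        os.makedirs(out_dir, exist_ok=True)
        for title in titles:
            url = "https://en.wikipedia.org/wiki/" + urllib.parse.quote(title)
            req = urllib.request.Request(
                url, headers={"User-Agent": "simple-link-crawler/0.1 (research use)"})
            try:
                with urllib.request.urlopen(req) as resp:
                    html = resp.read()
            except Exception as exc:
                print("skipping", title, exc)   # keep going on individual failures
                continue
            name = urllib.parse.quote(title, safe="") + ".html"
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(html)
            time.sleep(delay)                   # single-threaded and polite

    # The topic list would come from the subjects of the DBpedia descriptions,
    # e.g. crawl(["Albedo", "Latin"])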
Re: [Dbpedia-discussion] Pagelinks dataset
The DBpedia way of extracting the citations would probably be to build something that treats citations the way infoboxes are treated. It's one way of doing things, and it has its own integrity, but it's not the way I do things. (DBpedia does it this way about as well as it can be done; why try to beat it?)

A few years back I wrote a very elaborate Wikipedia markup parser in .NET; it used a recursive descent parser and lots and lots of heuristics to deal with special cases. The purpose of it was to accurately parse author and licensing metadata from Wikimedia Commons when ingesting images into Ookaboo. I had to handle all those special cases because Wikipedia markup doesn't have a formal spec. I quickly ran into diminishing returns, where I had to work harder and harder to improve recall while getting deteriorating results.

I later wrote a very simple parser for Flickr which just parsed the HTML and took advantage of the cool URIs published by Flickr. Today I think of it as pretending that the Linked Data revolution has already arrived, because if you look at the link graph of Flickr, there is a subset of it which isn't very different from the link graph of Ookaboo. Anyway, I needed to pull some stuff out of Wikimedia Commons, and it took me 20 minutes to modify the Flickr parser to work for Commons and get at least 80% of the recall that the old parser got.

On Thu, Dec 5, 2013 at 10:29 AM, Andrea Di Menna <ninn...@gmail.com> wrote:

> @Paul, unfortunately HTML Wikipedia dumps are not released anymore (the ones that exist are old static dumps, as you said). This is a problem for a project like DBpedia, as you can easily understand.
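As an illustration of the HTML-first approach Paul describes for Flickr and Commons, a sketch along these lines could pull licensing links straight out of a rendered Commons file page. Python with BeautifulSoup is assumed, and the creativecommons.org heuristic is an illustrative stand-in, not the Ookaboo parser.

    import urllib.request
    from bs4 import BeautifulSoup

    def license_links(file_page_url):
        """Return outbound links on a Commons file page that point at a
        Creative Commons license deed -- the rendered HTML already carries them."""
        req = urllib.request.Request(
            file_page_url, headers={"User-Agent": "html-metadata-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            soup = BeautifulSoup(resp.read(), "html.parser")
        return sorted({a["href"] for a in soup.find_all("a", href=True)
                       if "creativecommons.org/licenses/" in a["href"]})

    # e.g. license_links("https://commons.wikimedia.org/wiki/File:Example.jpg")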
Re: [Dbpedia-discussion] Pagelinks dataset
I think I could get this data out of some API, but there are great HTML 5 parsing libraries now, so a link extractor for HTML can be built as quickly as an API client. There are two big advantages of looking at links in the HTML: (i) you can use the same software to analyze multiple sites, and (ii) the HTML output is often the most-tested output of a system. This matters particularly for Wikipedia markup, which has no formal specification: editors aren't concerned with whether the markup is clean, but they will fix problems that make the HTML look wrong.

Another advantage of HTML is that you can work from a static dump file, or run a web crawler against the real Wikipedia, or against a local copy of Wikipedia loaded from the database dump files.

On Tue, Dec 3, 2013 at 2:30 PM, Andrea Di Menna <ninn...@gmail.com> wrote:

> I guess Paul wanted to know which book is cited by a given Wikipedia page (e.g. page A cites book x). If I am not wrong, by asking for template transclusions you only get the first part of the triple (page A).
>
> Paul, your use case is interesting. At the moment we are not dealing with the {{cite}} template, nor {{cite book}} etc. We are looking into extensions which could support similar use cases anyway. Also please note that at the moment the framework does not handle references either (i.e. what is inside <ref></ref>) when using the SimpleWikiParser [1]. From a quick exploration I see this template is used mainly for references.
>
> What exactly do you mean when you talk about the Wikipedia HTML? Do you refer to HTML dumps of the whole Wikipedia?
>
> Cheers
> Andrea
>
> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/simple/SimpleWikiParser.scala#L172

--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254, paul.houle on Skype
ontol...@gmail.com
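A minimal sketch of such an HTML link extractor could be as small as the following. Python with BeautifulSoup is assumed here; Paul does not say which HTML 5 parsing library he uses, and the /wiki/ prefix filter is an illustrative assumption.

    from bs4 import BeautifulSoup

    def wiki_links(html):
        """Return the internal link targets found in a rendered Wikipedia page."""
        soup = BeautifulSoup(html, "html.parser")
        targets = set()
        for a in soup.find_all("a", href=True):
            href = a["href"]
            if href.startswith("/wiki/"):              # internal wiki links only
                targets.add(href[len("/wiki/"):].split("#")[0])
        return sorted(targets)

Because it works on the rendered page, the same function applies whether the HTML comes from a static dump, a crawl of the live site, or a local MediaWiki instance, and it also sees links that templates generate.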
Re: [Dbpedia-discussion] Pagelinks dataset
Hi Dario,

the dataset you are using is extracted by org.dbpedia.extraction.mappings.PageLinksExtractor [1]. This extractor collects internal wiki links [2] from Wikipedia content articles (that is, Wikipedia pages which belong to the Main namespace [3]) to other Wikipedia pages (note that the targets are not necessarily content articles, because links to pages in the File or Category namespaces are also collected).

Each row in the Pagelinks dataset (a triple of subject, predicate and object) represents a directed link between two pages, e.g.

<http://dbpedia.org/resource/Albedo> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Latin> .

means that an internal link to http://en.wikipedia.org/wiki/Latin was found in http://en.wikipedia.org/wiki/Albedo. You can check that this link exists in the first sentence of [6]. Basically this can be modeled in a directed graph as an edge Albedo -> Latin.

The reason why you see 17M instances (I suppose you are counting the nodes in your graph) is that the objects of the triples can be outside the Main namespace. As far as I remember, the 4M articles are the wiki pages which belong to the Main namespace and which are neither redirects [4] nor disambiguation pages [5].

Hope this clarifies things a bit :-)

Cheers
Andrea

[1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/PageLinksExtractor.scala
[2] https://en.wikipedia.org/wiki/Help:Link
[3] https://en.wikipedia.org/wiki/Wikipedia:Main_namespace
[4] https://en.wikipedia.org/wiki/Wikipedia:Redirect
[5] https://en.wikipedia.org/wiki/Wikipedia:Disambiguation
[6] https://en.wikipedia.org/wiki/Albedo

2013/12/2 Dario Garcia Gasulla <dar...@lsi.upc.edu>:

> Hi, I'm Dario Garcia-Gasulla, an AI researcher at Barcelona Tech (UPC). I'm currently doing research on very large directed graphs and I am using one of your datasets for testing. Concretely, I am using the Wikipedia Pagelinks dataset as available on the DBpedia web site. Unfortunately the description of the dataset is not very detailed:
>
> Wikipedia Pagelinks
> "Dataset containing internal links between DBpedia instances. The dataset was created from the internal links between Wikipedia articles. The dataset might be useful for structural analysis, data mining or for ranking DBpedia instances using Page Rank or similar algorithms."
>
> I wonder if you could give me more information on how the dataset was built and what composes it. I understand Wikipedia has 4M articles and 31M pages, while this dataset has 17M instances and 130M links (I couldn't find the number of links in Wikipedia). What's the relation between the two? Could someone briefly explain the nature of the Pagelinks dataset and the differences from Wikipedia?
>
> Thank you for your time,
> Dario.
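For the counting question, a rough sketch of reading the pagelinks dump as a directed graph follows. Python is assumed; the file name is a guess at the English pagelinks dump, and a real loader should use a proper N-Triples parser (e.g. rdflib) rather than this naive line splitting.

    import bz2

    def iter_edges(path="page_links_en.nt.bz2"):
        """Yield (subject, object) URI pairs from an N-Triples pagelinks dump."""
        with bz2.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.startswith("#"):
                    continue
                parts = line.split()          # "<s> <p> <o> ." on each line
                if len(parts) < 4:
                    continue
                subj, obj = parts[0].strip("<>"), parts[2].strip("<>")
                yield subj, obj               # edge: Albedo -> Latin, etc.

    nodes, edge_count = set(), 0
    for s, o in iter_edges():
        nodes.update((s, o))
        edge_count += 1
    print(len(nodes), "nodes,", edge_count, "edges")

Counting distinct subjects and objects this way is what yields the ~17M nodes Dario mentions, since objects outside the Main namespace are included.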
Re: [Dbpedia-discussion] Pagelinks dataset
Something I found out recently is that the page links don't capture links that are generated by macros; in particular, almost all of the links to pages like http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1 don't show up, because they are generated by the {{cite}} macro. These can easily be extracted from the Wikipedia HTML of course, which is what I did to pull off this project:

http://blog.databaseanimals.com/the-top-most-cited-books-in-wikipedia
http://blog.databaseanimals.com/true-semantic-advertising

On Tue, Dec 3, 2013 at 4:32 AM, Andrea Di Menna <ninn...@gmail.com> wrote:

> Hi Dario,
>
> the dataset you are using is extracted by org.dbpedia.extraction.mappings.PageLinksExtractor [1]. This extractor collects internal wiki links [2] from Wikipedia content articles (that is, Wikipedia pages which belong to the Main namespace [3]) to other Wikipedia pages.
--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254, paul.houle on Skype
ontol...@gmail.com
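A sketch of how those template-generated book links can be pulled from the rendered HTML is below. BeautifulSoup is an assumption, and this is an illustration of the idea rather than the code behind the linked posts.

    from bs4 import BeautifulSoup

    def book_source_links(html):
        """Return Special:BookSources targets (one per cited ISBN) in a page."""
        soup = BeautifulSoup(html, "html.parser")
        return sorted({a["href"] for a in soup.find_all("a", href=True)
                       if "/wiki/Special:BookSources/" in a["href"]})

Each Special:BookSources target carries the cited ISBN, which is exactly the part that wikitext-based page-link extraction misses.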
Re: [Dbpedia-discussion] Pagelinks dataset
On Tue, Dec 3, 2013 at 1:44 PM, Paul Houle <ontolo...@gmail.com> wrote:

> Something I found out recently is that the page links don't capture links that are generated by macros; in particular, almost all of the links to pages like http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1 don't show up, because they are generated by the {{cite}} macro. These can easily be extracted from the Wikipedia HTML of course,

That's good to know, but couldn't you get this directly from the Wikimedia API, without resorting to HTML parsing, by asking for template calls to http://en.wikipedia.org/wiki/Template:Cite ?

Tom
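For reference, a sketch of the API route Tom suggests is shown below, using the MediaWiki list=embeddedin query to find pages that transclude a citation template. Continuation handling is omitted, and, as Andrea points out earlier in the thread, this returns the citing pages but not which book each one cites.

    import json
    import urllib.parse
    import urllib.request

    def pages_transcluding(template="Template:Cite book", limit=50):
        """List pages that transclude the given template via list=embeddedin."""
        params = urllib.parse.urlencode({
            "action": "query",
            "list": "embeddedin",
            "eititle": template,
            "eilimit": limit,
            "format": "json",
        })
        req = urllib.request.Request(
            "https://en.wikipedia.org/w/api.php?" + params,
            headers={"User-Agent": "template-transclusion-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        return [page["title"] for page in data["query"]["embeddedin"]]

    # e.g. pages_transcluding("Template:Cite book", limit=10)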