@Paul,

unfortunately HTML Wikipedia dumps are not released anymore (the available
ones are old static dumps, as you said).
This is a problem for a project like DBpedia, as you can easily understand.

Moreover, I did not mean that it is impossible to crawl Wikipedia
instances or to load a dump into a private MediaWiki instance (the latter
is what happens when abstracts are extracted). I am just saying that this
is probably not practical for a project like DBpedia, which extracts data
from multiple Wikipedias.
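
For reference, loading a dump into a private MediaWiki instance is roughly
the following (the paths and dump file name here are illustrative;
importDump.php is MediaWiki's maintenance importer, and a full English
Wikipedia import can take a very long time):

    php maintenance/importDump.php --conf LocalSettings.php \
        enwiki-latest-pages-articles.xml.bz2
    php maintenance/rebuildrecentchanges.php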

Cheers
Andrea


2013/12/5 Paul Houle <ontolo...@gmail.com>

> @Andrea,
>
>         there are old static dumps available, but I can say that running
> a web crawler is not at all difficult. I got a list of topics by looking
> at the ?s for DBpedia descriptions and then wrote a very simple
> single-threaded crawler that took a few days to run on a micro instance
> in AWS.
>
>        The main key to writing a successful web crawler is keeping it
> simple.
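>
> A minimal sketch of what such a single-threaded crawler could look like
> (Python standard library only; topics.txt, the pages/ directory, and the
> contact address in the User-Agent are hypothetical placeholders):
>
>     import pathlib
>     import time
>     import urllib.parse
>     import urllib.request
>
>     # One article title per line, e.g. derived from the ?s of DBpedia
>     # descriptions via the SPARQL endpoint.
>     topics = pathlib.Path("topics.txt").read_text().splitlines()
>     out = pathlib.Path("pages")
>     out.mkdir(exist_ok=True)
>
>     for title in topics:
>         url = "https://en.wikipedia.org/wiki/" + urllib.parse.quote(title)
>         req = urllib.request.Request(
>             url, headers={"User-Agent": "simple-crawler/0.1 (me@example.org)"})
>         try:
>             html = urllib.request.urlopen(req, timeout=30).read()
>             name = urllib.parse.quote(title, safe="") + ".html"
>             (out / name).write_bytes(html)
>         except Exception as e:
>             print(title, e)   # log the failure and keep going
>         time.sleep(2)         # stay polite: one request every few seconds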
>
> On Dec 5, 2013 4:23 AM, "Andrea Di Menna" <ninn...@gmail.com> wrote:
> >
> > 2013/12/4 Paul Houle <ontolo...@gmail.com>
> >>
> >> I think I could get this data out of some API, but there are great
> >> HTML 5 parsing libraries now, so a link extractor from HTML can be
> >> built as quickly as an API client.
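> >>
> >> A minimal sketch of such a link extractor, using only Python's standard
> >> html.parser (page.html is a hypothetical input file):
> >>
> >>     from html.parser import HTMLParser
> >>
> >>     class LinkExtractor(HTMLParser):
> >>         def __init__(self):
> >>             super().__init__()
> >>             self.links = []
> >>
> >>         def handle_starttag(self, tag, attrs):
> >>             # collect the href of every anchor tag
> >>             if tag == "a":
> >>                 href = dict(attrs).get("href")
> >>                 if href:
> >>                     self.links.append(href)
> >>
> >>     parser = LinkExtractor()
> >>     parser.feed(open("page.html", encoding="utf-8").read())
> >>     # e.g. keep only the template-generated ISBN links:
> >>     books = [h for h in parser.links if "Special:BookSources" in h]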
> >>
> >> There are two big advantages of looking at links in HTML: (i) you can
> >> use the same software to analyze multiple sites, and (ii) the HTML
> >> output is often the most tested output of a system. This is
> >> particularly an issue with Wikipedia markup, which has no formal
> >> specification; the editors aren't concerned with whether the markup is
> >> clean, but they will fix problems if they cause the HTML to look wrong.
> >>
> >> Another advantage of HTML is that you can work from a static dump
> >> file,
> >
> >
> > Where can you get such a dump from?
> >
> >>
> >> or run a web crawler against the real Wikipedia
> >
> >
> > That does not seem practical
> >
> >>
> >> or against a
> >> local copy of Wikipedia loaded from the database dump files.
> >
> >
> > Pretty slow, isn't it?
> >
> > Cheers!
> > Andrea
> >
> >>
> >>
> >>
> >>
> >> On Tue, Dec 3, 2013 at 2:30 PM, Andrea Di Menna <ninn...@gmail.com>
> >> wrote:
> >> > I guess Paul wanted to know which book is cited by a given Wikipedia
> >> > page (e.g. page A cites book x).
> >> > If I am not wrong, by asking for template transclusions you only get
> >> > the first part of the triple (page A).
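> >> >
> >> > For example, a transclusion query against the MediaWiki API (the
> >> > list=embeddedin module) returns only the pages that use the template,
> >> > not the books they cite:
> >> >
> >> >     https://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Cite_book&eilimit=500&format=json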
> >> >
> >> > Paul, your use case is interesting.
> >> > At the moment we are not dealing with the {{cite}} template, nor with
> >> > {{cite book}} etc.
> >> > We are looking into extensions which could support similar use cases
> >> > anyway.
> >> >
> >> > Also please note that at the moment the framework does not handle
> >> > references either (i.e. what is inside <ref></ref>) when using the
> >> > SimpleWikiParser [1].
> >> > From a quick exploration I see this template is used mainly for
> >> > references.
> >> >
> >> > What exactly do you mean when you talk about "Wikipedia HTML"? Do you
> >> > refer to HTML dumps of the whole Wikipedia?
> >> >
> >> > Cheers
> >> > Andrea
> >> >
> >> > [1]
> >> > https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/simple/SimpleWikiParser.scala#L172
> >> >
> >> >
> >> > 2013/12/3 Tom Morris <tfmor...@gmail.com>
> >> >>
> >> >> On Tue, Dec 3, 2013 at 1:44 PM, Paul Houle <ontolo...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> Something I found out recently is that the page links don't capture
> >> >>> links that are generated by macros; in particular, almost all of the
> >> >>> links to pages like
> >> >>>
> >> >>> http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1
> >> >>>
> >> >>> don't show up because they are generated by the {{cite}} macro. These
> >> >>> can be easily extracted from the Wikipedia HTML, of course,
> >> >>
> >> >>
> >> >> That's good to know, but couldn't you get this directly from the
> >> >> Wikimedia API, without resorting to HTML parsing, by asking for
> >> >> template calls to http://en.wikipedia.org/wiki/Template:Cite ?
> >> >>
> >> >> Tom
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Paul Houle
> >> Expert on Freebase, DBpedia, Hadoop and RDF
> >> (607) 539 6254    paul.houle on Skype   ontol...@gmail.com
> >
> >
>