@Andrea,

        there are old static dumps available, but I can say that
running a web crawler is not at all difficult.  I got a list of topics
by looking at the subjects (?s) of DBpedia descriptions and then wrote
a very simple single-threaded crawler that took a few days to run on a
micro instance in AWS.
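
       A sketch of that first step, pulling the subjects out of the
public DBpedia SPARQL endpoint (Python with the requests library;
illustrative, not the exact query I ran):

    import requests

    # Distinct subjects (?s) that carry an abstract, i.e. the topics
    # that have a DBpedia description.  Page through with LIMIT/OFFSET
    # to get the full list.
    query = """
        SELECT DISTINCT ?s WHERE {
            ?s <http://dbpedia.org/ontology/abstract> ?abstract .
        } LIMIT 10000
    """
    resp = requests.get(
        "http://dbpedia.org/sparql",
        params={"query": query,
                "format": "application/sparql-results+json"},
    )
    for binding in resp.json()["results"]["bindings"]:
        print(binding["s"]["value"])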

       The key to writing a successful web crawler is keeping it
simple.
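
       In that spirit, the crawler itself can be little more than a
loop with a politeness delay.  A minimal sketch (Python with requests;
the User-Agent and delay are placeholders, not what I actually ran):

    import time
    import requests

    def crawl(urls, delay=1.0):
        """Fetch a fixed list of URLs, single-threaded and polite."""
        for url in urls:
            try:
                resp = requests.get(
                    url, timeout=30,
                    headers={"User-Agent": "simple-crawler/0.1"})
            except requests.RequestException:
                continue  # log and move on; one bad page shouldn't kill a run
            yield url, resp.text
            time.sleep(delay)  # one request per `delay` seconds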

On Dec 5, 2013 4:23 AM, "Andrea Di Menna" <ninn...@gmail.com> wrote:
>
> 2013/12/4 Paul Houle <ontolo...@gmail.com>
>>
>> I think I could get this data out of some API, but there are great
>> HTML5 parsing libraries now, so a link extractor from HTML can be
>> built as quickly as an API client.
>>
>> There are two big advantages of looking at links in HTML: (i) you can
>> use the same software to analyze multiple sites, and (ii) the HTML
>> output is often the most tested output of a system.  The latter
>> matters especially for Wikipedia markup, which has no formal
>> specification: editors aren't concerned with whether the markup is
>> clean, but they will fix problems that make the HTML look wrong.
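>>
>> A sketch of such a link extractor (assuming Python with requests and
>> beautifulsoup4; the same few lines work unchanged on any site):
>>
>>     import urllib.parse
>>     import requests
>>     from bs4 import BeautifulSoup
>>
>>     def extract_links(url):
>>         """Return the absolute URL of every <a href> on the page."""
>>         html = requests.get(url, timeout=30).text
>>         soup = BeautifulSoup(html, "html.parser")
>>         return [urllib.parse.urljoin(url, a["href"])
>>                 for a in soup.find_all("a", href=True)]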
>>
>> Another advantage of HTML is that you can work from a static dump
>> file,
>
>
> Where can you get such a dump from?
>
>>
>> or run a web crawler against the real Wikipedia
>
>
> Seems impractical
>
>>
>> or against a
>> local copy of Wikipedia loaded from the database dump files.
>
>
> Pretty slow, isn't it?
>
> Cheers!
> Andrea
>
>>
>>
>>
>>
>> On Tue, Dec 3, 2013 at 2:30 PM, Andrea Di Menna <ninn...@gmail.com> wrote:
>> > I guess Paul wanted to know which book is cited by one Wikipedia
>> > page (e.g. page A cites book x).
>> > If I am not wrong, by asking for template transclusions you only
>> > get the first part of the triple (page A).
>> >
>> > Paul, your use case is interesting.
>> > At the moment we are not dealing with the {{cite}} template nor
>> > {{cite book}} etc.
>> > We are looking into extensions which could support similar use
>> > cases anyway.
>> >
>> > Also please note that at the moment the framework does not handle
>> > references either (i.e. what is inside <ref></ref>) when using the
>> > SimpleWikiParser [1].
>> > From a quick exploration I see this template is used mainly for
>> > references.
>> >
>> > What exactly do you mean when you talk about "Wikipedia HTML"? Do
>> > you refer to HTML dumps of the whole Wikipedia?
>> >
>> > Cheers
>> > Andrea
>> >
>> > [1]
>> > https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/impl/simple/SimpleWikiParser.scala#L172
>> >
>> >
>> > 2013/12/3 Tom Morris <tfmor...@gmail.com>
>> >>
>> >> On Tue, Dec 3, 2013 at 1:44 PM, Paul Houle <ontolo...@gmail.com> wrote:
>> >>>
>> >>> Something I found out recently is that the page links don't capture
>> >>> links that are generated by macros,  in particular almost all of the
>> >>> links to pages like
>> >>>
>> >>> http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1
>> >>>
>> >>> don't show up because they are generated by the {{cite}} macro.
>> >>> These can be easily extracted from the Wikipedia HTML, of course.
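>> >>>
>> >>> For example, filtering any list of extracted hrefs down to the
>> >>> ISBNs (a sketch; extract_links is the kind of helper sketched
>> >>> above, and page_url is whatever article is being scanned):
>> >>>
>> >>>     isbn_links = [u for u in extract_links(page_url)
>> >>>                   if "/wiki/Special:BookSources/" in u]
>> >>>     isbns = [u.rsplit("/", 1)[1] for u in isbn_links]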
>> >>
>> >>
>> >> That's good to know, but couldn't you get this directly from the
>> >> Wikimedia API, without resorting to HTML parsing, by asking for
>> >> template calls to http://en.wikipedia.org/wiki/Template:Cite ?
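>> >>
>> >> That lookup is the MediaWiki API's "embeddedin" (transclusion)
>> >> list.  A minimal sketch, assuming Python with requests (using
>> >> Template:Cite book as an example target; 500 per request is the
>> >> cap, so real code would follow the continuation token):
>> >>
>> >>     import requests
>> >>
>> >>     resp = requests.get(
>> >>         "https://en.wikipedia.org/w/api.php",
>> >>         params={"action": "query",
>> >>                 "list": "embeddedin",
>> >>                 "eititle": "Template:Cite book",
>> >>                 "eilimit": "500",
>> >>                 "format": "json"})
>> >>     for page in resp.json()["query"]["embeddedin"]:
>> >>         print(page["title"])  # only the citing page, i.e. "A"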
>> >>
>> >> Tom
>> >
>> >
>>
>>
>>
>> --
>> Paul Houle
>> Expert on Freebase, DBpedia, Hadoop and RDF
>> (607) 539 6254    paul.houle on Skype   ontol...@gmail.com
>
>