On 3/19/2011 9:47 AM, Gabriele Kahlout wrote:


On Sat, Mar 19, 2011 at 2:13 PM, Gabriele Kahlout <[email protected]> wrote:

    Hello,

    I've downloaded this DBpedia file
    <http://downloads.dbpedia.org/3.6/en/wikipedia_links_en.nt.bz2>
    and written a simple parser to extract Wikipedia URLs from it, as
    shown below. I find the result unsatisfactory since it contains
    many duplicates. Adding logic to the parser to avoid them (by
    remembering URLs already seen) also seems very expensive, since
    the uncompressed file is 3 GB. Is there a better approach to get
    Wikipedia URLs, like is done with dmoz in

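The parser itself isn't reproduced in the quote, but as a rough sketch of the approach being described (assuming the Wikipedia URL appears inside angle brackets on each N-Triples line, and using a hypothetical local filename), extraction plus de-duplication could look something like this in Python:

import bz2
import re

NT_FILE = "wikipedia_links_en.nt.bz2"   # assumed local copy of the DBpedia file
WIKI_URL = re.compile(r"<(http://en\.wikipedia\.org/wiki/[^>]+)>")

seen = set()
with bz2.open(NT_FILE, "rt", encoding="utf-8", errors="replace") as f:
    for line in f:                       # stream the file, never load it whole
        for url in WIKI_URL.findall(line):
            if url not in seen:          # print each Wikipedia URL only once
                seen.add(url)
                print(url)

The set of a few million article URLs is the main memory cost; if that is still too heavy, drop the set, print every match, and pipe the output through "sort -u", which trades memory for a disk-based sort.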

If you really want to work with all the text in Wikipedia, you'd get quicker results by using the XML dump files than you would by crawling or using the API. I just heard one person estimate that he'd need about 75 days to get everything out of Wikipedia via the API. See

http://en.wikipedia.org/wiki/Wikipedia:Database_download

Note that you probably don't want the 'complete' dump: it contains full page history, is about 300 GB in size, and isn't even entirely 'complete'. The 'articles' dump, on the other hand, is reasonable to handle. I've got scripts that can download it and do some light extraction on it in less than two hours.
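Those scripts aren't included here, but as a rough sketch of the kind of light extraction that is practical on the 'articles' dump (assuming it has already been downloaded as enwiki-latest-pages-articles.xml.bz2), the file can be streamed page by page in Python without ever holding it in memory:

import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"   # assumed local filename

with bz2.open(DUMP, "rb") as f:
    for _, elem in ET.iterparse(f):             # yields each element as its closing tag is read
        tag = elem.tag.rsplit("}", 1)[-1]       # strip the XML namespace prefix
        if tag == "title":
            print(elem.text)                    # one page title per line
        elif tag == "page":
            elem.clear()                        # free the page's subtree as we go

From each title, a Wikipedia URL is just http://en.wikipedia.org/wiki/ plus the title with spaces replaced by underscores (a few characters also need percent-encoding), which gives roughly the same URL list as the .nt file, without the duplicates.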