On 3/19/2011 9:47 AM, Gabriele Kahlout wrote:


On Sat, Mar 19, 2011 at 2:13 PM, Gabriele Kahlout <[email protected]> wrote:

    Hello,

    I've downloaded this DBpedia file
    <http://downloads.dbpedia.org/3.6/en/wikipedia_links_en.nt.bz2>
    and written a simple parser to extract Wikipedia URLs from it, as
    shown below. I find the result unsatisfactory since it contains
    many duplicates. Adding logic to the parser to avoid them (by
    remembering URLs already seen) also seems very expensive, since
    the uncompressed file is 3 GB. Is there a better approach to get
    Wikipedia URLs, like is done with dmoz in

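The parser itself isn't reproduced in the quote, but as a rough sketch of the approach being described (assuming the Wikipedia URL appears inside angle brackets on each N-Triples line, and using a hypothetical local filename), extraction plus de-duplication could look something like this in Python:

import bz2
import re

NT_FILE = "wikipedia_links_en.nt.bz2"   # assumed local copy of the DBpedia file
WIKI_URL = re.compile(r"<(http://en\.wikipedia\.org/wiki/[^>]+)>")

seen = set()
with bz2.open(NT_FILE, "rt", encoding="utf-8", errors="replace") as f:
    for line in f:                       # stream the file, never load it whole
        for url in WIKI_URL.findall(line):
            if url not in seen:          # print each Wikipedia URL only once
                seen.add(url)
                print(url)

The set of a few million article URLs is the main memory cost; if that is still too heavy, drop the set, print every match, and pipe the output through "sort -u", which trades memory for a disk-based sort.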

If you really want to work with all the text in Wikipedia, you'd get quicker results by using the XML dump files than you would by crawling or using the API. I just heard one person estimate that he'd need about 75 days to get everything out of Wikipedia via the API. See

http://en.wikipedia.org/wiki/Wikipedia:Database_download

Note that you probably don't want the 'complete' dump: it contains full page history, is about 300 GB in size, and isn't even entirely 'complete'. The 'articles' dump, on the other hand, is reasonable to handle. I've got scripts that can download it and do some light extraction on it in less than two hours.
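Those scripts aren't included here, but as a rough sketch of the kind of light extraction that is practical on the 'articles' dump (assuming it has already been downloaded as enwiki-latest-pages-articles.xml.bz2), the file can be streamed page by page in Python without ever holding it in memory:

import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"   # assumed local filename

with bz2.open(DUMP, "rb") as f:
    for _, elem in ET.iterparse(f):             # yields each element as its closing tag is read
        tag = elem.tag.rsplit("}", 1)[-1]       # strip the XML namespace prefix
        if tag == "title":
            print(elem.text)                    # one page title per line
        elif tag == "page":
            elem.clear()                        # free the page's subtree as we go

From each title, a Wikipedia URL is just http://en.wikipedia.org/wiki/ plus the title with spaces replaced by underscores (a few characters also need percent-encoding), which gives roughly the same URL list as the .nt file, without the duplicates.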