On 3/19/2011 9:47 AM, Gabriele Kahlout wrote:
On Sat, Mar 19, 2011 at 2:13 PM, Gabriele Kahlout
<[email protected]> wrote:
Hello,
I've downloaded this dbpedia file
<http://downloads.dbpedia.org/3.6/en/wikipedia_links_en.nt.bz2> and
written a simple parser to extract Wikipedia URLs from it, as shown
below. I find the result unsatisfactory since it contains many
duplicates, and adding logic to the parser to avoid them (by
remembering URLs already seen) also seems very expensive, since the
uncompressed file is 3 GB. Is there a better approach to getting
Wikipedia URLs, like is done with dmoz in
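A minimal sketch of such a deduplicating parser (in Python; it assumes
the Wikipedia links appear as <http://en.wikipedia.org/wiki/...> URIs
in the N-Triples file and that the set of distinct article URLs fits
in memory, even though the raw file does not):

import bz2
import re

# Assumed file name and link form: the dump is read straight from the
# .bz2, and a Wikipedia link is any en.wikipedia.org URI in a triple.
WIKI_URL = re.compile(r"<(http://en\.wikipedia\.org/wiki/[^>]+)>")

def wikipedia_urls(path="wikipedia_links_en.nt.bz2"):
    seen = set()  # distinct article URLs only, far smaller than the file
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            for url in WIKI_URL.findall(line):
                if url not in seen:
                    seen.add(url)
                    yield url

if __name__ == "__main__":
    for url in wikipedia_urls():
        print(url)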
If you really want to work with all the text in Wikipedia, you'd
get quicker results by using the XML dump files than you would by
crawling or using the API. I just heard one person estimate that he'd
need about 75 days to get everything out of Wikipedia via the API. See
http://en.wikipedia.org/wiki/Wikipedia:Database_download.
Note that you probably don't want the 'complete' dump: it contains the
full edit history, is about 300 GB in size, and isn't even entirely
'complete'. The 'articles' dump, on the other hand, is reasonable to
handle. I've got
scripts that can download it and do some light extraction on it in less
than two hours.
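Roughly the kind of thing those scripts do (this is a sketch in
Python, not the actual scripts; the dump URL and XML element names
are the usual ones for the enwiki pages-articles export, but check
the download page for the current file names):

import bz2
import urllib.request
import xml.etree.ElementTree as ET

DUMP_URL = ("http://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")
DUMP_FILE = "enwiki-latest-pages-articles.xml.bz2"

def localname(tag):
    """Strip the XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def download():
    urllib.request.urlretrieve(DUMP_URL, DUMP_FILE)

def titles(path=DUMP_FILE):
    """Stream article titles out of the compressed dump."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if localname(elem.tag) == "title":
                yield elem.text
            elif localname(elem.tag) == "page":
                elem.clear()  # discard each page once processed

if __name__ == "__main__":
    for t in titles():
        print(t)

The point is that the dump can be processed as a stream, so neither
the download nor the extraction step needs to hold the whole file in
memory.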