Nick Johnson wrote:
> On Jun 10, 7:13 pm, Ralf Schmitt <[email protected]> wrote:
> > On Wed, Jun 10, 2009 at 8:02 PM, Nick Johnson <[email protected]> wrote:
> > >
> > > I see that the cdb format is documented as having a limitation of 4GB
> > > per CDB database. Will this be a problem processing the Wikipedia
> > > dump? If it's not 4GB yet, it certainly will be soon.
> >
> > the cdb file is only used as an index and does not store article data,
> > so it should be no problem.
>
> Hm. And more reading reveals that the actual limit in CDB appears to
> be per-record rather than per-file.
>
> More alarming is the fact that dumpparser uses a DOM-based parser
> (ElementTree). I'm fairly sure that I can't store the DOM of the
> entirety of Wikipedia in memory.
Apologies again. Now that I've finished downloading and started processing the dump, I see it's using some sort of iterative approach I didn't know ElementTree possessed. Thanks for all your help, though. :)

-Nick Johnson

You received this message because you are subscribed to the Google Groups "mwlib" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/mwlib?hl=en
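[Editor's note: the "iterative approach" referred to above is presumably ElementTree's iterparse, which yields elements as their closing tags are parsed rather than building the whole tree first. A minimal sketch of the idea follows; the <page>/<title> element names loosely mirror the MediaWiki dump schema (real dumps also carry an XML namespace), and the sample data here is invented for illustration — this is not mwlib's actual dumpparser code.]

```python
import io
import xml.etree.ElementTree as ET

# Invented stand-in for a (much larger) MediaWiki-style dump.
sample = b"""<mediawiki>
  <page><title>Foo</title><text>article one</text></page>
  <page><title>Bar</title><text>article two</text></page>
</mediawiki>"""

titles = []
# events=("end",) fires once each element is fully parsed, so we can
# process a <page> and then discard it before the next one arrives.
for event, elem in ET.iterparse(io.BytesIO(sample), events=("end",)):
    if elem.tag == "page":
        titles.append(elem.findtext("title"))
        elem.clear()  # drop the subtree so memory stays bounded

print(titles)  # ['Foo', 'Bar']
```

Clearing each processed element is what keeps this from degenerating back into a full in-memory DOM: iterparse still appends children to the root as it goes, so without clear() a whole-Wikipedia parse would exhaust memory anyway.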
