Nick Johnson wrote:
> On Jun 10, 7:13 pm, Ralf Schmitt <[email protected]> wrote:
> > On Wed, Jun 10, 2009 at 8:02 PM, Nick Johnson <[email protected]> wrote:
> > >
> > > I see that the cdb format is documented as having a limitation of 4GB
> > > per CDB database. Will this be a problem processing the Wikipedia
> > > dump? If it's not 4GB yet, it certainly will be soon.
> >
> > the cdb file is only used as an index and does not store article data,
> > so it should be no problem.
>
> Hm. And more reading reveals that the actual limit in CDB appears to
> be per-record rather than per-file.
>
> More alarming is the fact that dumpparser uses a DOM-based parser
> (ElementTree). I'm fairly sure that I can't store the DOM of the
> entirety of Wikipedia in memory.
Apologies again. Now that I've finished downloading and started processing the dump, I see it's using some sort of iterative approach I didn't know ElementTree possessed. Thanks for all your help, though. :)

-Nick Johnson

You received this message because you are subscribed to the Google Groups "mwlib" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/mwlib?hl=en
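[Editor's note: the "iterative approach" referred to above is presumably ElementTree's iterparse, which yields elements as their closing tags are parsed rather than building the whole tree first. A minimal sketch of the idea follows; the <page>/<title> element names loosely mirror the MediaWiki dump schema (real dumps also carry an XML namespace), and the sample data here is invented for illustration — this is not mwlib's actual dumpparser code.]

```python
import io
import xml.etree.ElementTree as ET

# Invented stand-in for a (much larger) MediaWiki-style dump.
sample = b"""<mediawiki>
  <page><title>Foo</title><text>article one</text></page>
  <page><title>Bar</title><text>article two</text></page>
</mediawiki>"""

titles = []
# events=("end",) fires once each element is fully parsed, so we can
# process a <page> and then discard it before the next one arrives.
for event, elem in ET.iterparse(io.BytesIO(sample), events=("end",)):
    if elem.tag == "page":
        titles.append(elem.findtext("title"))
        elem.clear()  # drop the subtree so memory stays bounded

print(titles)  # ['Foo', 'Bar']
```

Clearing each processed element is what keeps this from degenerating back into a full in-memory DOM: iterparse still appends children to the root as it goes, so without clear() a whole-Wikipedia parse would exhaust memory anyway.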
