On Jun 10, 7:13 pm, Ralf Schmitt <[email protected]> wrote:
> On Wed, Jun 10, 2009 at 8:02 PM, Nick Johnson <[email protected]> wrote:
>
> > I see that the cdb format is documented as having a limitation of 4GB
> > per CDB database. Will this be a problem processing the Wikipedia
> > dump? If it's not 4GB yet, it certainly will be soon.
>
> the cdb file is only used as an index and does not store article data,
> so it should be no problem.
Hm. And more reading reveals that the actual limit in CDB appears to be per-record rather than per-file. More alarming is the fact that dumpparser uses a DOM-based parser (ElementTree). I'm fairly sure that I can't hold the DOM of the entirety of Wikipedia in memory.

-Nick Johnson
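[For context, the usual way around the DOM-memory problem is ElementTree's incremental `iterparse` API, which the stdlib provides alongside the DOM-style parser. Below is a minimal sketch of that approach; the `<page>`/`<title>` tag names follow the MediaWiki dump layout, but the sketch assumes an un-namespaced document for brevity, and `iter_titles` is a hypothetical helper, not part of dumpparser.]

```python
# Sketch: stream a MediaWiki-style dump with iterparse instead of
# building a full DOM, clearing each <page> subtree after use so
# memory stays bounded regardless of dump size.
import io
import xml.etree.ElementTree as ET

def iter_titles(source):
    """Yield page titles one at a time from an XML dump."""
    for event, elem in ET.iterparse(source, events=("end",)):
        # Strip any "{uri}" namespace prefix before comparing tag names.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "page":
            title = elem.find("title")  # assumes no namespace on children
            if title is not None:
                yield title.text
            elem.clear()  # free the subtree we just processed

# Tiny in-memory example standing in for a real multi-GB dump file.
sample = io.StringIO(
    "<mediawiki>"
    "<page><title>A</title></page>"
    "<page><title>B</title></page>"
    "</mediawiki>"
)
print(list(iter_titles(sample)))
```

With a real dump you would pass a file object (or a decompressing wrapper around the .bz2 download) as `source`; the key point is that only one `<page>` element is alive at a time.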
