> Can't we cache these entries off to disk, even in one flat file
> (Berk dbm?) and have pointers to the indices to the records into this file?
Sure, but then you have to devise a storage format for that file, and
write it, and read it, and so forth. Using the Python shelve module would
be the simplest approach, I think -- it pickles onto a dbm database, so
you really don't have to invent an externalization format.

> When the parser finishes, unroll these entries into the final pdb, and
> unlink the file(s) from disk. Having to store a recursive array in
> memory of this magnitude is going to really hurt as we scale to
> parallel gathering (if Python/urllib2 can handle this).

No worse than the current design, I think.

> How about we re-think the design, instead of trying to optimize the
> existing design? We've all got great ideas about how this "should"
> work, and we've all written parsers on our own. It would really
> benefit us to standardize on some common design elements and
> implementations across the languages in a "2.0" rewrite of the
> parser family.

Spending time on making sure we can process 25000-node documents within
some given time/space bounds is surely worthwhile, but it seems low on
the priority scale to me. I'd rather focus on functional improvements to
Plucker, like adding CSS and XHTML support. Again, I think that just
re-doing the current main loop in Spider.py, and the associated data
structures, would fix a lot of the current bloat.

The parser family? I wonder whether having more than one parser is a
good idea, considering the problems that having more than one image
parser has given us.

Tell you what -- I'll try making a few explicit garbage-collection mods
to Spider, and see what we can get. I think I can at least cut it down
from holding three versions in memory to two -- and maybe even one at
a time.

Bill
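For reference, here is a minimal sketch of the shelve idea discussed above: parsed entries are written to a dbm-backed shelf keyed by URL, then read back when assembling the final pdb. The file name, keys, and record layout are all hypothetical, not taken from Spider.py.

```python
import shelve

# Open (or create) a dbm-backed shelf; shelve pickles each value for us,
# so no custom externalization format is needed.
db = shelve.open("parsed_entries")
try:
    # Cache a parsed record keyed by its URL (record layout is made up
    # for illustration).
    db["http://example.org/page1"] = {"title": "Page 1", "links": []}

    # Later, when the parser finishes, read the entries back to build
    # the final pdb, then the file(s) can be unlinked from disk.
    record = db["http://example.org/page1"]
    print(record["title"])
finally:
    db.close()
```

The point of the sketch is that only one record at a time needs to live in memory; the shelf holds the rest on disk until the final assembly pass.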
