>       Can't we cache these entries off to disk, even in one flat file
> (Berk dbm?) and have pointers to the indices to the records into this file?

Sure, but then you have to devise a storage format for that file, and
write it, and read it, and so forth.  Using the Python shelve module
would be the simplest, I think -- it pickles entries onto a dbm
database, so you really don't have to invent an externalization format.
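Roughly, the idea would look like this (a sketch only -- the file name
and record layout here are illustrative, not Plucker's actual format):

```python
import shelve

# Pickle each parsed record onto a dbm database keyed by URL, so the
# full record set never has to live in memory at once.
with shelve.open("parsed_records") as cache:
    cache["http://example.com/"] = {"title": "Home", "links": 3}

# Later (or in another process), stream the records back one at a time.
with shelve.open("parsed_records") as cache:
    record = cache["http://example.com/"]
```

shelve handles the pickling and the dbm bookkeeping, so the only design
decision left is what to use as keys.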

> When the parser finishes, unroll these entries into the final pdb, and
> unlink the file(s) from disk. Having to store a recursive array in memory of
> this magnitude is going to really hurt as we scale to parallel gathering (if
> Python/urllib2 can handle this).

No worse than the current design, I think.

>       How about we re-think the design, instead of trying to optimize
> the existing design? We've all got great ideas around how this "should" work,
> and we've all written parsers on our own. It would really benefit to
> standardize on some common design elements and implementations across the
> languages in a "2.0" rewrite of the parser family.

Spending time on making sure we can process 25000 node documents in
some given time/space bounds is surely worthy, but seems low on the
priority scale to me.  I'd rather focus on functional improvements to
Plucker, like adding CSS and XHTML support.  Again, I think that just
re-doing the current main loop in Spider.py, and the associated data
structures, would fix a lot of the current bloat.

The parser family?  I wonder if having more than one parser would be a
good idea, considering the problems that having more than one image
parser has given us.

Tell you what -- I'll try making a few explicit garbage collection mods
to Spider, and see what we can get.  I think I can at least cut it down
from having three versions in memory to only two -- and maybe even one
at a time.
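
Something along these lines, that is -- the phase names here are
made-up stand-ins, not Spider's real functions:

```python
import gc

# Hypothetical stand-ins for Spider's parse and flatten phases.
def parse(text):
    return text.split()

def flatten(tree):
    return [word.upper() for word in tree]

def convert(document_text):
    tree = parse(document_text)
    del document_text          # drop version 1 (the raw text)
    records = flatten(tree)
    del tree                   # drop version 2 (the parse tree)
    gc.collect()               # reclaim any reference cycles right away
    return records

records = convert("a b")
```

The point is just to release each intermediate representation as soon
as the next one is built, rather than letting all three live until the
function returns.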

Bill
