> One would think that it
> shouldn't use that much more memory than the size of the actual
> contents you are trying to convert.

One would, but that would be reckoning without the actual design of
the parser.  There are about three copies of everything (the source,
an abstract parse tree of every page, which consumes an obscene
amount of memory, and a Palm binary form) in memory by the time it's
just about to write the binary file, as well as a number of
additional dictionaries of various stuff attached to each node.  My
conclusion, after wrestling with this last fall, is that a re-write
of Spider.py would help a lot, but that any further optimization
would have to turn it from a single-pass into a multi-pass compiler.
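To illustrate the point (this is a hypothetical sketch, not Plucker's actual Spider.py code): a single-pass converter necessarily keeps the raw source, the parse tree (with a per-node attribute dictionary), and the emitted binary record all alive at once, so peak memory is a multiple of the input size.

```python
from html.parser import HTMLParser

class Node:
    """One parse-tree node; each carries its own attribute dictionary."""
    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs)   # per-node dictionary overhead
        self.children = []
        self.text = []

class TreeBuilder(HTMLParser):
    """Builds an in-memory tree of Node objects from HTML source."""
    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()
    def handle_data(self, data):
        self.stack[-1].text.append(data)

def convert(source: str) -> bytes:
    builder = TreeBuilder()
    builder.feed(source)           # copy 2: the parse tree
    def emit(node):
        out = bytearray(node.tag.encode())
        for chunk in node.text:
            out += chunk.encode()
        for child in node.children:
            out += emit(child)
        return bytes(out)
    binary = emit(builder.root)    # copy 3: the binary form
    # Just before writing out, `source`, the tree in `builder`,
    # and `binary` all coexist in memory.
    return binary

record = convert("<html><body><p>hello</p></body></html>")
```

A multi-pass design could instead free the source after parsing and stream the binary out node by node, which is why that restructuring is where the remaining memory savings would come from.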

> I doubt you will be able to create a Plucker document with that
> many files. The largest Plucker document I have created has about
> 1700 records and it was a PITA to create ;-)

Every night I create a PluckerDoc with about 10,000 links at the
high-water point, but only about 2,300 actual distinct small HTML
pages, each fetched via HTTP from a Web server.  It took about 22
minutes last night.  I've never seen anything particularly painful
about it, except for the time it takes.

Bill
