Guylhem,

This is a fascinating example! Thanks so much for bringing it to our attention. There are a number of things going on here, but I think poking at this will help produce some real improvements in the parser.
One of the major things this illustrates is that breadth-first parsing is a bad idea for some (probably most) HTML clusters. Changing the algorithm from breadth-first to depth-first (by changing 2 characters in the Spider.py file :-) made an amazing amount of difference: you go from generally "9000 collected, 400000 to do" to generally "15000 collected, 52 to do". I think I'm going to leave the default behavior at depth-first, with a switch to do breadth-first if desired.

By the way, I'm still not convinced that there's a memory "leak". The parser does grow steadily while parsing this, but on the other hand, there's also 226 MB of HTML there to be parsed, full of very short paragraphs and hundreds of thousands of links. Considering that we're writing style information at the beginning of every paragraph, that we use a very abstract syntax composed of objects for holding the parsed form of each page, and that we keep all the pages in memory till we finish the whole set, I wouldn't be surprised if the amount of memory used to store the parse trees was 5-10 times the size of the HTML -- that would be about 2.3 GB, if it was really 10x.

We might be able to shrink the bloat by modifying the way PyPlucker.PluckerDocs.PluckerTextParagraph (enough "Plucker" in that for you?) stores the parsed form.

So, I'm curious. Suppose I succeeded in modifying the parser to make it possible to reduce this 226 MB of HTML down to, say, a 23 MB (just guessing) Plucker document. What would you do with it? Put it on a memory card and open it with VFS? Would the viewer handle that?

Bill

> On Mon, 3 Jun 2002, David A. Desrosiers wrote:
> > There's a huge memory leak here, I want to nail it.
>
> The further I went, on a huge company machine with a reduced set of files,
> was writing to the pdb file after parsing. But then again it was taking
> more and more memory and failed.
> > IMHO, the memory leak is not only with the spider - you may encounter it > again later when the files are written into biam.pdb
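
P.S. For anyone curious about the 2-character change mentioned above: it's plausibly just switching which end of the crawl frontier you pop from. A minimal sketch below -- the names and the toy link graph are illustrative only, not Plucker's actual Spider.py internals:

```python
from collections import deque

# Toy link graph standing in for fetched pages; in a real spider the
# neighbours would come from parsing each page's links.
LINKS = {
    "index": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [], "d": [], "e": [],
}

def crawl(start, depth_first=True):
    """Collect pages reachable from `start`.

    The frontier is a deque: popping from the right end gives
    depth-first (stack) order, popping from the left end gives
    breadth-first (queue) order.  That's the whole difference.
    """
    frontier = deque([start])
    seen = set()
    order = []
    while frontier:
        page = frontier.pop() if depth_first else frontier.popleft()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Queue up this page's outgoing links for later collection.
        frontier.extend(LINKS.get(page, []))
    return order
```

Depth-first keeps the frontier small because each page's links are consumed before moving on, instead of accumulating every sibling's links in the "to do" pile the way breadth-first does.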
