Guylhem,

This is a fascinating example! Thanks so much for bringing it to our attention. There are a number of things going on here, but I think poking at this will help produce some real improvements in the parser.
One of the major things this illustrates is that breadth-first parsing is a bad idea for some (probably most) HTML clusters. Changing the algorithm from breadth-first to depth-first (by changing 2 characters in the Spider.py file :-) made an amazing amount of difference: you go from generally "9000 collected, 400000 to do" to generally "15000 collected, 52 to do". I think I'm going to leave the default behavior at depth-first, with a switch to do breadth-first if desired.

By the way, I'm still not convinced that there's a memory "leak". The parser does grow steadily while parsing this, but on the other hand, there's also 226 MB of HTML there to be parsed, full of very short paragraphs and hundreds of thousands of links. Considering that we're writing style information at the beginning of every paragraph, that we use a very abstract syntax composed of objects for holding the parsed form of each page, and that we keep all the pages in memory till we finish the whole set, I wouldn't be surprised if the amount of memory used to store the parse trees was 5-10 times the size of the HTML -- that would be about 2.3 GB, if it was really 10x.

We might be able to shrink the bloat by modifying the way PyPlucker.PluckerDocs.PluckerTextParagraph (enough "Plucker" in that for you?) stores the parsed form.

So, I'm curious. Suppose I succeeded in modifying the parser to make it possible to reduce this 226 MB of HTML down to, say, a 23 MB (just guessing) Plucker document. What would you do with it? Put it on a memory card and open it with VFS? Would the viewer handle that?

Bill

> On Mon, 3 Jun 2002, David A. Desrosiers wrote:
> > There's a huge memory leak here, I want to nail it.
>
> The further I went, on a huge company machine with a reduced set of files,
> was writing to the pdb file after parsing. But then again it was taking
> more and more memory and failed.
> > IMHO, the memory leak is not only with the spider - you may encounter it > again later when the files are written into biam.pdb
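
P.S. For anyone curious about the 2-character change mentioned above: it's plausibly just switching which end of the crawl frontier you pop from. A minimal sketch below -- the names and the toy link graph are illustrative only, not Plucker's actual Spider.py internals:

```python
from collections import deque

# Toy link graph standing in for fetched pages; in a real spider the
# neighbours would come from parsing each page's links.
LINKS = {
    "index": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [], "d": [], "e": [],
}

def crawl(start, depth_first=True):
    """Collect pages reachable from `start`.

    The frontier is a deque: popping from the right end gives
    depth-first (stack) order, popping from the left end gives
    breadth-first (queue) order.  That's the whole difference.
    """
    frontier = deque([start])
    seen = set()
    order = []
    while frontier:
        page = frontier.pop() if depth_first else frontier.popleft()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Queue up this page's outgoing links for later collection.
        frontier.extend(LINKS.get(page, []))
    return order
```

Depth-first keeps the frontier small because each page's links are consumed before moving on, instead of accumulating every sibling's links in the "to do" pile the way breadth-first does.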
