Micah Cowan <[EMAIL PROTECTED]> writes:

> I agree that it's probably a good idea to move HTML parsing to a model
> that doesn't require slurping everything into memory;

Note that Wget mmaps the file whenever possible, so it's not actually
allocated on the heap (slurped).  You need some memory to store the
URLs found in the file, but that's not really avoidable.  I agree that
it would be better to completely avoid the memory-based model, as it
would allow links to be extracted on the fly, without saving the file
at all.  It would be an interesting exercise to write or integrate a
parser that works like that.
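
To make the idea concrete, a streaming extractor could look roughly
like the sketch below.  This is not Wget code; the state machine, the
MAX_URL cap, and the restriction to quoted href attributes are all
simplifications invented for the example.  The point is only that the
parser sees one chunk at a time and never needs the whole document in
memory:

/* stream_links.c -- illustrative only, not Wget code.
   Read stdin in fixed-size chunks and feed each byte to a tiny state
   machine that recognizes href="..." and href='...' attributes, so at
   most one URL buffer is ever held in memory.  Comments, scripts,
   entities and unquoted values are deliberately ignored.  */

#include <stdio.h>
#include <ctype.h>

#define MAX_URL 2048            /* arbitrary cap on one extracted link */

int
main (void)
{
  static const char needle[] = "href=";
  enum { SCAN, MATCH, VALUE } state = SCAN;
  size_t matched = 0;           /* how much of "href=" has been seen */
  char url[MAX_URL + 1];
  size_t len = 0;
  int quote = 0;                /* quote character that opened the value */
  char buf[BUFSIZ];
  size_t n, i;

  while ((n = fread (buf, 1, sizeof buf, stdin)) > 0)
    for (i = 0; i < n; i++)
      {
        int c = (unsigned char) buf[i];
        switch (state)
          {
          case SCAN:
            if (tolower (c) == needle[0])
              {
                state = MATCH;
                matched = 1;
              }
            break;
          case MATCH:
            if (tolower (c) == needle[matched])
              {
                if (needle[++matched] == '\0')
                  {
                    state = VALUE;
                    quote = 0;
                    len = 0;
                  }
              }
            else if (tolower (c) == needle[0])
              matched = 1;      /* this 'h' may start a new match */
            else
              state = SCAN;
            break;
          case VALUE:
            if (!quote)
              {
                if (isspace (c))
                  break;        /* allow whitespace before the quote */
                if (c == '"' || c == '\'')
                  quote = c;
                else
                  state = SCAN; /* unquoted value; skipped for brevity */
              }
            else if (c == quote)
              {
                url[len] = '\0';
                printf ("%s\n", url);
                state = SCAN;
              }
            else if (len < MAX_URL)
              url[len++] = (char) c;
            else
              state = SCAN;     /* absurdly long "link"; discard it */
            break;
          }
      }
  return 0;
}

Something along those lines could just as well be fed data as it
arrives from the network rather than from stdin.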

Regarding limits on file size, I don't think they are a good idea.
Whichever limit one chooses, someone will find a valid use case broken
by the limit.  Even an arbitrary limit I thought entirely reasonable,
such as the maximum redirection count, recently turned out to be
broken by design.  In this case it might make sense to investigate
exactly where and why the HTML parser spends so much memory; perhaps the
parser saw something it thought was valid HTML and tried to extract a
huge "link" from it?  Maybe the parser simply needs to be taught to
perform sanity checks on URLs it encounters.
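
Such a check could be as simple as something like the following; the
4096-byte limit and the rejection criteria are made up for
illustration, not taken from Wget's code:

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stddef.h>

/* Arbitrary but generous cap; genuine links are rarely this long.  */
#define URL_SANITY_LIMIT 4096

/* Return 1 if URL of length LEN looks like something worth keeping.
   Illustrative only; the real checks would live in the parser.  */
static int
url_looks_sane (const char *url, size_t len)
{
  size_t i;

  if (len == 0 || len > URL_SANITY_LIMIT)
    return 0;

  /* Control characters or whitespace inside a "link" usually mean the
     parser misread binary data or ran past the end of an attribute.  */
  for (i = 0; i < len; i++)
    if (iscntrl ((unsigned char) url[i]) || isspace ((unsigned char) url[i]))
      return 0;

  return 1;
}

int
main (void)
{
  const char *good = "http://example.com/index.html";
  const char *bad = "http://example.com/\n<garbage>";
  printf ("%d %d\n",
          url_looks_sane (good, strlen (good)),
          url_looks_sane (bad, strlen (bad)));
  return 0;
}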
