Micah Cowan <[EMAIL PROTECTED]> writes:

> I agree that it's probably a good idea to move HTML parsing to a model
> that doesn't require slurping everything into memory;
Note that Wget mmaps the file whenever possible, so the contents are not actually allocated on the heap ("slurped"). You do need some memory to store the URLs found in the file, but that is hard to avoid. I agree that it would be better to move away from the memory-based model entirely: that would allow links to be extracted on the fly, without saving the file at all. Writing or integrating a parser that works that way would be an interesting exercise.

Regarding limits on file size, I don't think they are a good idea. Whatever limit one chooses, someone will find a valid use case broken by it. Even an arbitrary limit I thought entirely reasonable, such as the maximum redirection count, recently turned out to be broken by design.

In this case it might make more sense to investigate exactly where and why the HTML parser spends the memory; perhaps the parser saw something it mistook for valid HTML and tried to extract a huge "link" from it? Maybe the parser simply needs to be taught to perform sanity checks on the URLs it encounters.
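To illustrate the mmap point, here is a minimal sketch (not Wget's actual code; the function names are made up for this example) of mapping a saved file read-only so a parser can scan it in place. The mapping is backed by the page cache rather than the heap; only the URLs extracted from it need allocated memory:

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical sketch: map FILENAME read-only and return a pointer
   to its contents, storing the length in *SIZE.  On any failure,
   return NULL so the caller can fall back to reading the file into
   a buffer instead. */
static char *
map_file (const char *filename, size_t *size)
{
  struct stat st;
  int fd = open (filename, O_RDONLY);
  if (fd < 0)
    return NULL;
  /* mmap of a zero-length file fails with EINVAL, so treat that as
     "use the fallback path" too. */
  if (fstat (fd, &st) < 0 || st.st_size == 0)
    {
      close (fd);
      return NULL;
    }
  *size = (size_t) st.st_size;
  char *p = mmap (NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
  close (fd);                   /* the mapping survives the close */
  return p == MAP_FAILED ? NULL : p;
}

static void
unmap_file (char *contents, size_t size)
{
  munmap (contents, size);
}
```

The caller scans `contents` directly and calls `unmap_file` when done; nothing file-sized is ever copied onto the heap.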
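As a rough illustration of the kind of sanity check I mean (this is not a proposed patch, and the 2048-byte cap is an arbitrary value chosen purely for the example), the parser could reject candidate "links" that are implausibly long or that contain bytes never legal in a raw URL, so a stretch of binary garbage mistaken for an attribute value cannot balloon memory use:

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed cap for illustration only; as noted above, any fixed
   limit is debatable and will break someone's use case. */
#define MAX_SANE_URL_LEN 2048

/* Return true if the LEN bytes at URL could plausibly be a URL:
   nonempty, not absurdly long, and free of control characters and
   spaces, which are not legal in an unescaped URL. */
static bool
url_looks_sane (const char *url, size_t len)
{
  if (len == 0 || len > MAX_SANE_URL_LEN)
    return false;
  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = (unsigned char) url[i];
      if (c <= 0x20 || c == 0x7f)
        return false;
    }
  return true;
}
```

A check like this would run before the extracted string is copied and stored, so a bogus multi-megabyte "link" is discarded instead of allocated.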