Micah Cowan <[EMAIL PROTECTED]> writes:

> Yes, but when mmap()ping with MAP_PRIVATE, once you actually start
> _using_ the mapped space, is there much of a difference?

As long as you don't write to the mapped region, there should be no
difference between shared and private mappings -- that's exactly what
copy-on-write (explicitly documented for MAP_PRIVATE in both the Linux
and Solaris mmap man pages) is supposed to accomplish.  I could have
used MAP_SHARED, but at the time I believe there was still code that
relied on being able to write to the buffer.  That code was
subsequently removed, but MAP_PRIVATE stayed because I saw no point in
removing it.  Given the semantics of copy-on-write, I figured there
would be no difference between MAP_SHARED and a MAP_PRIVATE mapping
that is never written to.
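
For concreteness, here is a minimal sketch of the kind of read-only
MAP_PRIVATE mapping I mean (illustrative only -- the function name and
error handling are mine, not Wget's actual code):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map NAME read-only with MAP_PRIVATE.  Since the mapping is
       never written to, no page is ever copied, and it behaves
       exactly like MAP_SHARED.  */
    char *
    map_file (const char *name, size_t *size)
    {
      struct stat st;
      int fd = open (name, O_RDONLY);
      if (fd < 0)
        return NULL;
      if (fstat (fd, &st) < 0)
        {
          close (fd);
          return NULL;
        }
      char *buf = mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE,
                        fd, 0);
      close (fd);               /* the mapping survives the close */
      if (buf == MAP_FAILED)
        return NULL;
      *size = st.st_size;
      return buf;
    }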

As for the memory footprint getting large: sure, Wget reads through
the whole file, but that is no different from what, say, `grep --mmap'
does.  As long as we don't jump backwards in the file, the OS can
reclaim the pages we've already read.  Another difference between mmap
and malloc is that mmap'ed space can be reliably returned to the
system.  Using mmap pretty much guarantees that Wget's footprint won't
grow to 1GB unless you're actually reading a 1GB file, and even then
much less will be resident.
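
To illustrate the contrast (a hypothetical sketch, not Wget code):
memory released with free() may stay in the process's heap for reuse,
whereas munmap'ed pages go back to the kernel immediately:

    #include <stdlib.h>
    #include <sys/mman.h>

    void
    footprint_demo (size_t n)
    {
      /* Heap allocation: after free(), the allocator may well keep
         the pages around for reuse, so the process footprint need
         not shrink.  */
      char *p = malloc (n);
      free (p);

      /* Anonymous mapping: after munmap(), the pages are returned
         to the kernel unconditionally, and the footprint shrinks
         right away.  (MAP_ANONYMOUS is spelled MAP_ANON on some
         systems.)  */
      char *q = mmap (NULL, n, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (q != MAP_FAILED)
        munmap (q, n);
    }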

> mmap() isn't failing; but wget's memory space gets huge through the
> simple use of memchr() (on '<', for instance) on the mapped address
> space.

Wget's virtual memory footprint does get huge, but the resident set
needn't.  memchr accesses memory strictly sequentially, so the
page-reclamation scenario above applies.  More importantly, the report
in this case documents "failing to allocate -2147483648 bytes", which
is a malloc or realloc error, completely unrelated to mapped files.
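
Note that -2147483648 is INT_MIN, which strongly suggests a length
computed in a signed 32-bit int overflowing before being handed to the
allocator.  A hypothetical illustration (not the actual Wget code
path):

    #include <stdio.h>
    #include <stdlib.h>

    int
    main (void)
    {
      /* Hypothetical: a parser doubles a buffer length kept in a
         signed int.  On typical two's-complement systems 2^30 * 2
         wraps to INT_MIN, i.e. -2147483648 (strictly speaking the
         overflow is undefined behavior).  */
      int len = 1 << 30;        /* 1073741824 */
      int size = len * 2;       /* wraps to -2147483648 */

      /* Passed to malloc, the negative int converts to an enormous
         size_t, the allocation fails, and the error message prints
         the signed value.  */
      if (malloc (size) == NULL)
        fprintf (stderr, "failed to allocate %d bytes\n", size);
      return 0;
    }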

> Still, perhaps a better way to approach this would be to use some
> sort of heuristic to determine whether the file looks to be
> HTML. Doing this reliably without breaking real HTML files will be
> something of a challenge, but perhaps requiring that we find
> something that looks like a familiar HTML tag within the first 1k or
> so would be appropriate. We can't expect well-formed HTML, of
> course, so even requiring an <HTML> tag is not reasonable: but
> finding any tag whatsoever would be something to start with.

I agree in principle, but I'd still like to know exactly what went
wrong in the reported case.  I suspect it's not just a matter of
mmapping a huge file, but of misparsing it, for example by attempting
to extract a "URL" hundreds of megabytes long.
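
For what it's worth, the heuristic you describe could start out as
simple as the sketch below (illustrative only -- the function and the
1k threshold are mine): scan the beginning of the file for anything
shaped like a tag.

    #include <ctype.h>
    #include <stddef.h>

    /* Report whether BUF (the first LEN bytes of the file) contains
       something shaped like an HTML tag -- '<' followed by a letter,
       '/', or '!' -- within the first 1k.  */
    static int
    looks_like_html (const char *buf, size_t len)
    {
      size_t i, limit = len < 1024 ? len : 1024;
      for (i = 0; i + 1 < limit; i++)
        if (buf[i] == '<'
            && (isalpha ((unsigned char) buf[i + 1])
                || buf[i + 1] == '/' || buf[i + 1] == '!'))
          return 1;
      return 0;
    }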
