Re: [Clamav-devel] scanning large zip files -> file loaded entirely into (RSS) memory during extraction?

2011-12-12 Thread aCaB
On 12/12/11 21:31, Bram wrote:
> * a zip file which contains 4 files of 100 MB.
> * max-filesize set to 1GB
> * max-scansize set to 4GB

> What is the expected value of the RSS during the scan?

About MIN(zip_size(100MB), 1GB, 4GB), where zip_size() is the size of
the compressed stream.

Hope it's clear now.

Cheers,
--aCaB
___
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net


Re: [Clamav-devel] scanning large zip files -> file loaded entirely into (RSS) memory during extraction?

2011-12-12 Thread Bram

>> Is it expected that the RSS memory increases with approximately the size
>> of the zip file before extracting it?
>
> No. The RSS memory size will increase at most by the smaller of
> - compressed internal file size
> - max-filesize (the file is skipped if the value's exceeded)
> - max-scansize (the scan is aborted if the value's exceeded)


Can you clarify 'compressed internal file size' a bit further?
Assume:
* a zip file which contains 4 files of 100 MB.
* max-filesize set to 1GB
* max-scansize set to 4GB
* RSS before the scan +/- 115 MB.

What is the expected value of the RSS during the scan?
* somewhere between 100MB and 200MB or
* somewhere between 200MB and 300MB or
* somewhere between 300MB and 400MB or
* somewhere between 400MB and 500MB or
* somewhere between 500MB and 600MB?


Best regards,

Bram




Re: [Clamav-devel] scanning large zip files -> file loaded entirely into (RSS) memory during extraction?

2011-12-12 Thread aCaB
On 12/12/11 15:34, Bram wrote:
> Is this implicit relation only for zip files or also for other
> compression/archive formats? (I tested bzip2 and gzip and there the RSS
> does not increase - or at least not that much)

I'm sure it affects other formats too, but I can't tell which off the top
of my head, sorry.

> Just to make sure: what part of the code would need refactoring?

The unz() function.

Cheers,
--aCaB


Re: [Clamav-devel] scanning large zip files -> file loaded entirely into (RSS) memory during extraction?

2011-12-12 Thread Bram

Quoting aCaB :



>> Is this necessary? (Tools such as unzip are able to decompress the file
>> without loading the entire file into memory).
>
> Not necessary but it makes the code much simpler in the default case
> scenario as, with the default settings, the allocated memory is way
> below CLI_MAX_ALLOCATION.
> I agree that the implicit relation between the alloc size and the
> max-(scan|file)size setting could be better documented.


Is this implicit relation only for zip files or also for other  
compression/archive formats? (I tested bzip2 and gzip and there the  
RSS does not increase - or at least not that much)



> But frankly I don't see a point in refactoring the code to handle
> insanely large values of max-XXXsize in a way that is memory efficient.
> After all we're targeting malware which is on avg 2-300 KB in size.


Just to make sure: what part of the code would need refactoring? I  
*assume* it's (only) the decompression code and not the rest of the  
code since the extraction code creates a temp file (which, I assume,  
is later mmap'ed)



Best regards,

Bram




Re: [Clamav-devel] scanning large zip files -> file loaded entirely into (RSS) memory during extraction?

2011-12-12 Thread aCaB
On 12/12/11 09:31, Bram wrote:
> Questions:
> Is it expected that the RSS memory increases with approximately the size
> of the zip file before extracting it?

No. The RSS memory size will increase at most by the smaller of
- compressed internal file size
- max-filesize (the file is skipped if the value's exceeded)
- max-scansize (the scan is aborted if the value's exceeded)

> Is this necessary? (Tools such as unzip are able to decompress the file
> without loading the entire file into memory).

Not necessary but it makes the code much simpler in the default case
scenario as, with the default settings, the allocated memory is way
below CLI_MAX_ALLOCATION.
I agree that the implicit relation between the alloc size and the
max-(scan|file)size setting could be better documented.
But frankly I don't see a point in refactoring the code to handle
insanely large values of max-XXXsize in a way that is memory efficient.
After all we're targeting malware which is on avg 2-300 KB in size.

> What is the purpose of fmap/fmap_need_ptr_once?

That's basically an alloc and read.

HtH,
--aCaB


[Clamav-devel] scanning large zip files -> file loaded entirely into (RSS) memory during extraction?

2011-12-12 Thread Bram

Hi,

I'm having a 'problem' with scanning a large zip file.
What appears to be happening is that clamav loads the entire file into  
memory (RSS not VSZ) and then unzips the file.


The increase in VSZ memory is expected since the file is mmap'ed; the  
increase in RSS memory is/was not expected (by me).
When scanning a .bz2 or a .gz file then this does not happen. The VSZ  
increases but the RSS does not (or at least not that much).


My test:
$ dd if=/dev/urandom of=test_400rand.bin bs=1M count=400
$ zip test_400rand.bin.zip test_400rand.bin
$ bzip2 --keep test_400rand.bin
$ gzip test_400rand.bin

The resulting files:
- test_400rand.bin: 400 MB
- test_400rand.bin.zip: 401 MB
- test_400rand.bin.bz2: 402 MB
- test_400rand.bin.gz: 401 MB



Scanning with clamav version 0.97.3 (compiled from source):
$ clamscan --max-filesize=5 --max-scansize=40 --scan-archive=yes test_400rand.bin.bz2

Start: VSZ: 114MB, RSS: 105MB
Scanning: VSZ: 520MB, RSS: 114MB (mmap of archive file)
Scanning-2: VSZ: 917 MB, RSS: 119 MB (mmap of extracted file)

$ clamscan --max-filesize=5 --max-scansize=40 --scan-archive=yes test_400rand.bin.gz

Start: VSZ: 115MB, RSS: 105MB
Scanning: VSZ: 515MB, RSS: 114 MB (mmap of archive file)
Scanning-2: VSZ: 916 MB, RSS: 121 MB (mmap of extracted file)

$ clamscan --max-filesize=5 --max-scansize=40 --scan-archive=yes test_400rand.bin.zip

Start: VSZ: 113MB, RSS: 105MB
Scanning: VSZ: 515MB, RSS: 111MB (mmap of archive file)
Scanning-2: VSZ: 515MB, RSS: 506MB (???)
Scanning-3: VSZ: 916MB, RSS: 514MB (mmap of extracted file)


I looked at the source code and the increase in (RSS) memory happens  
in libclamav/unzip.c in the function 'lhdr':
  if (!csize) { /* FIXME: what's used for method0 files? csize or usize?
                   Nothing in the specs, needs testing */
      cli_dbgmsg("cli_unzip: lh - skipping empty file\n");
  } else {
      if (zsize < csize) {
          /* ... */
      } else {
          /* ... */
          *ret = unz(zip, csize, usize, LH_method, LH_flags, fu, ctx, tmpd);
      }
      zip += csize;
      zsize -= csize;
  }

The call to 'fmap_need_ptr_once' causes an increase in the RSS memory  
of about 400MB... (csize = 419.498.208, file size = 419.498.372)




Questions:
Is it expected that the RSS memory increases with approximately the  
size of the zip file before extracting it?


Is this necessary? (Tools such as unzip are able to decompress the  
file without loading the entire file into memory).


What is the purpose of fmap/fmap_need_ptr_once?




Best regards,

Bram

