On 2009-04-26 18:21, Tom Shaw wrote:
> .ndb questions
>
> TargetType is confusing and very unclear.
>
>       Type 2 What exactly is type 2. I first read this and thought 
> it was OLE executables but further reading indicates it might also 
> include Excel, Word VB and other Microsoft files. True?

Type 2 is for files contained inside an OLE container. This includes
files inside Excel, Word, etc. since those are OLE2 containers too.
These files can be images, embedded executables, VBA scripts, ...

>  Are they 
> normalized?
>   

VBA macros are decoded, everything else is simply extracted and scanned
with type 2 signatures.

>       Type 3 What exactly is normalized HTML? 

Whitespace is transformed to spaces, tags and tag attributes are
normalized, and everything is lowercased.
Run clamscan --leave-temps --tempdir=. yourfile.html, and look at the
files created.

> What happens to non 
> ascii/Latin encodings? UTF? 

HTML entities and character references are decoded. Everything in the
ASCII range (<0x80) is output as is; the rest is transformed to &#xNNNN;

> Line terminators (\r,\r\n,\n)?

All consecutive whitespace is replaced with a single space.
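
The HTML answers above (whitespace, lowercasing, entity decoding,
non-ASCII output) can be roughly sketched in Python. This is only an
illustration, not ClamAV's normalizer: the function name is made up,
tag/attribute normalization is left out, and the four-digit lowercase hex
in &#xNNNN; is an assumption about the exact output format.

```python
import html
import re

# ASCII whitespace only, so that a decoded non-breaking space is not
# treated as whitespace by this sketch.
_ASCII_WS = re.compile(r"[ \t\r\n\f\v]+")


def normalize_html_text(raw: str) -> str:
    """Hypothetical helper approximating the steps described above."""
    decoded = html.unescape(raw)             # entities and char references
    lowered = decoded.lower()                # everything lowercased
    collapsed = _ASCII_WS.sub(" ", lowered)  # whitespace runs -> one space
    # ASCII (<0x80) is kept as is; the rest becomes &#xNNNN;
    # (digit count and case are assumptions).
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):04x};" for ch in collapsed
    )


if __name__ == "__main__":
    print(normalize_html_text("<B>Caf&eacute;</B>\r\n\tX"))
```

To see what ClamAV really produces, the clamscan --leave-temps run
mentioned earlier remains the authoritative check.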

>  PHP, 
>   

No special treatment for PHP.

> Javascript 

Javascript is normalized too:
- strings are normalized (hex encoding is decoded)
- numbers are parsed and normalized
- local variables/function names are normalized to 'n001' format
- argument to eval() is parsed as JS again
- unescape() is handled
- some simple JS packers are handled
- output is whitespace normalized
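
Two of the steps in this list (hex-encoded strings and unescape()) can be
sketched in Python as follows. The function names are hypothetical and
this is not ClamAV's parser: it works on bare text with regular
expressions, does not tokenize JS, and ignores corner cases such as an
escaped backslash before "x".

```python
import re

_HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")
_PERCENT = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")


def decode_hex_escapes(string_body: str) -> str:
    """Turn \\xNN escapes inside a JS string body into plain characters."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), string_body)


def js_unescape(text: str) -> str:
    """Approximate JS unescape(): decode %NN and %uNNNN sequences."""
    def replace(match: re.Match) -> str:
        return chr(int(match.group(1) or match.group(2), 16))
    return _PERCENT.sub(replace, text)


if __name__ == "__main__":
    print(decode_hex_escapes(r"\x65\x76\x61\x6c"))
    print(js_unescape("%61%6c%65%72%74%u0028"))
```

The point of such decoding is that a signature can be written against the
plain text rather than against every possible encoding of it.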

> and html escapes and html entities? Does this type get 
> applied to Mail?

It gets applied to the HTML part of mail (if any).

>  Does this type get applied to Mail when there are no 
> HTML MIME sections? What other files is it applied to?
>   

Applied to any file that matches the HTML filetype signature in daily.ftm.

>       Type 4 signatures appear not to operate on any file that 
> doesn't look like an 2821 document.

Type 4 is for mails, yes.

>  Is this true? Are the internal 
> encoding (such as QP or B64) decoded before applying signatures? In 
> QP are =\r\n removed? For 8-bit mail what is done for the non-ascii 
> encodings? Upper/lower case? Line terminators (\r,\r\n,\n)? Should 
> UTF be considered? 

Mail body is decoded (quoted-printable, etc.), but no normalization is done.
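
As an illustration of "decoded but not normalized", here is a small
Python sketch using the standard quopri and base64 modules. It shows
generic MIME transfer decoding, not ClamAV's own mail parser, and the
function name is hypothetical.

```python
import base64
import quopri


def decode_mail_body(body: bytes, transfer_encoding: str) -> bytes:
    """Undo the Content-Transfer-Encoding and change nothing else."""
    cte = transfer_encoding.strip().lower()
    if cte == "quoted-printable":
        # Soft line breaks ("=" followed by CRLF) disappear during decoding.
        return quopri.decodestring(body)
    if cte == "base64":
        return base64.b64decode(body)
    # 7bit, 8bit, binary: returned untouched (case, line ends, non-ASCII).
    return body


if __name__ == "__main__":
    print(decode_mail_body(b"hello=\r\nworld=3D1", "quoted-printable"))
    print(decode_mail_body(b"SGVsbG8=", "base64"))
```

Since no normalization follows, a type 4 signature has to match the
decoded bytes exactly, including case and line terminators.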

> If type 4 is for only 2821 mail format, is Type 7 
> for all "text and script files" including mail?
>   

Type 7 is applied only if the file is not detected as the HTML filetype
(type 3).

>       Type 5 I assume are binary files such as jpg, png, tiff, swf mov, etc?
>   

Yes.

>       Type 7 What does normalized mean? What happens for characters 
> above 127 or for UTF? Line terminators (\r,\r\n,\n)?

Whitespace is normalized, ASCII characters <127 are output, everything
else is stripped from the output.
Hence it works on UTF16/32 variants too.
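
A rough Python sketch of why stripping helps with UTF-16: the interleaved
NUL bytes and the BOM fall away and the ASCII text remains. This is an
assumption-laden illustration, not ClamAV's code; in particular it assumes
that NUL and other control bytes are dropped along with bytes >= 127, and
the function name is made up.

```python
_WHITESPACE = b" \t\r\n\f\v"


def normalize_text(data: bytes) -> bytes:
    """Collapse whitespace runs and keep only printable ASCII bytes."""
    out = bytearray()
    previous_was_space = False
    for byte in data:
        if byte in _WHITESPACE:
            if not previous_was_space:
                out.append(0x20)          # one space per whitespace run
            previous_was_space = True
        elif 0x20 < byte < 0x7F:
            out.append(byte)              # printable ASCII is kept
            previous_was_space = False
        # Anything else (NUL, controls, bytes >= 127) is dropped (assumed).
    return bytes(out)


if __name__ == "__main__":
    print(normalize_text(b"a  b\r\nc"))
    print(normalize_text("a  b\r\nc".encode("utf-16")))
```

Under these assumptions both inputs give the same output, so one type 7
signature would cover the plain and the UTF-16 variant.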

>  Does this type 
> get applied to Mail as well? What other files is it applied to?
>   

Applied to any file detected as text whose size doesn't exceed a
threshold.


>       Clamdocs specify clam having special processing for Office, 
> RTF and PDF as well as HTML yet there are no "normalized" nor 
> non-"normalized" types for these file formats.
>   

Office files: VBA extracted => type 2
RTF: embedded files extracted => no special type needed, type 1 sigs can
be used for executables, etc.

Type 3 is normalized HTML.

>       I assume that signatures of these types are applied to both 
> uncompressed and compressed versions of the file.
>   

You mean stored inside zip, rar, etc.? Files are extracted from archives
first, then each archive member is scanned with the appropriate
signatures for its filetype.

> Wildcards
>
>       Would be nice to have a wildcard that allowed a range of 
> matching like regex *{6,8}
>   

There are already range wildcards: {6-8}

>       Would be nice to be able to have wildcards to match ASCII 
> numbers and ASCII letters.
>
>   

You can use (30|31|32|33|34|35|36|37|38|39) for numbers, and matches
for letters can be constructed using multiple signatures in a logical
signature.
However, if it's any letter, why not just skip over it with ?? instead?
What is the added benefit of matching [a-zA-Z]?
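
To make the wildcard semantics concrete, here is a hedged Python sketch
that translates a small subset of the hex signature syntax (hex pairs,
??, *, {n}, {n-m} and (aa|bb) alternatives) into a bytes regex. The
function name is hypothetical, nibble wildcards and other syntax are not
handled, and ClamAV's matcher does not work this way internally; it is
only meant for reasoning about what a signature body would match.

```python
import re

_TOKEN = re.compile(
    r"\{(\d+)-(\d+)\}|\{(\d+)\}|\?\?|\*|[()|]|[0-9a-fA-F]{2}"
)


def sig_to_regex(sig: str) -> re.Pattern:
    """Translate a subset of hex signature syntax into a bytes regex."""
    parts = []
    pos = 0
    while pos < len(sig):
        m = _TOKEN.match(sig, pos)
        if m is None:
            raise ValueError(f"unsupported token at offset {pos}")
        tok = m.group(0)
        if m.group(1) is not None:        # {n-m}: between n and m bytes
            parts.append(b".{%d,%d}" % (int(m.group(1)), int(m.group(2))))
        elif m.group(3) is not None:      # {n}: exactly n bytes
            parts.append(b".{%d}" % int(m.group(3)))
        elif tok == "??":                 # any single byte
            parts.append(b".")
        elif tok == "*":                  # any number of bytes
            parts.append(b".*?")
        elif tok in "()|":                # alternatives
            parts.append(b"(?:" if tok == "(" else tok.encode())
        else:                             # literal byte given as hex pair
            parts.append(re.escape(bytes.fromhex(tok)))
        pos = m.end()
    return re.compile(b"".join(parts), re.DOTALL)


if __name__ == "__main__":
    digit = "(30|31|32|33|34|35|36|37|38|39)"
    pattern = sig_to_regex("6964{1-3}" + digit + "??7a")
    print(bool(pattern.search(b"xxid--7Qz")))
    print(bool(pattern.search(b"id7Qz")))
```

The example signature reads: "id", then 1 to 3 arbitrary bytes, then one
ASCII digit, then any byte, then "z".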

> Thanks for any and all clarifications and insights.
>
> For clamav/sourcefire folks, if any answer to above is no, could you 
> consider adding in the future?
>   

Please open a bug report marked as enhancement if you want a feature that
is not implemented, and describe why that feature would be useful.

Best regards,
--Edwin

