Hi Eric,

Thank you for the in-depth analysis of the PDF scanning speed issue.

We took a look at the bytecode (BC) signatures and, weighing their performance 
impact against the value of their detections, decided to drop them.  You 
should have seen them removed in yesterday's update to the bytecode.cvd 
database.  I'm hopeful that this mostly resolves the concern about slow PDF 
scans.

With regard to your analysis of the PDF object dictionary parsing, I could use 
your help.  You mention that the state is not reset when looking for object 
dictionary keys, which makes the key ordering matter.  You implied that this 
causes ClamAV's PDF parser to fail to extract (dump) some images.  We should 
fix it so that it correctly extracts every image, as image detection is 
very useful in identifying phishing documents and other malicious documents and 
emails.

If you have any specific recommendations for fixing this issue, we would 
appreciate it.

Also, if you have sample files that I could debug which illustrate the image 
extraction issue you described, I would appreciate a copy.

On a side note, we will be looking into using pdfium or another third-party 
PDF parser in the future in order to improve detection and performance.  It is 
possible that we will replace our own PDF parser partially or entirely, 
depending on the results of this investigation.  I mention this so that you do 
not spend a tremendous amount of effort on this issue.

Regards,
Micah


Micah Snyder (they/them)
ClamAV Development
Talos
Cisco Systems, Inc.
________________________________
From: clamav-users <clamav-users-boun...@lists.clamav.net> on behalf of Eric 
Zhou via clamav-users <clamav-users@lists.clamav.net>
Sent: Thursday, February 22, 2024 2:29 PM
To: clamav-users@lists.clamav.net <clamav-users@lists.clamav.net>
Cc: Eric Zhou <eric.z...@five9.com>
Subject: [clamav-users] Slow PDF Scanning pt 3.


Hi ClamAV team and users,



This is a follow-up to my previous posts, which can be found here 
<https://lists.clamav.net/pipermail/clamav-users/2024-February/013744.html> 
and here 
<https://lists.clamav.net/pipermail/clamav-users/2024-February/013744.html>. 
I wanted to give a summary and make sure the problem identified is clear.



My team and I have noticed that ClamAV can be very slow when scanning certain 
PDF files. When we investigated, we found a potential root cause in the ClamAV 
source code. The function at 
https://github.com/Cisco-Talos/clamav/blob/5f934c16b47591157a7082b71e751c45f095e2c8/libclamav/pdf.c#L1984 
handles PDF document tags. It keeps a state so that it can properly handle tags 
that require parameters. However, this state is not reset after the parameters 
are parsed, so parsing is sensitive to the order in which tags are listed in 
the dictionary.
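
To make the order sensitivity concrete, here is a toy model in Python. It is 
only a sketch: the table, state names and flag names are hypothetical and are 
not copied from pdf.c. It assumes that each known name is acted on only when 
the parser is in that name's expected starting state, and that names missing 
from the table (such as DecodeParms) leave the state untouched. The 
`reset_on_unknown` switch is one possible reset rule for comparison, not a 
proposed patch.

```python
# Toy model of an order-sensitive name handler.
# All names below are hypothetical; this is not ClamAV's code.
NONE, FILTER, ANY = "none", "filter", "any"

# name -> (flag to set, state required beforehand, state entered afterwards)
ACTIONS = {
    "Filter": ("has_filters", ANY, FILTER),
    "FlateDecode": ("flate", FILTER, FILTER),
    "Image": ("image", NONE, NONE),
}


def collect_flags(names, reset_on_unknown=False):
    """Walk the names in order and return the set of flags that were raised."""
    state = NONE
    flags = set()
    for name in names:
        action = ACTIONS.get(name)
        if action is None:
            # Unknown names never change the state unless we opt in to a reset.
            if reset_on_unknown:
                state = NONE
            continue
        flag, from_state, to_state = action
        if from_state in (state, ANY):
            flags.add(flag)
            state = to_state
    return flags


subtype_first = ["Subtype", "Image", "Filter", "FlateDecode", "DecodeParms"]
filter_first = ["Filter", "FlateDecode", "DecodeParms", "Subtype", "Image"]

print(sorted(collect_flags(subtype_first)))
print(sorted(collect_flags(filter_first)))
print(sorted(collect_flags(filter_first, reset_on_unknown=True)))
```

In this model the "image" flag is raised when Image comes first, is missed 
when Image follows the filter, and is raised again once the state is reset.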



For example, a PDF object with this dictionary scans quickly because the image 
subtype comes before all filters:



```
429 0 obj << /ColorSpace /DeviceRGB /Name /im56 /Height 2850 /Subtype /Image 
/Filter /FlateDecode /DecodeParms << /Columns 1776 /Colors 3 /Predictor 2 >> 
/Type /XObject /Width 1776 /Length 25686 /BitsPerComponent 8 /Interpolate true 
>> stream
```



However, a PDF object with this dictionary scans slowly because the image 
subtype comes after the filter (the image will be dumped, though it should not 
be):



```
454 0 obj<</Length 455 0 R/Filter/FlateDecode/DecodeParms<</Columns 
1776/Predictor 2/Colors 3>>/Width 1776/Height 2850/BitsPerComponent 
8/ColorSpace/DeviceRGB/Interpolate 
true/Type/XObject/Name/im56/Subtype/Image>>stream
```
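
The difference between the two dictionaries above can be checked mechanically. 
The Python sketch below pulls the `/Name` tokens out of each object with a 
simple regular expression and reports whether Image appears before Filter. The 
helper names are hypothetical, and the regex is a rough approximation that 
works for these two samples; it is not a real PDF tokenizer.

```python
import re

fast_obj = (
    "429 0 obj << /ColorSpace /DeviceRGB /Name /im56 /Height 2850 "
    "/Subtype /Image /Filter /FlateDecode /DecodeParms << /Columns 1776 "
    "/Colors 3 /Predictor 2 >> /Type /XObject /Width 1776 /Length 25686 "
    "/BitsPerComponent 8 /Interpolate true >> stream"
)
slow_obj = (
    "454 0 obj<</Length 455 0 R/Filter/FlateDecode/DecodeParms<</Columns "
    "1776/Predictor 2/Colors 3>>/Width 1776/Height 2850/BitsPerComponent "
    "8/ColorSpace/DeviceRGB/Interpolate true/Type/XObject/Name/im56"
    "/Subtype/Image>>stream"
)


def pdf_names(text):
    """Return every /Name token in order of appearance (approximate)."""
    return re.findall(r"/([A-Za-z0-9]+)", text)


def image_before_filter(text):
    """True if the Image name appears earlier than the Filter name."""
    names = pdf_names(text)
    return names.index("Image") < names.index("Filter")


print(image_before_filter(fast_obj))
print(image_before_filter(slow_obj))
```

A check like this could help pick out sample files that exercise the slow 
ordering.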



Finally, at this line: 
https://github.com/Cisco-Talos/clamav/blob/5f934c16b47591157a7082b71e751c45f095e2c8/libclamav/pdf.c#L1580 
we see references to parameters, but they are only used after the tags have 
been parsed. Neither DP nor DecodeParms is in `pdfname_actions`, so they do not 
affect the state.



Slow PDF scanning has been a known problem for 3 years, and it would be nice to 
see it addressed in a new patch soon.



Again, I’m happy to provide more details if needed. Thank you for your time.



Best,

Eric




