[Bug 8107] New: Change how PDF's are parsed with the PDFInfo plugin

bugzilla-daemon Wed, 18 Jan 2023 19:53:09 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8107


            Bug ID: 8107
           Summary: Change how PDF's are parsed with the PDFInfo plugin
           Product: Spamassassin
           Version: 4.0.0
          Hardware: PC
                OS: Windows 10
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Plugins
          Assignee: dev@spamassassin.apache.org
          Reporter: kent.o...@gmail.com
  Target Milestone: Undefined

I would like to discuss a possible rewrite of the PDFInfo plugin. The main
issue I'm running into is that it does not reliably detect images. For
instance, if '/Height' and '/Width' are on different lines, the image is not
detected. Likewise, if either '/Height' or '/Width' comes before '/Image', the
image is not detected. I've looked for simple ways to fix this, but I believe
the best fix is to parse the PDF correctly using the PDF object structure
instead of the current line-oriented method. Parsing the PDF object tree would
also allow the following additional features:

1. Differentiating between images displayed on the page and images used as a
mask for other images (the latter can probably be ignored).

2. Taking scaling into account. The pixel dimensions of an image do not
determine how much of the page it covers. For example, you can have a 400x600
image that takes up the whole page, or a 1200x900 image that only takes up 25%
of the page.

3. Handling images that are defined once and used multiple times on a page or
on multiple pages.

4. Prioritizing content on page 1 (or simply ignoring content on all other
pages). Spammers usually put the payload on page 1, and if there are other
pages, they are only there to confuse the filters.

5. Accessing images and URIs located in binary data.
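
To make the detection problem and points 1 and 2 concrete, here is a minimal
sketch in Python. It is not the plugin's Perl code and not the rewrite
mentioned in this report. All names in it (find_images, page_fraction, SAMPLE)
are hypothetical. It assumes uncompressed objects, direct integer values for
/Width and /Height, and no "stream" keyword inside a dictionary; a real parser
would need a proper tokenizer and would have to follow the page tree and the
graphics state.

```python
import re

# One indirect object: "<num> <gen> obj ... endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)


def find_images(pdf: bytes):
    """Return image XObjects found per object, not per line.

    Key order and line breaks inside the dictionary do not matter.
    """
    # Keep only the dictionary part of each object (text before "stream").
    dicts = {}
    for m in OBJ_RE.finditer(pdf):
        dicts[int(m.group(1))] = m.group(3).split(b"stream", 1)[0]

    # Objects referenced through /SMask are soft masks for other images.
    mask_ids = set()
    for d in dicts.values():
        for ref in re.finditer(rb"/SMask\s+(\d+)\s+\d+\s+R", d):
            mask_ids.add(int(ref.group(1)))

    images = []
    for num, d in dicts.items():
        if not re.search(rb"/Subtype\s*/Image(?![A-Za-z])", d):
            continue
        w = re.search(rb"/Width\s+(\d+)", d)
        h = re.search(rb"/Height\s+(\d+)", d)
        if w and h:
            images.append({
                "obj": num,
                "width": int(w.group(1)),
                "height": int(h.group(1)),
                "is_mask": num in mask_ids,
            })
    return images


def page_fraction(ctm, media_box):
    """Fraction of the page covered by an image drawn with matrix `ctm`.

    A PDF image is painted into the unit square, which the `cm` operands
    (a, b, c, d, e, f) map onto the page, so the covered area is |ad - bc|.
    Assumption: a single `cm`, no clipping, no overlap with other images.
    """
    a, b, c, d, _e, _f = ctm
    x0, y0, x1, y1 = media_box
    return abs(a * d - b * c) / ((x1 - x0) * (y1 - y0))


# /Height before /Width, on separate lines, both before /Subtype /Image.
SAMPLE = b"""1 0 obj
<< /Height 600
/Width 400
/Type /XObject /Subtype /Image /SMask 2 0 R >>
stream
...
endstream
endobj
2 0 obj
<< /Type /XObject /Subtype /Image /Width 400 /Height 600 >>
stream
...
endstream
endobj
"""

images = find_images(SAMPLE)
letter = (0, 0, 612, 792)  # US Letter page in points
full_page = page_fraction((612, 0, 0, 792, 0, 0), letter)
quarter_page = page_fraction((306, 0, 0, 396, 0, 0), letter)
```

With this input, object 1 is reported as a 400x600 image even though its keys
are split across lines and out of order, and object 2 is flagged as a mask.
The two page_fraction calls correspond to the "whole page" and "25% of the
page" cases in point 2; the result depends only on the matrix, not on the
pixel dimensions.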

I've already started working on this and I think it's doable, but I don't want
to duplicate work if someone else is already working on it. I would also like
feedback on whether this should be a drop-in replacement or a totally new
plugin. I would like to maintain backward compatibility, but there would be
differences in how image-to-text ratios are calculated, and the fuzzy MD5
checksums would change unless I keep the existing code (and parse each file
twice) solely to preserve them.


Any thoughts?

-- 
You are receiving this mail because:
You are the assignee for the bug.
