https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8107
Bug ID: 8107
Summary: Change how PDF's are parsed with the PDFInfo plugin
Product: Spamassassin
Version: 4.0.0
Hardware: PC
OS: Windows 10
Status: NEW
Severity: normal
Priority: P2
Component: Plugins
Assignee: dev@spamassassin.apache.org
Reporter: kent.o...@gmail.com
Target Milestone: Undefined

I would like to discuss a possible rewrite of the PDFInfo plugin. The main issue I'm running into is that it does not reliably detect images. For instance, if '/Height' and '/Width' are on different lines, the image is not detected. Also, if either '/Height' or '/Width' comes before '/Image', the image is not detected.

I've looked for simple ways to fix it, but I believe the best fix is to parse the PDF correctly using the PDF object structure instead of the current line-oriented method. Parsing the PDF object tree would allow the following additional features:

1. Differentiating between images displayed on the page and images used as a mask for other images (the latter can probably be ignored).
2. Taking scaling into account. The pixel dimensions of an image are not related to the amount of area the image consumes on the page. For example, you can have a 400x600 image that takes up the whole page, or you can have a 1200x900 image that only takes up 25% of the page.
3. Handling images that are defined once and used multiple times on the page or on multiple pages.
4. Prioritizing content on page 1 (or simply ignoring content on all other pages). Spammers usually put the payload on page 1, and if there are other pages, they are only there to confuse the filters.
5. Accessing images and URIs located in binary data.

I've already started working on this and I think it's doable, but I don't want to duplicate work if someone else is already working on it. I would also like feedback on whether this should be a drop-in replacement or a totally new plugin.
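To illustrate the detection problem described above, here is a minimal sketch of dictionary-level matching. It is not the PDFInfo plugin's code and is written in Python rather than Perl purely for illustration; the names `iter_dicts`, `find_images`, and `SAMPLE` are hypothetical. It assumes uncompressed object dictionaries and ignores strings, comments, and stream contents, which a real parser would have to handle.

```python
import re


def iter_dicts(data):
    """Yield the body of each outermost << ... >> dictionary in data.

    Nested dictionaries stay inside the yielded body. This is a sketch: it
    does not skip string literals, comments, or binary stream data.
    """
    pos = 0
    while True:
        start = data.find(b"<<", pos)
        if start < 0:
            return
        depth, i = 0, start
        while i < len(data) - 1:
            pair = data[i:i + 2]
            if pair == b"<<":
                depth += 1
                i += 2
            elif pair == b">>":
                depth -= 1
                i += 2
                if depth == 0:
                    break
            else:
                i += 1
        if depth != 0:
            return  # unbalanced dictionary, stop scanning
        yield data[start + 2:i - 2]
        pos = i


def find_images(data):
    """Return (width, height, is_mask) for each image XObject dictionary.

    Keys are looked up within the whole dictionary, so their order and any
    line breaks between them do not matter.
    """
    images = []
    for body in iter_dicts(data):
        if not re.search(rb"/Subtype\s*/Image\b", body):
            continue
        width = re.search(rb"/Width\s+(\d+)", body)
        height = re.search(rb"/Height\s+(\d+)", body)
        if not (width and height):
            continue
        is_mask = re.search(rb"/ImageMask\s+true\b", body) is not None
        images.append((int(width.group(1)), int(height.group(1)), is_mask))
    return images


# Hypothetical sample: /Height and /Width precede /Image and sit on
# separate lines; the second object is a stencil mask.
SAMPLE = b"""5 0 obj
<< /Height 600
   /Width 400
   /Type /XObject /Subtype /Image
   /Filter /FlateDecode /DecodeParms << /Predictor 15 /Columns 400 >>
>>
endobj
6 0 obj
<< /Type /XObject /Subtype /Image /ImageMask true /Width 8 /Height 8 >>
endobj
"""
```

Calling `find_images(SAMPLE)` reports both images, and the `is_mask` flag gives a starting point for ignoring mask images as suggested in item 1.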
I would like to maintain backward compatibility but there would be differences in how image-to-text ratios are calculated and the fuzzy MD5 checksums would be different unless I keep the existing code (and parse each file twice) just to avoid changing the checksums. Any thoughts?
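One way the ratio calculation could change is by measuring the area an image occupies on the page instead of its pixel dimensions. The sketch below is an assumption about how that might be done, not the plugin's method. In PDF, an image is painted into a unit square that is transformed by the current transformation matrix set with the `cm` operator, so the painted area is the absolute value of the matrix determinant. The function name `image_page_coverage` is hypothetical, and the sketch ignores clipping, page rotation, and images that extend past the page edge.

```python
def image_page_coverage(ctm, media_box):
    """Return the fraction of the page area covered by one painted image.

    ctm:       (a, b, c, d, e, f), the operands of the `cm` operator in
               effect when the image is painted with `Do`.
    media_box: (llx, lly, urx, ury) page rectangle in points.
    """
    a, b, c, d, _e, _f = ctm
    llx, lly, urx, ury = media_box
    # The unit square maps to a parallelogram whose area is |ad - bc|.
    image_area = abs(a * d - b * c)
    page_area = (urx - llx) * (ury - lly)
    return image_area / page_area


LETTER = (0, 0, 612, 792)  # US Letter page, in points

# A 400x600-pixel image stretched over the whole page.
full_page = image_page_coverage((612, 0, 0, 792, 0, 0), LETTER)

# A 1200x900-pixel image painted into a quarter of the page.
quarter_page = image_page_coverage((306, 0, 0, 396, 0, 0), LETTER)
```

Here `full_page` is 1.0 and `quarter_page` is 0.25, regardless of the pixel dimensions of the underlying images, which matches the two cases described in item 2.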