https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7579
--- Comment #15 from Giovanni Bechis ---
(In reply to Henrik Krohns from comment #13)
> Let's say some large PDF has a hundred unique "uris" for one reason or
> another. How would we manage this? Should we prefer to URIBL query them
> instead of body uris? Or shuffle and take n-amount of uris from here and
> there? How will different __URI* rules react, which depend on count / number
> of hits?
>
> I'm quite sceptical that even ExtractText makes any sense. It has the same
> problems, along with possibly filling Bayes with semi-random stuff from
> badly OCR'd images or wonky rendered PDF's etc.
>
> I think would just vote to have a pdf_has_uri() which can match uris from
> PDFs and that's it. No complex metadata hassles.

ExtractText could poison Bayes databases, but a lot of other sources can do
the same. On the other hand, it can parse .docx files and images as well, not
just PDF files.
A warning about using ExtractText together with Bayes is a good idea anyway.
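
For illustration only, one possible reading of the "take n-amount of uris" question from comment #13 is sketched below in Python. It is not SpamAssassin code: the function name select_uris, its parameters, and the policy itself (keep body URIs first, then fill the remaining slots with a random sample of attachment URIs) are all assumptions made for the example, and the thread does not settle on any policy.

```python
import random


def select_uris(body_uris, attachment_uris, limit, seed=None):
    """Return at most `limit` unique URIs (hypothetical helper).

    Assumed policy: body URIs are kept first, in their original order;
    any slots left over are filled with a random sample of URIs that
    appear only in attachments (e.g. extracted from a PDF).
    """
    # De-duplicate while preserving first-seen order.
    body = list(dict.fromkeys(body_uris))
    seen = set(body)
    extra = [u for u in dict.fromkeys(attachment_uris) if u not in seen]

    chosen = body[:limit]
    remaining = limit - len(chosen)
    if remaining > 0 and extra:
        # A seed makes the sample reproducible, which helps when testing.
        rng = random.Random(seed)
        chosen += rng.sample(extra, min(remaining, len(extra)))
    return chosen


if __name__ == "__main__":
    body = ["http://a.example/", "http://b.example/"]
    pdf = [f"http://pdf{i}.example/" for i in range(100)]
    for uri in select_uris(body, pdf, limit=10, seed=1):
        print(uri)
```

A cap like this bounds the number of URIBL queries per message, but it does not address the other concern raised in the quote: rules that depend on the count of URI hits would see a sampled subset and could behave differently from one scan to the next unless the sample is made deterministic.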
