https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7579

--- Comment #13 from Henrik Krohns <[email protected]> ---
Let's say some large PDF has a hundred unique "uris" for one reason or another.
How would we manage this? Should we prefer them over body uris for URIBL
queries? Or shuffle and take n uris from here and there? And how will the
various __URI* rules, which depend on the count / number of hits, react?
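
To make the "shuffle and take n" option concrete, here is a minimal sketch of one
possible policy, not a proposal for the actual implementation. It assumes body
uris keep priority and only the leftover lookup slots are filled with a random
sample of PDF-only uris. The function name, the parameters and the limit are
all hypothetical and do not correspond to any existing SpamAssassin API.

```python
import random


def pick_uris_for_lookup(body_uris, pdf_uris, limit, seed=None):
    """Hypothetical helper: cap the number of uris queued for URIBL lookups.

    Assumed policy (one option among several): body uris go first, and any
    remaining slots are filled with a random sample of uris seen only in PDFs.
    """
    rng = random.Random(seed)

    # Deduplicate body uris, keeping first-seen order, then cap at the limit.
    body = list(dict.fromkeys(body_uris))[:limit]
    seen = set(body)

    # Keep only PDF uris that were not already picked from the body.
    pdf_only = [u for u in dict.fromkeys(pdf_uris) if u not in seen]

    # Fill whatever slots are left with a random sample of PDF-only uris.
    remaining = limit - len(body)
    sampled = rng.sample(pdf_only, min(remaining, len(pdf_only)))
    return body + sampled


if __name__ == "__main__":
    body = ["a.example", "b.example"]
    pdf = [f"pdf{i}.example" for i in range(100)]
    print(pick_uris_for_lookup(body, pdf, limit=10, seed=1))
```

Even with a cap like this, the question about count-based __URI* rules remains:
they would see a different uri count depending on whether they run before or
after the selection.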

I'm quite sceptical that even ExtractText makes any sense. It has the same
problems, along with possibly filling Bayes with semi-random stuff from badly
OCR'd images or wonkily rendered PDFs etc.

I think I would just vote to have a pdf_has_uri() which can match uris from
PDFs, and that's it. No complex metadata hassles.
