[Bug 7579] PDFInfo: pdfinfo:pdf_has_uri

bugzilla-daemon Mon, 12 Apr 2021 07:31:35 -0700

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7579


--- Comment #13 from Henrik Krohns <[email protected]> ---
Let's say some large PDF has a hundred unique "uris" for one reason or another.
How would we manage this? Should we prefer to query those against URIBLs
instead of the body uris? Or shuffle and take some n uris from here and there?
How will the different __URI* rules, which depend on the count / number of
hits, react?

I'm quite sceptical that even ExtractText makes any sense. It has the same
problems, along with possibly filling Bayes with semi-random stuff from badly
OCR'd images, wonkily rendered PDFs, etc.

I think I would just vote to have a pdf_has_uri() which can match uris from
PDFs, and that's it. No complex metadata hassles.

