[Bug 7579] PDFInfo: pdfinfo:pdf_has_uri

bugzilla-daemon Tue, 13 Apr 2021 13:52:52 -0700

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7579


--- Comment #15 from Giovanni Bechis <[email protected]> ---
(In reply to Henrik Krohns from comment #13)
> Let's say some large PDF has a hundred unique "uris" for one reason or
> another. How would we manage this? Should we prefer to URIBL query them
> instead of body uris? Or shuffle and take n-amount of uris from here and
> there? How will different __URI* rules react, which depend on count / number
> of hits?
> 
> I'm quite sceptical that even ExtractText makes any sense. It has the same
> problems, along with possibly filling Bayes with semi-random stuff from
> badly OCR'd images or wonkily rendered PDFs etc.
> 
> I think I would just vote to have a pdf_has_uri() which can match uris from
> PDFs and that's it. No complex metadata hassles.

ExtractText could poison Bayes databases, but a lot of other sources can do the
same. On the other hand, it can parse .docx files and images as well, not
just PDF files.
A warning about using ExtractText together with Bayes is a good idea anyway.

-- 
You are receiving this mail because:
You are the assignee for the bug.
