[Bug 6439] Extend the meaning of "textual parts" like MUAs handle it

bugzilla-daemon Fri, 09 Sep 2022 15:57:19 -0700

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6439


--- Comment #24 from Kent Oyer <kent.o...@gmail.com> ---
I had high hopes that using extracttext.pm would be an easy fix, but I've
discovered three main problems with this method:

1. extracttext.pm stores its output in the rendered body part via a call to
`set_rendered`. This is great if you are extracting text from a PDF or image
file. However, if you are using `cat` on an HTML part, you will end up with raw
HTML in your rendered body. So really you need to use a tool like `html2text`
that can render the HTML and extract the visible text so your body rules work
correctly. That's no big deal, however...

2. Unfortunately, rawbody rules will not work at all, because rawbody rules run
against the text returned by `get_decoded_body_text_array`. That function only
returns the contents of text/* and message/* parts, so an HTML attachment
declared with any other MIME type never reaches them.

3. Lastly, extracttext.pm doesn't add discovered URIs to the URI detail list.
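
To illustrate points 1 and 3, here is a rough sketch of what "render the HTML,
keep only the visible text, and collect the URIs" means. It is written in
Python with the standard `html.parser` module purely for illustration. It is
not SpamAssassin code, and the names `VisibleTextExtractor` and `render_html`
are made up for this example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects visible text and link targets from an HTML fragment."""

    # Content inside these tags is never shown to the reader.
    HIDDEN_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_chunks = []
        self.uris = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1
        # Record link targets so they could be fed to URI checks.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.uris.append(value)

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep only text that sits outside hidden elements.
        if self._hidden_depth == 0 and data.strip():
            self.text_chunks.append(data.strip())


def render_html(html):
    """Return (visible_text, uris) for an HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_chunks), parser.uris
```

Simply running `cat` on the part would return the markup unchanged, which is
why body rules would then be matching against tags rather than visible text.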

I think the goal is to treat all HTML attachments the same, regardless of the
MIME-type. I've attached a patch file that does just that and also includes a
new test case. 
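
For readers who want to see the problem in miniature, the sketch below builds
a message with an HTML file attached as application/octet-stream and shows
that a text/* and message/* filter skips it. It uses Python's standard `email`
package for illustration only. The `looks_like_html` heuristic (filename
extension or leading bytes) is an assumption made for this example and is not
necessarily how the attached patch detects HTML.

```python
from email.message import EmailMessage


def is_textual_by_type(part):
    """Mimics a filter that only accepts text/* and message/* parts."""
    return part.get_content_maintype() in ("text", "message")


def looks_like_html(part):
    """Hypothetical heuristic: HTML by filename or by leading bytes."""
    filename = (part.get_filename() or "").lower()
    if filename.endswith((".html", ".htm")):
        return True
    # Multipart containers have no decodable payload, hence the fallback.
    payload = part.get_payload(decode=True) or b""
    head = payload[:256].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


def textual_parts(msg):
    """Parts accepted by MIME type, plus parts that look like HTML."""
    return [
        part
        for part in msg.walk()
        if is_textual_by_type(part) or looks_like_html(part)
    ]


# A plain-text body plus an HTML file sent with a generic MIME type.
msg = EmailMessage()
msg.set_content("plain body")
msg.add_attachment(
    b"<html><body>hi</body></html>",
    maintype="application",
    subtype="octet-stream",
    filename="invoice.html",
)
```

With the type-only filter, just the text/plain part is selected. With the
extended check, the attachment is picked up as well.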

Let me know if you have any questions.

Respectfully,
Kent

-- 
You are receiving this mail because:
You are the assignee for the bug.
