[ 
https://issues.apache.org/jira/browse/TIKA-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4701:
------------------------------
    Attachment: eval-reports.tar.gz

> Use unencapsulated HTML body when it exists in MSGs
> ---------------------------------------------------
>
>                 Key: TIKA-4701
>                 URL: https://issues.apache.org/jira/browse/TIKA-4701
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: eval-reports.tar.gz
>
>
> We recently added a hack to decapsulate HTML from RTF in MSG files for the 
> purpose of identifying inline images.
> On a set of MSG files from recent Common Crawl data, it is clear that HTML 
> encapsulated within RTF is widespread. I propose improving our decapsulation 
> code and using the decapsulated HTML as the body text.
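
For readers unfamiliar with the term: per the MS-OXRTFEX specification, an RTF
body that encapsulates HTML carries the \fromhtml1 control word in its header.
The following is a minimal sketch of a detection check only, not Tika's code
and not a decapsulator. The function name and the 512-character header window
are assumptions made for illustration.

```python
import re

# \fromhtml1 is the MS-OXRTFEX control word marking HTML encapsulated in RTF.
# The lookahead rejects longer numeric parameters such as \fromhtml10.
_FROMHTML = re.compile(r"\\fromhtml1(?![0-9])")


def looks_like_encapsulated_html(rtf: str, header_chars: int = 512) -> bool:
    """Return True if the start of the RTF text carries the \\fromhtml1 marker.

    header_chars is an assumed window size for this sketch; the marker is
    expected near the start of the document, not deep in the body.
    """
    header = rtf[:header_chars]
    if not header.lstrip().startswith("{\\rtf"):
        return False  # not an RTF document at all
    return _FROMHTML.search(header) is not None
```

A caller could use such a check to decide whether to attempt decapsulation
before falling back to treating the body as ordinary RTF.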



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
