[ 
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2478.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.17

Many thanks to [~letzlerr] for opening this issue!

Many thanks to [~kkrugler] for the test files and guidance!

For those who want the legacy behavior, use this in your tika-config.xml:

{noformat}
        <parser class="org.apache.tika.parser.mail.RFC822Parser">
            <params>
                <param name="extractAllAlternatives" type="bool">true</param>
            </params>
        </parser>
{noformat}

And for those who want the new, comparable behavior for Outlook MSG files (configured on the OfficeParser):

{noformat}
        <parser class="org.apache.tika.parser.microsoft.OfficeParser">
            <params>
                <param name="extractAllAlternativesFromMSG" type="bool">true</param>
            </params>
        </parser>
{noformat}
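
For context, here is a minimal sketch of a complete tika-config.xml that combines both settings. The <properties>/<parsers> wrapper and the DefaultParser exclusions are not part of the snippets above. They assume the usual tika-config.xml layout, in which a parser that is configured explicitly is excluded from DefaultParser so that it is not registered twice. Please check this against your Tika version before relying on it:

{noformat}
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the wrapper elements and exclusions below are assumptions,
     not taken from the fix itself. -->
<properties>
    <parsers>
        <!-- Keep every other default parser, minus the two configured below -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.mail.RFC822Parser"/>
            <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
        </parser>
        <!-- Legacy behavior: emit every alternative body part of an RFC822 message -->
        <parser class="org.apache.tika.parser.mail.RFC822Parser">
            <params>
                <param name="extractAllAlternatives" type="bool">true</param>
            </params>
        </parser>
        <!-- Comparable behavior for Outlook MSG files -->
        <parser class="org.apache.tika.parser.microsoft.OfficeParser">
            <params>
                <param name="extractAllAlternativesFromMSG" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
{noformat}

If only one of the two behaviors is wanted, drop the corresponding <parser> element together with its <parser-exclude> line.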

> RFC822 includes redundant copies of the text
> --------------------------------------------
>
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.17
>
>         Attachments: TIKA-2478.patch, UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml, 
> mixed-simple, mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a.    The mbox file - outer container "/"
> b.    The actual email--  "/embedded-1"
> c.    The utf-8 text content of the email "/embedded-1/embedded-2"
> d.    The utf-8 html content of the email  "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the 
> first non-null email body and then skips the rest.  Please modify MBOX to 
> not have separate "attached" documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an 
> example of input sufficient to generate this behavior.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
