[jira] [Commented] (TIKA-2454) Emails extracted from PSTs detected as unexpected file types

Tim Allison (JIRA) Wed, 30 Aug 2017 13:13:19 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147973#comment-16147973
 ]


Tim Allison commented on TIKA-2454:
-----------------------------------

[~mcaruanagalizia] and fellow Tika devs, let me know how that looks.  I put the 
override logic in CompositeDetector.

As a side note, the new test pst file I added still doesn't give us <b> 
markup, even though we're now grabbing the html instead of the text for the 
body when it exists. I'm not sure why.  That's for another issue...


> Emails extracted from PSTs detected as unexpected file types
> ------------------------------------------------------------
>
>                 Key: TIKA-2454
>                 URL: https://issues.apache.org/jira/browse/TIKA-2454
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.16
>            Reporter: Matthew Caruana Galizia
>
> This issue is severe. The Outlook PST parser extracts a string for the body 
> of every email and passes that string to the {{EmbeddedDocumentExtractor}}.
> However, no content type is set on the {{Metadata}} object passed to the 
> extractor. Therefore if, for example, the body of the email starts with the 
> string "From John Smith." (as can happen when an email was forwarded), then 
> the body of the email is detected as {{application/mbox}} and parsed as 
> though it were an mbox file.
> I think the immediate fix for this issue is to force the type of the email 
> body to {{text/plain}} so that it is parsed as such.
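
To make the failure mode and the override idea concrete, here is a minimal, 
self-contained sketch. It does not use Tika's API: the class name, the 
"content-type-override" key, and the toy sniffing rule (anything starting 
with "From " is treated as mbox) are all assumptions made up for 
illustration, not what CompositeDetector or the real mbox magic actually do.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a toy stand-in for a detector chain, not Tika code.
class OverrideDetectorSketch {

    // Hypothetical metadata key used by this sketch to carry a forced type.
    static final String CONTENT_TYPE_OVERRIDE = "content-type-override";

    // Toy content sniffing: a leading "From " is taken as an mbox signature.
    static String sniff(byte[] data) {
        int n = Math.min(data.length, 5);
        String head = new String(data, 0, n, StandardCharsets.US_ASCII);
        return head.equals("From ") ? "application/mbox" : "text/plain";
    }

    // If the caller declared a type in the metadata, trust it and skip
    // sniffing; otherwise fall back to looking at the bytes.
    static String detect(byte[] data, Map<String, String> metadata) {
        String override = metadata.get(CONTENT_TYPE_OVERRIDE);
        if (override != null && !override.isEmpty()) {
            return override;
        }
        return sniff(data);
    }

    public static void main(String[] args) {
        byte[] body = "From John Smith. Forwarding the note below."
                .getBytes(StandardCharsets.US_ASCII);

        // No type declared: the forwarded email body looks like mbox.
        Map<String, String> noHint = new HashMap<>();
        System.out.println(detect(body, noHint));   // application/mbox

        // Type declared by the extracting parser: sniffing is bypassed.
        Map<String, String> withHint = new HashMap<>();
        withHint.put(CONTENT_TYPE_OVERRIDE, "text/plain");
        System.out.println(detect(body, withHint)); // text/plain
    }
}
```

The point of the sketch is only the ordering: a type declared by the parser 
that extracted the body is consulted before any byte-level detection runs.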



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
