[
https://issues.apache.org/jira/browse/TIKA-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148043#comment-16148043
]
Hudson commented on TIKA-2454:
------------------------------
SUCCESS: Integrated in Jenkins build Tika-trunk #1353 (See
[https://builds.apache.org/job/Tika-trunk/1353/])
TIKA-2454: add OverrideDetector and allow PSTParser to specify body (tallison:
[https://github.com/apache/tika/commit/83f1afae3db65af966b13e6cc6dae3872aef630f])
* (edit) tika-parsers/src/main/resources/META-INF/services/org.apache.tika.detect.Detector
* (add) tika-parsers/src/test/resources/test-documents/testPST_variousBodyTypes.pst
* (edit) CHANGES.txt
* (edit) tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mbox/OutlookPSTParser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mbox/OutlookPSTParserTest.java
* (add) tika-core/src/main/java/org/apache/tika/detect/OverrideDetector.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
TIKA-2454: don't process the htmlbody. There could be encoding (tallison:
[https://github.com/apache/tika/commit/e0ff3ebff559bcdad690498d40898d426c0b2b02])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mbox/OutlookPSTParser.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mbox/OutlookPSTParserTest.java
> Emails extracted from PSTs detected as unexpected file types
> ------------------------------------------------------------
>
> Key: TIKA-2454
> URL: https://issues.apache.org/jira/browse/TIKA-2454
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.16
> Reporter: Matthew Caruana Galizia
> Fix For: 1.17
>
>
> This issue is severe. The Outlook PST parser extracts a string for the body
> of every email and passes that string to the {{EmbeddedDocumentExtractor}}.
> However, no content type is set on the {{Metadata}} object passed to the
> extractor. Therefore if, for example, the body of the email starts with the
> string "From John Smith." (as can happen when an email was forwarded), then
> the body of the email is detected as {{application/mbox}} and parsed as though
> it were an mbox file.
> I think the immediate fix for this issue is to force the type of the email to
> {{text/plain}} so that it is parsed as such.
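
The failure mode described above, and the effect of a caller-supplied type, can be sketched in a few lines of plain Java. This is an illustration only and does not use the Tika API: the class BodyTypeSketch and both methods are hypothetical, and the "starts with From " check is a simplified assumption about how an mbox magic match behaves, not Tika's actual detection rule.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the detection step; not part of Tika.
public class BodyTypeSketch {

    // Assumed, simplified magic rule: content starting with "From " looks like mbox.
    static String detectByMagic(byte[] data) {
        byte[] magic = "From ".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return "text/plain";
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return "text/plain";
            }
        }
        return "application/mbox";
    }

    // If the caller already knows the type, trust it and skip the magic check.
    static String detect(byte[] data, String overrideType) {
        return overrideType != null ? overrideType : detectByMagic(data);
    }

    public static void main(String[] args) {
        byte[] body = "From John Smith. Please see the note below."
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(body, null));         // prints application/mbox
        System.out.println(detect(body, "text/plain")); // prints text/plain
    }
}
```

With no declared type, the forwarded-style body is misclassified by its first bytes; once the parser states the type up front, the content is never sniffed. The same idea applies to any embedded body whose type the parent parser already knows.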