[
https://issues.apache.org/jira/browse/TIKA-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeremy B. Merrill updated TIKA-1771:
------------------------------------
Description:
Emails I have (happy to share if you want) contain XHTML as one part of a
multipart email. Prior to this pull request, the priority on the
application/xhtml+xml magic detector was 50, equal to the priority on the
message/rfc822 detector. Because of the relative position of the two detectors
in tika-mimetypes.xml, the emails were incorrectly detected as XHTML documents.
With this PR, the priority of application/xhtml+xml is lowered to 40, so the
more-sensitive email magic detectors take precedence and the emails are
properly detected as message/rfc822.
I have not run this through the govdocs tester or anything other than my own
documents, so, full disclosure, this could cause false-negative XHTML
detections elsewhere.
I should note this occurs on trunk, from GitHub, up to date as of Tuesday-ish.
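To illustrate the tie described above, here is a minimal toy model in Python. It is not Tika's code: the MagicRule class, the detect function, the substring patterns, and the sample email are all hypothetical, and the assumption that equal priorities fall back to file order is taken only from the description in this issue.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MagicRule:
    """Hypothetical stand-in for one magic entry in a mimetypes file."""
    mime_type: str
    priority: int
    pattern: bytes  # toy rule: matches if these bytes occur anywhere in the data


def detect(data: bytes, rules: Sequence[MagicRule]) -> str:
    # sorted() is stable, so rules with equal priority keep their file order.
    # That tie-break is an assumption of this sketch, not confirmed Tika behavior.
    for rule in sorted(rules, key=lambda r: -r.priority):
        if rule.pattern in data:
            return rule.mime_type
    return "application/octet-stream"


# Made-up multipart email whose only part is XHTML.
EMAIL = (
    b"From: alice@example.com\r\n"
    b"Content-Type: multipart/alternative; boundary=x\r\n\r\n"
    b"--x\r\nContent-Type: application/xhtml+xml\r\n\r\n"
    b'<html xmlns="http://www.w3.org/1999/xhtml"></html>\r\n--x--\r\n'
)


def rules(xhtml_priority: int) -> list:
    # The XHTML rule is listed first, mimicking its earlier position in the file.
    return [
        MagicRule("application/xhtml+xml", xhtml_priority, b"<html xmlns="),
        MagicRule("message/rfc822", 50, b"From:"),
    ]


before = detect(EMAIL, rules(50))  # both at 50: the earlier XHTML rule is tried first
after = detect(EMAIL, rules(40))   # XHTML lowered to 40: the email rule is tried first
print(before, after)
```

In this model the email matches both rules, so the result depends only on which rule is tried first. That is why lowering one priority changes the outcome, and also why documents that match both patterns but really are XHTML could be misdetected.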
> Lower xhtml magic priority to ensure emails are detected as
> message/rfc822
> -------------------------------------------------------------------------------------
>
> Key: TIKA-1771
> URL: https://issues.apache.org/jira/browse/TIKA-1771
> Project: Tika
> Issue Type: Improvement
> Components: detector
> Reporter: Jeremy B. Merrill
> Priority: Critical
>