Cédric Damioli created JCR-3667:
-----------------------------------
Summary: Possible regression with accepted content types when
extracting and indexing binary values
Key: JCR-3667
URL: https://issues.apache.org/jira/browse/JCR-3667
Project: Jackrabbit Content Repository
Issue Type: Bug
Affects Versions: 2.4.4
Reporter: Cédric Damioli
Fix For: 2.4.5, 2.6.4, 2.7.2
JCR-3476 introduced a MIME type check before parsing binary values, based on
the types supported by Tika's parsers.
This can lead to incorrect behaviour: a "text/xml" binary is no longer
extracted and indexed, because the XMLParser does not declare "text/xml" as a
supported type.
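
To make the failure mode concrete, here is a minimal sketch of a gate based on declared types. It is a toy model, not the Jackrabbit or Tika code: the class name, the method name and the contents of SUPPORTED_TYPES are assumptions for illustration. The only fact taken from this report is that "text/xml" is not in the declared set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model of a declared-type gate; hypothetical, not Jackrabbit or Tika code. */
class DeclaredTypeGateSketch {

    // Assumption: the types the XML parser declares as supported.
    // The report only states that "text/xml" is not among them.
    static final Set<String> SUPPORTED_TYPES =
            new HashSet<>(Arrays.asList("application/xml", "image/svg+xml"));

    /** Returns true if a binary with this declared type would be handed to the parser. */
    static boolean isAccepted(String declaredType) {
        if (declaredType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String base = declaredType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return SUPPORTED_TYPES.contains(base);
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("application/xml")); // true
        System.out.println(isAccepted("text/xml"));        // false: never extracted or indexed
    }
}
```

With a gate of this shape, the content itself is never inspected, so a well-formed XML document stored with a "text/xml" type is skipped.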
This is a regression between 2.4.3 and 2.4.4: the same content was previously
recognized correctly by Tika's Detector and then extracted.
Furthermore, it seems inconsistent to me to rely on the declared content type
on the one hand and to delegate the actual type detection to Tika on the other.
This may lead to cases where the jcr:mimeType value is set to e.g.
"application/pdf" but the content is detected and parsed by Tika as
"text/plain" with no error.