Cédric Damioli created JCR-3667:
-----------------------------------

             Summary: Possible regression with accepted content types when 
extracting and indexing binary values
                 Key: JCR-3667
                 URL: https://issues.apache.org/jira/browse/JCR-3667
             Project: Jackrabbit Content Repository
          Issue Type: Bug
    Affects Versions: 2.4.4
            Reporter: Cédric Damioli
             Fix For: 2.4.5, 2.6.4, 2.7.2


JCR-3476 introduced a MIME type test before parsing binary values, based on 
the types that Tika's parsers declare as supported.
This can lead to incorrect behaviour: a "text/xml" binary is no longer 
extracted and indexed, because the XMLParser does not declare "text/xml" as a 
supported type.
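
To illustrate the failure mode described above, here is a minimal, 
self-contained sketch. It uses neither Jackrabbit's nor Tika's actual code: 
the class, the method names, the supported-type set and the alias table are 
all hypothetical stand-ins. It only assumes that the parser declares 
"application/xml" but not "text/xml", and that "text/xml" can be treated as 
an alias of "application/xml".

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from Jackrabbit or Tika.
final class MimeTypeCheckSketch {

    // Assumed stand-in for the types a parser declares as supported.
    static final Set<String> SUPPORTED =
            Set.of("application/xml", "image/svg+xml");

    // Assumed alias table mapping alternative names to a canonical type.
    static final Map<String, String> ALIASES =
            Map.of("text/xml", "application/xml");

    // Strict lookup on the declared type: rejects aliases such as "text/xml".
    static boolean isSupportedStrict(String declared) {
        return SUPPORTED.contains(declared.trim().toLowerCase(Locale.ROOT));
    }

    // Same lookup, but the declared type is first mapped to its canonical form.
    static boolean isSupportedNormalized(String declared) {
        String type = declared.trim().toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(ALIASES.getOrDefault(type, type));
    }

    public static void main(String[] args) {
        System.out.println(isSupportedStrict("text/xml"));     // false
        System.out.println(isSupportedNormalized("text/xml")); // true
    }
}
```

With the strict lookup the "text/xml" value would be skipped before any 
parsing takes place, whereas the normalized lookup accepts it. Whether 
normalization is the right fix for this issue is left open here.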

The problem here is a regression between 2.4.3 and 2.4.4: the same content 
was previously recognized correctly by Tika's Detector and then extracted.

Furthermore, it seems inconsistent to me to rely on the declared content type 
on the one hand and to delegate the actual type detection to Tika on the 
other.
This may lead to cases where the jcr:mimeType value is set to e.g. 
"application/pdf" but the content is detected and parsed by Tika as 
"text/plain" with no error.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
