Cédric Damioli created JCR-3667:
-----------------------------------

Summary: Possible regression with accepted content types when extracting and indexing binary values
Key: JCR-3667
URL: https://issues.apache.org/jira/browse/JCR-3667
Project: Jackrabbit Content Repository
Issue Type: Bug
Affects Versions: 2.4.4
Reporter: Cédric Damioli
Fix For: 2.4.5, 2.6.4, 2.7.2

JCR-3476 introduced a MIME type check before parsing binary values, based on the types that Tika's parsers declare as supported. This can lead to incorrect behaviour: for example, a "text/xml" value is no longer extracted and indexed, because the XMLParser does not declare "text/xml" as a supported type.

This is a regression between 2.4.3 and 2.4.4, since the same content was previously recognized correctly by Tika's Detector and then extracted.

Furthermore, it seems inconsistent to me to rely on the declared content type on the one hand and to delegate the actual type detection to Tika on the other. This can lead to cases where the jcr:mimeType value is set to e.g. "application/pdf", but the content is detected and parsed by Tika as "text/plain" with no error.
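
The two cases described in this report can be modelled with a small, self-contained sketch. This is not Jackrabbit's or Tika's actual code: the class name, the method names and the contents of the SUPPORTED set are hypothetical and chosen only for illustration. The one detail taken from the report is that "text/xml" is absent from the declared set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class DeclaredTypeGate {
    // Hypothetical sample of parser-declared types (an assumption for this sketch).
    // "text/xml" is deliberately absent, mirroring the situation in the report.
    static final Set<String> SUPPORTED = new HashSet<>(
            Arrays.asList("application/xml", "application/pdf", "text/plain"));

    // Strip parameters such as "; charset=UTF-8" and lower-case the base type.
    static String normalize(String type) {
        int semi = type.indexOf(';');
        String base = semi >= 0 ? type.substring(0, semi) : type;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Gate on the declared type only: content is skipped when this returns false.
    static boolean passesGate(String declaredType) {
        return declaredType != null && SUPPORTED.contains(normalize(declaredType));
    }

    // True when the gate lets the content through but content-based detection
    // disagrees with the declared type, so parsing happens under another type.
    static boolean parsedUnderDifferentType(String declaredType, String detectedType) {
        return passesGate(declaredType)
                && detectedType != null
                && !normalize(declaredType).equals(normalize(detectedType));
    }

    public static void main(String[] args) {
        // Case 1: declared "text/xml" is rejected before any detection runs.
        System.out.println(passesGate("text/xml")); // prints false
        // Case 2: declared "application/pdf" passes, yet is detected as "text/plain".
        System.out.println(parsedUnderDifferentType("application/pdf", "text/plain")); // prints true
    }
}
```

In this model the first call shows XML content being skipped purely because of its declared type, and the second shows the declared type being used as a gate while a different, detected type drives the parsing.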