[jira] [Commented] (JCR-3667) Possible regression with accepted content types when extracting and indexing binary values

Jukka Zitting (JIRA) Mon, 07 Oct 2013 12:59:45 -0700

    [ 
https://issues.apache.org/jira/browse/JCR-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788509#comment-13788509
 ]


Jukka Zitting commented on JCR-3667:
------------------------------------

OK, I see the problem. We'll probably want to handle the 1.3 to 1.4 upgrade in 
a separate improvement issue, and solve this problem independently of it. 
IIUC, Tika does not normalize the type names in this case, which leads to a 
mismatch between the detected and supported types. To avoid that, we could 
explicitly ask Tika to normalize the type names before comparing them.
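
To make the mismatch concrete, here is a minimal, self-contained sketch of 
the idea. It does not use Tika's API. The alias table, the supported-type set 
and all class and method names are hypothetical stand-ins, and dropping 
parameters such as "; charset=..." is a simplification made only for this 
example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: a stand-in for alias-aware type normalization.
final class TypeNormalizationSketch {

    // Hypothetical alias table (alias -> canonical name).
    private static final Map<String, String> ALIASES =
            Map.of("text/xml", "application/xml");

    // Hypothetical set of types that the parsers declare as supported.
    private static final Set<String> SUPPORTED =
            Set.of("application/xml", "text/plain");

    // Strip parameters, lower-case the name and resolve known aliases.
    static String normalize(String type) {
        String base = type;
        int semicolon = base.indexOf(';');
        if (semicolon >= 0) {
            base = base.substring(0, semicolon);
        }
        base = base.trim().toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(base, base);
    }

    // Direct lookup, which fails for aliases such as "text/xml".
    static boolean isSupportedRaw(String type) {
        return SUPPORTED.contains(type);
    }

    // Lookup after normalization, which resolves the alias first.
    static boolean isSupportedNormalized(String type) {
        return SUPPORTED.contains(normalize(type));
    }

    public static void main(String[] args) {
        System.out.println(isSupportedRaw("text/xml"));        // false
        System.out.println(isSupportedNormalized("text/xml")); // true
    }
}
```

With the raw lookup, "text/xml" content would be skipped. With the 
normalized lookup, it maps to the canonical name that the parser declares.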

> Possible regression with accepted content types when extracting and indexing 
> binary values
> ------------------------------------------------------------------------------------------
>
>                 Key: JCR-3667
>                 URL: https://issues.apache.org/jira/browse/JCR-3667
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>    Affects Versions: 2.4.4, 2.6.3
>            Reporter: Cédric Damioli
>            Assignee: Jukka Zitting
>              Labels: patch
>             Fix For: 2.7.2
>
>
> JCR-3476 introduced a mime-type test before parsing binary values, based on 
> Tika's supported parsers.
> This may lead to incorrect behaviour, such as "text/xml" content not being 
> extracted and indexed because the XMLParser does not declare "text/xml" as a 
> supported type.
> The problem here is that this is a regression between 2.4.3 and 2.4.4, 
> because the same content was previously recognized correctly by Tika's 
> Detector and then extracted.
> Furthermore, it seems inconsistent to me to rely on the declared content 
> type on the one hand and to delegate the actual type detection to Tika on 
> the other.
> This may lead to cases where the jcr:mimeType value is set to e.g. 
> "application/pdf" but the content is detected and parsed by Tika as 
> "text/plain" with no error.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
