[ 
https://issues.apache.org/jira/browse/OAK-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553825#comment-14553825
 ] 

Alex Parvulescu edited comment on OAK-2895 at 5/21/15 8:01 AM:
---------------------------------------------------------------

bq. So even just accessing the stream would be costly
Good point. I think the patch looks good, +1!
We may need to document the use of the _TypeDetector_ more prominently.


was (Author: alex.parvulescu):
bq. So even just accessing the stream would be costly
good point. I think the patch looks good, +1! 
we may need to document the use of the _ TypeDetector_ more prominently.

> Avoid accessing binary content if the mimeType is excluded from indexing
> ------------------------------------------------------------------------
>
>                 Key: OAK-2895
>                 URL: https://issues.apache.org/jira/browse/OAK-2895
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>            Priority: Minor
>              Labels: perfomance
>             Fix For: 1.3.0, 1.2.3, 1.0.15
>
>         Attachments: OAK-2895.patch
>
>
> Currently the recommended way to exclude certain types of files from being 
> indexed is to add them to {{EmptyParser}} in the Tika config. However, looking 
> at how Tika works, even if the mimetype is provided as part of the metadata, 
> the Tika Detector tries to determine the mimetype by actually reading some 
> bytes from the InputStream [1] before looking it up in the passed Metadata. 
> This would cause unnecessary IO when a large number of binaries are excluded.
> We need to find a way to avoid any access to binary content which is not 
> being indexed. One option is to expose a multi-value config property which 
> takes a list of mimetypes to be excluded from indexing. If the mimeType 
> provided as part of the JCR data is in that excluded list, then the call to 
> Tika should be avoided.
> [1] 
> https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L446
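
The option described above (check the JCR-provided mimeType against an excluded list before touching the binary) could look roughly like the sketch below. This is only an illustration, not the code in OAK-2895.patch: the class name ExcludedMimeTypes and the method names isExcluded and openIfIndexable are hypothetical, and the stream is passed as a Supplier purely so that an excluded binary is never opened.

```java
import java.io.InputStream;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch; names are assumptions and not taken from OAK-2895.patch.
class ExcludedMimeTypes {
    private final Set<String> excluded = new HashSet<>();

    // The values would come from a multi-value config property.
    ExcludedMimeTypes(String... mimeTypes) {
        for (String type : mimeTypes) {
            excluded.add(normalize(type));
        }
    }

    // Strip parameters such as "; charset=UTF-8" and compare case-insensitively.
    private static String normalize(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // A missing mimeType is treated as "not excluded", so detection still runs for it.
    boolean isExcluded(String mimeType) {
        return mimeType != null && excluded.contains(normalize(mimeType));
    }

    // The supplier is only invoked when the type is not excluded, so no bytes
    // of an excluded binary are read and the text extractor is never called.
    Optional<InputStream> openIfIndexable(String mimeType, Supplier<InputStream> stream) {
        if (isExcluded(mimeType)) {
            return Optional.empty();
        }
        return Optional.of(stream.get());
    }
}
```

A caller would hand the result of openIfIndexable to the text extraction step only when it is present. The important property is that the decision is made from the stored mimeType alone, before any stream is obtained.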



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
