[
https://issues.apache.org/jira/browse/OAK-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554017#comment-14554017
]
Thomas Mueller commented on OAK-2895:
-------------------------------------
It looks good to me. About the LazyInputStream: if the stream was never
opened, I would probably set the input stream to an already-closed stream
when close() is called, to avoid an NPE. Removing "synchronized" could be
considered. Using the LazyInputStream at a lower level could be considered
later as well.
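
A rough sketch of the close() suggestion, for illustration only. This is not
the class from OAK-2895.patch: the Supplier-based constructor, the CLOSED
constant and the field names are assumptions made up for this example. It
also drops "synchronized", so it assumes single-threaded use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Supplier;

/** Hypothetical sketch of a lazily opened stream; not the patch's class. */
class LazyInputStream extends InputStream {

    /** Stand-in for "a closed stream": every read fails with IOException. */
    private static final InputStream CLOSED = new InputStream() {
        @Override
        public int read() throws IOException {
            throw new IOException("Stream closed");
        }
    };

    private final Supplier<InputStream> supplier; // assumed way to open the binary
    private InputStream delegate;                 // null until first read or close

    LazyInputStream(Supplier<InputStream> supplier) {
        this.supplier = supplier;
    }

    /** Opens the underlying stream on first use only. */
    private InputStream delegate() {
        if (delegate == null) {
            delegate = supplier.get();
        }
        return delegate;
    }

    @Override
    public int read() throws IOException {
        return delegate().read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return delegate().read(b, off, len);
    }

    @Override
    public void close() throws IOException {
        if (delegate == null) {
            // Never opened: install the closed stream so that later calls
            // neither hit a null reference nor open the binary after close.
            delegate = CLOSED;
        } else {
            delegate.close();
        }
    }
}
```

With this shape, closing a stream that was never read does not touch the
underlying binary, and a read after close fails with an IOException.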
> Avoid accessing binary content if the mimeType is excluded from indexing
> ------------------------------------------------------------------------
>
> Key: OAK-2895
> URL: https://issues.apache.org/jira/browse/OAK-2895
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Priority: Minor
> Labels: perfomance
> Fix For: 1.3.0, 1.2.3, 1.0.15
>
> Attachments: OAK-2895.patch
>
>
> Currently the recommended way to exclude certain types of files from being
> indexed is to add them to {{EmptyParser}} in the Tika config. However, looking
> at how Tika works, even if the mimetype is provided as part of the metadata,
> the Tika Detector tries to determine the mimetype by actually reading some
> bytes from the InputStream [1] before looking it up in the passed metadata.
> This causes unnecessary IO when a large number of binaries are excluded.
> We need a way to avoid any access to binary content that is not being
> indexed. One option is to expose a multi-valued config property that takes a
> list of mimetypes to be excluded from indexing. If the mimeType provided as
> part of the JCR data is in that excluded list, then the call to Tika should be
> avoided.
> [1]
> https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L446
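
The exclusion check proposed in the description could look roughly like the
sketch below. It is only an illustration: the class name MimeTypeExclusion,
the method shouldExtractText and the normalization rules are assumptions, not
Oak or Tika APIs, and the actual patch may be structured differently.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; decides from the JCR mimeType alone, without any IO. */
class MimeTypeExclusion {

    private final Set<String> excluded = new HashSet<>();

    /** @param excludedMimeTypes values of the (assumed) multi-valued config property */
    MimeTypeExclusion(String... excludedMimeTypes) {
        for (String type : excludedMimeTypes) {
            excluded.add(normalize(type));
        }
    }

    /** Lower-cases the type and drops parameters such as "; charset=UTF-8". */
    static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Returns false when the binary should be skipped, so the caller never
     * hands the stream to Tika and never reads any bytes from it.
     */
    boolean shouldExtractText(String jcrMimeType) {
        if (jcrMimeType == null) {
            return true; // no mimeType recorded: fall back to normal extraction
        }
        return !excluded.contains(normalize(jcrMimeType));
    }
}
```

A caller would evaluate shouldExtractText(...) on the stored mimeType before
obtaining the binary's stream, and skip text extraction when it returns false.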