[ 
https://issues.apache.org/jira/browse/OAK-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Mehrotra updated OAK-2895:
---------------------------------
    Description: 
Currently the recommended way to exclude certain types of files from getting 
indexed is to add them to {{EmptyParser}} in the Tika config. However, looking 
at how Tika works, even if the mimetype is provided as part of the metadata, 
the Tika Detector tries to determine the mimetype by actually reading some 
bytes from the InputStream [1] before looking it up in the passed Metadata. 
This causes unnecessary IO when a large number of binaries are excluded.

We need to look for a way to avoid any access to binary content that is not 
being indexed. One option is to expose a multi-value config property that 
takes a list of mimetypes to be excluded from indexing. If the mimeType 
provided as part of the JCR data is in that excluded list, then the call to 
Tika should be avoided.

[1] 
https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L446
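
The check proposed above could look roughly like the sketch below. It is only 
an illustration, not Oak's actual implementation: the class name 
{{MimeTypeExclusionSketch}} and its methods are hypothetical, and it assumes 
the mimeType string has already been read from the JCR data (for example from 
a {{jcr:mimeType}} property) without opening the binary stream.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper (not part of Oak or Tika): decides from the
 * declared mimeType alone whether a binary should be skipped, so the
 * caller never has to open the InputStream for excluded types.
 */
class MimeTypeExclusionSketch {

    // Normalized (lower-case, parameter-free) excluded mimetypes,
    // as they might come from a multi-value config property.
    private final Set<String> excluded = new HashSet<>();

    MimeTypeExclusionSketch(String... excludedMimeTypes) {
        for (String type : excludedMimeTypes) {
            String normalized = normalize(type);
            if (normalized != null) {
                excluded.add(normalized);
            }
        }
    }

    /** Strips parameters such as "; charset=UTF-8" and lower-cases the type. */
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    /**
     * Returns true if the declared mimeType is in the excluded list.
     * A missing mimeType is treated as "not excluded", so such binaries
     * would still go through the normal Tika path.
     */
    boolean isExcluded(String declaredMimeType) {
        String normalized = normalize(declaredMimeType);
        return normalized != null && excluded.contains(normalized);
    }
}
```

An indexer would call {{isExcluded}} before handing the binary to Tika and 
skip text extraction when it returns true. Treating a missing mimeType as not 
excluded is a design choice made for this sketch only.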

  was:
Currently the recommended way to exclude certain types of files from getting 
indexed is to add them to {{EmptyParser}} in the Tika config. However, looking 
at how Tika works, even if the mimetype is provided as part of the metadata, 
the Tika Detector tries to determine the mimetype by actually reading some 
bytes from the InputStream [1] before looking it up in the passed Metadata. 
This causes unnecessary IO when a large number of binaries are excluded.

To avoid this IO we should expose a multi-value config property that takes a 
list of mimetypes to be excluded from indexing. If the mimeType provided as 
part of the JCR data is in that excluded list, then the call to Tika should be 
avoided.

[1] 
https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L446


> Avoid accessing binary content if the mimeType is excluded from indexing
> ------------------------------------------------------------------------
>
>                 Key: OAK-2895
>                 URL: https://issues.apache.org/jira/browse/OAK-2895
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>            Priority: Minor
>             Fix For: 1.3.0, 1.2.3, 1.0.15
>
>
> Currently the recommended way to exclude certain types of files from getting 
> indexed is to add them to {{EmptyParser}} in the Tika config. However, looking 
> at how Tika works, even if the mimetype is provided as part of the metadata, 
> the Tika Detector tries to determine the mimetype by actually reading some 
> bytes from the InputStream [1] before looking it up in the passed Metadata. 
> This causes unnecessary IO when a large number of binaries are excluded.
> We need to look for a way to avoid any access to binary content that is not 
> being indexed. One option is to expose a multi-value config property that 
> takes a list of mimetypes to be excluded from indexing. If the mimeType 
> provided as part of the JCR data is in that excluded list, then the call to 
> Tika should be avoided.
> [1] 
> https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L446



