Chetan Mehrotra created OAK-2895:
------------------------------------

             Summary: Provide config option to exclude certain mimeTypes from 
getting indexed
                 Key: OAK-2895
                 URL: https://issues.apache.org/jira/browse/OAK-2895
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: lucene
            Reporter: Chetan Mehrotra
            Assignee: Chetan Mehrotra
            Priority: Minor
             Fix For: 1.3.0, 1.2.3, 1.0.15


Currently the recommended way to exclude certain types of files from getting 
indexed is to add them to {{EmptyParser}} in the Tika config. However, looking 
at how Tika works, this does not avoid reading the binary even if the mimetype 
is provided as part of the metadata: the Tika Detector tries to determine the 
mimetype by actually reading some bytes from the InputStream [1] before looking 
it up from the passed Metadata. This causes unnecessary IO when a large number 
of binaries are excluded.

To avoid this IO we should expose a multi-valued config property which takes a 
list of mimetypes to be excluded from indexing. If the mimeType provided as 
part of the JCR data is in that excluded list, then the call to Tika should be 
avoided.
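
The proposed check could look roughly like the sketch below. It is only an 
illustration of the idea, not the actual Oak implementation: the class name 
{{MimeTypeExclusionSketch}}, its methods and the {{Supplier}} standing in for 
the Tika call are all hypothetical, and stripping parameters such as 
{{; charset=...}} before comparing is an assumption. The point is that the 
binary's stream is never opened when the mimeType stored in JCR is excluded.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: skip text extraction for excluded mimetypes.
public class MimeTypeExclusionSketch {
    // Normalized mimetypes taken from the (assumed) multi-valued config property.
    private final Set<String> excluded = new HashSet<>();

    public MimeTypeExclusionSketch(String... excludedMimeTypes) {
        for (String type : excludedMimeTypes) {
            String normalized = normalize(type);
            if (normalized != null && !normalized.isEmpty()) {
                excluded.add(normalized);
            }
        }
    }

    // Assumption: compare case-insensitively and ignore parameters
    // such as "; charset=UTF-8".
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public boolean isExcluded(String jcrMimeType) {
        String normalized = normalize(jcrMimeType);
        return normalized != null && excluded.contains(normalized);
    }

    // The supplier stands in for the code that opens the binary and calls Tika.
    // It is only invoked when the mimetype is not excluded, so no bytes are
    // read for excluded binaries.
    public String extractText(String jcrMimeType, Supplier<String> tikaCall) {
        if (isExcluded(jcrMimeType)) {
            return "";
        }
        return tikaCall.get();
    }
}
```

A binary with no mimeType in its JCR data is not treated as excluded in this 
sketch, so it would still be passed to Tika as before.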

[1] 
https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L446



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
