[jira] [Comment Edited] (OAK-2895) Avoid accessing binary content if the mimeType is excluded from indexing

Chetan Mehrotra (JIRA) Thu, 21 May 2015 00:47:05 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553797#comment-14553797
 ]


Chetan Mehrotra edited comment on OAK-2895 at 5/21/15 7:45 AM:
---------------------------------------------------------------

Thanks [~alex.parvulescu] for the hint. Attached is the [patch|^OAK-2895.patch] 
with a test case. It also takes care of not eagerly accessing the backing input 
stream.

It uses the same approach as LazyFileInputStream [1] to access the Blob stream 
lazily. This is required to avoid the cost of accessing the file. Note that for 
S3 the whole file gets copied at the time of first access [2], so even just 
accessing the stream would be costly.
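
To illustrate the lazy-access idea, here is a minimal sketch (not the code from the attached patch). The class name LazyInputStream, the isOpened() helper and the Supplier-based constructor are assumptions made for this example only; the supplier stands in for whatever opens the Blob stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Supplier;

/** Hypothetical wrapper: opens the backing stream only on first access. */
class LazyInputStream extends InputStream {
    private final Supplier<InputStream> opener; // stand-in for opening the Blob stream
    private InputStream delegate;               // stays null until first access
    private boolean closed;

    LazyInputStream(Supplier<InputStream> opener) {
        this.opener = opener;
    }

    /** True once the backing stream has actually been opened. */
    boolean isOpened() {
        return delegate != null;
    }

    private InputStream delegate() throws IOException {
        if (closed) {
            throw new IOException("Stream closed");
        }
        if (delegate == null) {
            delegate = opener.get(); // the costly call happens here, not in the constructor
        }
        return delegate;
    }

    @Override
    public int read() throws IOException {
        return delegate().read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return delegate().read(b, off, len);
    }

    @Override
    public int available() throws IOException {
        return delegate().available();
    }

    @Override
    public void close() throws IOException {
        closed = true;
        if (delegate != null) {
            delegate.close(); // nothing to close if the stream was never opened
        }
    }
}
```

With a wrapper like this, a caller that ends up never reading (for example because the mimeType is excluded) can construct and close the stream without the backing file ever being fetched.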

[~alex.parvulescu] Can you have a look at the Lucene stuff?

[~tmueller] Can you review the new api being added in oak-commons?

[1] 
https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-data/src/main/java/org/apache/jackrabbit/core/data/LazyFileInputStream.java
[2] 
https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-data/src/main/java/org/apache/jackrabbit/core/data/CachingDataStore.java#L605


was (Author: chetanm):
Thanks [~alex.parvulescu] for the hint. Attached is the [patch|^OAK-2895.patch] 
with testcase. It also takes care of not eagerly accessing the backing input 
stream.

It uses the same approach as used in LazyFileInputStream [1] to lazily access 
the Blob stream. This is required so as to avoid cost of accessing the file. 
Note that for S3 at time of first access itself the whole file gets copied [2]. 
So even just accessing the stream would be costly

[~alex.parvulescu] [~tmueller] Can you have a look?

[1] 
https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-data/src/main/java/org/apache/jackrabbit/core/data/LazyFileInputStream.java
[2] 
https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-data/src/main/java/org/apache/jackrabbit/core/data/CachingDataStore.java#L605

> Avoid accessing binary content if the mimeType is excluded from indexing
> ------------------------------------------------------------------------
>
>                 Key: OAK-2895
>                 URL: https://issues.apache.org/jira/browse/OAK-2895
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>            Priority: Minor
>              Labels: perfomance
>             Fix For: 1.3.0, 1.2.3, 1.0.15
>
>         Attachments: OAK-2895.patch
>
>
> Currently the recommended way to exclude certain types of files from getting 
> indexed is to add them to {{EmptyParser}} in Tika Config. However, looking at 
> how Tika works, even if the mimetype is provided as part of the metadata, the 
> Tika Detector tries to determine the mimetype by actually reading some bytes 
> from the InputStream [1] before looking it up from the passed Metadata. This 
> would cause unnecessary IO when a large number of binaries are excluded.
> We would need to look for a way to avoid any access to binary content which is 
> not being indexed. One option is to expose a multi-value config property which 
> takes a list of mimetypes to be excluded from indexing. If the mimeType 
> provided as part of the JCR data is in that excluded list, then the call to 
> Tika should be avoided.
> [1] 
> https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L446
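
The exclusion check described in the issue could look roughly like the sketch below. It is illustrative only and not the actual Oak implementation: the class name ExcludedMimeTypes, its methods, and the Supplier standing in for the Tika call are all assumptions made for this example.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical holder for a multi-value "excluded mimetypes" config property. */
class ExcludedMimeTypes {
    private final Set<String> excluded = new HashSet<>();

    ExcludedMimeTypes(Collection<String> mimeTypes) {
        for (String m : mimeTypes) {
            excluded.add(normalize(m));
        }
    }

    /** Drops parameters such as "; charset=..." and lower-cases the type. */
    private static String normalize(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Checks the mimeType stored with the JCR data against the excluded list. */
    boolean isExcluded(String mimeType) {
        return mimeType != null && excluded.contains(normalize(mimeType));
    }

    /**
     * Returns an empty string for excluded types without invoking the
     * extractor (the stand-in for the Tika call), so the binary is never read.
     */
    String extractText(String mimeType, Supplier<String> extractor) {
        return isExcluded(mimeType) ? "" : extractor.get();
    }
}
```

The point of the design is that the decision is made from the stored mimeType alone, before anything touches the binary's InputStream.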



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
