[ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104727#comment-16104727
 ] 

Chetan Mehrotra commented on OAK-5519:
--------------------------------------

bq.  it does nothing except throw an exception / error / out of memory error 
every time

Currently the 
[BinaryTextExtractor|https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/binary/BinaryTextExtractor.java#L161]
 is capturing throwable while calling Tika (thinking about it we should handle 
it better and distinguish between serious error). 

The case which is not handled is where parser enters into some infinite loop 
while processing and for which the process needs to be killed. May be we use an 
executor to handle text extraction and on indexer thread submit a job to it 
with some timeout. In case of any such error the executor pool thread would get 
blocked but indexer thread can continue and system admin can look into 
restarting the process. This would allow us to detect such problamatic binaries 
more reliably and we only need to remember them in case of timeout or some 
exception in processing.

Thoughts?

> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Alexander Klimetschek
>              Labels: resilience
>             Fix For: 1.8
>
>
> If a text extraction is blocked (weird PDF) or a blob cannot be found in the 
> datastore or any other error upon indexing one item from the repository that 
> is outside the scope of the indexer, it currently halts the indexing (lane). 
> Thus one item (that maybe isn't important to the users at all) can block the 
> indexing of other, new content (that might be important to users), and it 
> always requires manual intervention  (which is also not easy and requires oak 
> experts).
> Instead, the item could be remembered in a known issue list, proper warnings 
> given, and indexing continue. Maintenance operations should be available to 
> come back to reindex these, or the indexer could automatically retry after 
> some time. This would allow normal user activity to go on without manual 
> intervention, and solving the problem (if it's isolated to some binaries) can 
> be deferred.
> I think the line should probably be drawn for binary properties. Not sure if 
> other JCR property types could trigger a similar issue, and if a failure in 
> them might actually warrant a halt, as it could lead to an "incorrect" index, 
> if these properties are important. But maybe the line is simply a try & catch 
> around "full text extraction".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to