[jira] [Comment Edited] (OAK-5519) Skip problematic binaries instead of blocking indexing

Chetan Mehrotra (JIRA) Wed, 08 Nov 2017 03:23:59 -0800

    [ 
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243754#comment-16243754
 ]


Chetan Mehrotra edited comment on OAK-5519 at 11/8/17 11:22 AM:
----------------------------------------------------------------

bq.  the text extraction cache only puts results in the cache if extraction was 
successful. I wonder why that is, it seems failure should also be cached.

+1. Note that currently, if text extraction fails for a file, we store a 
sentinel value "TextExtractionError" in the fulltext field for the node 
(aggregated root) being indexed, to indicate that there was an error 
processing it. 
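
A minimal sketch of what caching failures alongside successes could look like. 
This is not Oak's actual text extraction cache: the class name 
FailureAwareExtractionCache, the blob id keys and the extractor function are 
all hypothetical, and only the "TextExtractionError" sentinel comes from the 
comment above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not Oak's actual text extraction cache.
class FailureAwareExtractionCache {
    // Sentinel mirroring the "TextExtractionError" value mentioned above.
    static final String ERROR = "TextExtractionError";

    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> extractor;
    int extractorCalls = 0; // exposed only so the sketch can be checked

    FailureAwareExtractionCache(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    // Returns the extracted text, or ERROR if extraction failed now or earlier.
    String getText(String blobId) {
        String cached = cache.get(blobId);
        if (cached != null) {
            return cached;
        }
        String result;
        try {
            extractorCalls++;
            result = extractor.apply(blobId);
        } catch (RuntimeException e) {
            result = ERROR; // cache the failure as well, not only successes
        }
        cache.put(blobId, result);
        return result;
    }
}
```

With this shape a failing binary is handed to the extractor only once per cache 
lifetime. The cache is still in memory, so the entry is lost on restart.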

Thinking out loud #1 - Going forward we can probably store a hidden property 
to mark such binaries and avoid hitting them again (as the cache is 
ephemeral). However, this would be tricky, as IndexEditors currently do not 
have access to the NodeBuilder for that node. Maybe we can store it in the 
index data in some form (flat file?). 
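
To make the flat-file idea concrete, here is a rough sketch that keeps one blob 
id per line and reloads the file on startup. The class name 
FlatFileBlobBlacklist, the file layout and the use of blob ids as keys are 
assumptions for illustration, not an existing Oak format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the file name and format are assumptions, not Oak's.
class FlatFileBlobBlacklist {
    private final Path file;
    private final Set<String> blobIds = new HashSet<>();

    // Loads previously recorded blob ids, if the file already exists.
    FlatFileBlobBlacklist(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            blobIds.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    boolean contains(String blobId) {
        return blobIds.contains(blobId);
    }

    // Records a failing blob id; appends one line so the entry survives restarts.
    void add(String blobId) throws IOException {
        if (blobIds.add(blobId)) {
            Files.write(file,
                    Collections.singletonList(blobId),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.APPEND);
        }
    }
}
```

Unlike the cache, the entries persist across restarts, at the cost of one more 
file to keep consistent with the index data.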

Thought #2 - We could store an additional marker in a special field, which 
could later be queried to find all files that have not been indexed. This 
would also avoid the need for #1: the logic can instead "query" for the 
existing Lucene doc and check whether a path is blacklisted, i.e. use the 
Lucene index itself to store the blacklist.
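
A sketch of the marker-field idea, using a plain in-memory map as a stand-in 
for the Lucene index so that it stays self-contained. The class 
MarkerFieldIndex and the field name ":extractionFailed" are hypothetical; a 
real version would add the marker to the Lucene document and look it up with 
a query on that field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for an index; a real version would use Lucene documents.
class MarkerFieldIndex {
    // Hypothetical field name for the failure marker.
    static final String FAILED_FIELD = ":extractionFailed";

    // path -> (field name -> value); TreeMap keeps results ordered by path.
    private final Map<String, Map<String, String>> docs = new TreeMap<>();

    void index(String path, String fulltext, boolean extractionFailed) {
        Map<String, String> doc = new HashMap<>();
        doc.put("fulltext", fulltext);
        if (extractionFailed) {
            doc.put(FAILED_FIELD, "true");
        }
        docs.put(path, doc);
    }

    // "Query" for one path: is the existing document marked as failed?
    boolean isBlacklisted(String path) {
        Map<String, String> doc = docs.get(path);
        return doc != null && "true".equals(doc.get(FAILED_FIELD));
    }

    // Lists all paths whose binaries were not indexed.
    List<String> failedPaths() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : docs.entrySet()) {
            if ("true".equals(e.getValue().get(FAILED_FIELD))) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

The same marker serves both purposes described above: a per-path check before 
extraction and a report of everything that was skipped.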


was (Author: chetanm):
bq.  the text extraction cache only puts results in the cache if extraction was 
successful. I wonder why that is, it seems failure should also be cached.

+1. Note that currently if a file text extraction fails we store a sentinel 
value "TextExtractionError" in fulltext field for the node (aggregated root) 
being indexed to indicate that there was error processing that. 

Thinking out loud - Going forward we can probably store some hidden property to 
mark such binaries to avoid hitting them again (as cache is ephemeral). However 
this would be tricky as IndexEditors currently do not have access to 
NodeBuilder for that node. May be we can store it in index data in some form 
(flat file?) 

> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Alexander Klimetschek
>            Assignee: Thomas Mueller
>              Labels: resilience
>             Fix For: 1.8
>
>
> If text extraction is blocked (weird PDF), a blob cannot be found in the 
> datastore, or any other error outside the scope of the indexer occurs while 
> indexing one item from the repository, it currently halts the indexing (lane). 
> Thus one item (that maybe isn't important to the users at all) can block the 
> indexing of other, new content (that might be important to users), and it 
> always requires manual intervention (which is also not easy and requires Oak 
> experts).
> Instead, the item could be remembered in a known issue list, proper warnings 
> given, and indexing continue. Maintenance operations should be available to 
> come back to reindex these, or the indexer could automatically retry after 
> some time. This would allow normal user activity to go on without manual 
> intervention, and solving the problem (if it's isolated to some binaries) can 
> be deferred.
> I think the line should probably be drawn for binary properties. Not sure if 
> other JCR property types could trigger a similar issue, and if a failure in 
> them might actually warrant a halt, as it could lead to an "incorrect" index, 
> if these properties are important. But maybe the line is simply a try & catch 
> around "full text extraction".
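
The behaviour requested in the description (try & catch around extraction, a 
known issue list, warnings, and a later retry) could be sketched roughly as 
follows. SkippingIndexer and its fields are hypothetical and are not Oak APIs. 
The sketch only handles extraction that throws; an extraction that hangs 
would additionally need a timeout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of "skip and remember" indexing; not an Oak API.
class SkippingIndexer {
    private final Function<String, String> extractor;
    final Map<String, String> indexed = new LinkedHashMap<>();     // path -> text
    final Map<String, String> knownIssues = new LinkedHashMap<>(); // path -> error

    SkippingIndexer(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    void indexAll(List<String> paths) {
        for (String path : paths) {
            try {
                indexed.put(path, extractor.apply(path));
                knownIssues.remove(path); // a successful retry clears the entry
            } catch (RuntimeException e) {
                // Warn and continue instead of halting the whole run.
                System.err.println("WARN: skipping " + path + ": " + e.getMessage());
                knownIssues.put(path, String.valueOf(e.getMessage()));
            }
        }
    }

    // Maintenance operation: retry only the items recorded as known issues.
    void retryKnownIssues() {
        // Copy the keys first, since indexAll modifies knownIssues.
        indexAll(new ArrayList<>(knownIssues.keySet()));
    }
}
```

One failing item then no longer blocks the items after it, and the known issue 
list tells an operator (or a scheduled retry) what still needs attention.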



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
