[ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Klimetschek updated OAK-5519:
---------------------------------------
    Description: 
If text extraction fails (e.g. on a malformed PDF), a blob cannot be found in 
the datastore, or any other error outside the indexer's control occurs while 
indexing a single item from the repository, the complete indexing lane 
currently halts. Thus one broken item (which may not matter to users at all) 
can block the indexing of other, new content (which may matter to them), and 
recovery always requires manual intervention, which is not easy and requires 
Oak experts.

Instead, the item could be recorded in a known-issue list, a proper warning 
logged, and indexing continued. Maintenance operations should be available to 
reindex these items once the issue is fixed, or the indexer could retry 
automatically after some time. This would let normal user activity go on, and 
solving the problem (if it is isolated to some binaries) could be deferred.

I think the line should probably be drawn at binary properties. I am not sure 
whether other JCR property types could trigger a similar issue, or whether a 
failure in them might actually warrant a halt, since skipping them could lead 
to an "incorrect" index if these properties are important. But maybe the line 
is simply a try/catch around "full text extraction".
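
As a rough illustration of the "try/catch around full text extraction" idea, 
here is a minimal sketch. It is not Oak code: SkippingIndexer, TextExtractor, 
indexItem and failedPaths are hypothetical names invented for this example, 
and the in-memory maps merely stand in for the real index and for a persisted 
known-issue list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; none of these names are Oak APIs. */
public class SkippingIndexer {

    /** Stand-in for whatever performs full text extraction on a binary. */
    public interface TextExtractor {
        String extract(byte[] binary) throws Exception;
    }

    private final TextExtractor extractor;
    // path -> error message for items whose extraction failed (the "known issue list")
    private final Map<String, String> knownIssues = new LinkedHashMap<>();
    // path -> extracted text, standing in for the real index
    private final Map<String, String> index = new LinkedHashMap<>();

    public SkippingIndexer(TextExtractor extractor) {
        this.extractor = extractor;
    }

    /**
     * Indexes one item. A failed extraction is recorded and reported,
     * but never propagated, so the caller can continue with the next item.
     */
    public boolean indexItem(String path, byte[] binary) {
        try {
            index.put(path, extractor.extract(binary));
            knownIssues.remove(path); // a successful retry clears the known issue
            return true;
        } catch (Exception e) {
            knownIssues.put(path, String.valueOf(e.getMessage()));
            System.err.println("WARN: skipping full text extraction for "
                    + path + ": " + e.getMessage());
            return false;
        }
    }

    /** Paths a maintenance operation or a timed retry could revisit. */
    public List<String> failedPaths() {
        return new ArrayList<>(knownIssues.keySet());
    }

    public boolean isIndexed(String path) {
        return index.containsKey(path);
    }
}
```

A retry, whether triggered manually or by a timer, would then amount to 
calling indexItem again for each entry returned by failedPaths. Whether 
failures in non-binary properties should be handled the same way is left 
open here, as in the description above.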

  was:
If a text extraction is broken (weird PDF) or a blob cannot be found in the 
datastore or any other error upon indexing one item from the repository that is 
outside the scope of the indexer, it currently halts the complete indexing 
(lane). Thus one broken item (that maybe isn't important to the users at all) 
can block the indexing of other, new content (that might be important to 
users), and it always requires manual intervention to fix (which is also not 
easy and requires oak experts).

Instead, the item could be remembered in a known issue list, proper warnings 
given, and indexing continue. Maintenance operations should be available to 
come back to reindex these once the issue is fixed, or the indexer could 
automatically retry after some time.

I think the line should probably be drawn for binary properties. Not sure if 
other JCR property types could trigger a similar issue, and if a failure in 
them might actually warrant a halt, as it could lead to an "incorrect" index, 
if these properties are important. But maybe the line is simply a try & catch 
around "full text extraction".


> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Alexander Klimetschek
>
> If text extraction fails (e.g. on a malformed PDF), a blob cannot be found in 
> the datastore, or any other error outside the indexer's control occurs while 
> indexing a single item from the repository, the complete indexing lane 
> currently halts. Thus one broken item (which may not matter to users at all) 
> can block the indexing of other, new content (which may matter to them), and 
> recovery always requires manual intervention, which is not easy and requires 
> Oak experts.
> Instead, the item could be recorded in a known-issue list, a proper warning 
> logged, and indexing continued. Maintenance operations should be available to 
> reindex these items once the issue is fixed, or the indexer could retry 
> automatically after some time. This would let normal user activity go on, and 
> solving the problem (if it is isolated to some binaries) could be deferred.
> I think the line should probably be drawn at binary properties. I am not sure 
> whether other JCR property types could trigger a similar issue, or whether a 
> failure in them might actually warrant a halt, since skipping them could lead 
> to an "incorrect" index if these properties are important. But maybe the line 
> is simply a try/catch around "full text extraction".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
