[
https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243702#comment-16243702
]
Thomas Mueller commented on OAK-5519:
-------------------------------------
I found out why two threads are each consuming 100% CPU, rather than just one:
the text extraction cache only stores a result if extraction was
successful. I wonder why that is; it seems failures should be cached as well. What
do you think, [~chetanm], [~catholicon]?
{noformat}
public void put(@Nonnull Blob blob, @Nonnull ExtractedText extractedText) {
    String id = blob.getContentIdentity();
    if (extractedText.getExtractionResult() ==
            ExtractedText.ExtractionResult.SUCCESS && ...) {
        cache.put(id, extractedText.getExtractedText().toString());
    }
}
{noformat}
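To make the question concrete, here is a minimal sketch of what caching failures as well could look like. It is not Oak's actual cache: the class {{ExtractionCacheSketch}}, the {{Result}} enum and the {{Entry}} type are hypothetical stand-ins, and it assumes that a plain in-memory map keyed by the blob's content identity is enough for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in; NOT Oak's actual text extraction cache.
class ExtractionCacheSketch {

    /** Outcome of one extraction attempt (names are illustrative). */
    enum Result { SUCCESS, ERROR, TIMEOUT }

    /** Cached outcome: the result plus the text, which is empty for failures. */
    static final class Entry {
        final Result result;
        final String text;

        Entry(Result result, String text) {
            this.result = result;
            this.text = text;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    /** Stores every outcome, so a failed binary is not extracted again. */
    void put(String blobId, Result result, CharSequence text) {
        if (blobId == null) {
            return; // no content identity, nothing to key the entry on
        }
        boolean hasText = result == Result.SUCCESS && text != null;
        cache.put(blobId, new Entry(result, hasText ? text.toString() : ""));
    }

    /** Returns the cached outcome, or null if the blob was never attempted. */
    Entry get(String blobId) {
        return blobId == null ? null : cache.get(blobId);
    }
}
```

With something like this, a caller could check {{get(id)}} first and skip extraction whenever an entry exists, whether it records a success or a failure. Whether failures should expire after some time, so that they get retried, is a separate question the sketch does not address.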
> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
> Key: OAK-5519
> URL: https://issues.apache.org/jira/browse/OAK-5519
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: indexing
> Reporter: Alexander Klimetschek
> Assignee: Thomas Mueller
> Labels: resilience
> Fix For: 1.8
>
>
> If text extraction blocks (for example on a weird PDF), a blob cannot be found
> in the datastore, or any other error outside the scope of the indexer occurs
> while indexing one item from the repository, the indexing (lane) currently halts.
> Thus one item (that maybe isn't important to the users at all) can block the
> indexing of other, new content (that might be important to users), and it
> always requires manual intervention (which is also not easy and requires Oak
> experts).
> Instead, the item could be remembered in a known issue list, proper warnings
> given, and indexing continue. Maintenance operations should be available to
> come back to reindex these, or the indexer could automatically retry after
> some time. This would allow normal user activity to go on without manual
> intervention, and solving the problem (if it's isolated to some binaries) can
> be deferred.
> I think the line should probably be drawn for binary properties. Not sure if
> other JCR property types could trigger a similar issue, and if a failure in
> them might actually warrant a halt, as it could lead to an "incorrect" index,
> if these properties are important. But maybe the line is simply a try & catch
> around "full text extraction".