LuceneIndexEditor currently extracts binary content via Tika in the
same thread that processes the commit. This approach does not make
good use of multi-processor systems, particularly when the index is
being built as part of a migration.

Looking at JR2, I see LazyTextExtractorField [1], which I think would
help parallelize text extraction.

Would it make sense to bring this to Oak? Would it help improve
performance?
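
To make the idea concrete, here is a rough sketch of deferred
extraction as I understand it: the extraction is handed to a thread
pool when the field is created, and the caller blocks only when the
text is actually read. All names here (LazyTextSketch, LazyText) are
hypothetical, and the extractor function is just a stand-in for the
Tika call; this is not the JR2 code and not a proposed Oak API.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class LazyTextSketch {

    /**
     * Hypothetical holder for extracted text. Extraction starts on the
     * pool as soon as the object is created; get() blocks until done.
     */
    static final class LazyText {
        private final Future<String> pending;

        LazyText(ExecutorService pool,
                 Function<byte[], String> extractor,
                 byte[] binary) {
            // Submitted as a Callable, so the commit thread does not wait here.
            this.pending = pool.submit(() -> extractor.apply(binary));
        }

        String get() {
            try {
                return pending.get();
            } catch (InterruptedException e) {
                // Preserve the interrupt status for the caller.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while extracting", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("Extraction failed", e.getCause());
            }
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        // Stand-in extractor (assumption): a real one would call Tika.
        Function<byte[], String> extractor =
                bytes -> new String(bytes, StandardCharsets.UTF_8);
        try {
            // Both extractions are queued before either result is read.
            LazyText first = new LazyText(pool, extractor,
                    "first".getBytes(StandardCharsets.UTF_8));
            LazyText second = new LazyText(pool, extractor,
                    "second".getBytes(StandardCharsets.UTF_8));
            System.out.println(first.get() + " " + second.get());
        } finally {
            pool.shutdown();
        }
    }
}
```

The gain depends on how much work can be queued before the first
get() call; if the indexer reads each text immediately after creating
it, the extraction is effectively serial again.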

Chetan Mehrotra
[1] https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java
