[jira] [Created] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries

Chetan Mehrotra (JIRA) Wed, 20 May 2015 05:45:13 -0700

Chetan Mehrotra created OAK-2892:
------------------------------------

             Summary: Speed up lucene indexing post migration by pre extracting 
the text content from binaries
                 Key: OAK-2892
                 URL: https://issues.apache.org/jira/browse/OAK-2892
             Project: Jackrabbit Oak
          Issue Type: New Feature
          Components: lucene, run
            Reporter: Chetan Mehrotra
            Assignee: Chetan Mehrotra
             Fix For: 1.3.0, 1.0.15



When migrating large repositories, say with 3 M docs (250k PDFs), Lucene 
indexing takes a long time to complete (at times 4 days!). Currently the text 
extraction logic is coupled with Lucene indexing and hence runs in 
single-threaded mode, which slows down the indexing process. Further, if 
reindexing has to be triggered, the extraction has to be done all over again.

To speed up Lucene indexing we can decouple the text extraction
from the actual indexing. This is partly based on the discussion in OAK-2787.

# Introduce a new ExtractedTextProvider which can provide the extracted
text for a given Blob instance
# In oak-run introduce a new indexer mode. It would take a path in the
repository, then traverse the repository, look for existing binaries and 
extract the text from them
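
As a rough illustration of the two steps above, the sketch below runs extraction on a thread pool and records the result per blob id. It does not use Oak's API: the getText(String) signature, InMemoryTextStore, PreExtractor and the extractor function are all assumptions made for this example, and a real tool would persist the text rather than keep it in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Assumed contract: pre-extracted text for a blob id, or null if none is stored. */
interface ExtractedTextProvider {
    String getText(String blobId);
}

/** Hypothetical in-memory store standing in for a persistent one. */
class InMemoryTextStore implements ExtractedTextProvider {
    private final Map<String, String> textById = new ConcurrentHashMap<>();

    void put(String blobId, String text) {
        textById.put(blobId, text);
    }

    @Override
    public String getText(String blobId) {
        return textById.get(blobId);
    }

    int size() {
        return textById.size();
    }
}

/** Hypothetical pre-extraction pass, decoupled from any indexing. */
class PreExtractor {
    /** Extracts text for every binary on a fixed-size pool and records it in the store. */
    static void extractAll(Map<String, byte[]> binaries,
                           Function<byte[], String> extractor,
                           InMemoryTextStore store,
                           int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map.Entry<String, byte[]> e : binaries.entrySet()) {
                pending.add(pool.submit(
                        () -> store.put(e.getKey(), extractor.apply(e.getValue()))));
            }
            // Wait for all tasks so that a failed extraction is reported to the caller.
            for (Future<?> f : pending) {
                try {
                    f.get();
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(ex);
                } catch (ExecutionException ex) {
                    throw new IllegalStateException("text extraction failed", ex.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the store is keyed by blob id, a later reindex could reuse the same entries instead of extracting again.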

So before or after the migration is done, one can run this oak-run tool to 
create a store which has the text already extracted. Then, post startup, we 
need to wire up the ExtractedTextProvider instance (backed by the BlobStore 
populated earlier) and the indexing logic can just get the content from it. 
This would avoid performing expensive text extraction in the indexing thread.
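
On the indexing side, the wiring could look roughly like the sketch below. PreExtractedText is a stand-in for the proposed ExtractedTextProvider, and IndexTextResolver is a hypothetical helper; neither is Oak code. The fallback to inline extraction when no pre-extracted text exists is an assumption of this sketch and is not stated in the proposal.

```java
import java.util.function.Function;

/** Stand-in for the proposed provider; the signature is an assumption. */
interface PreExtractedText {
    String getText(String blobId); // null when nothing was pre-extracted
}

/** Hypothetical helper the indexing thread would call for each binary. */
class IndexTextResolver {
    private final PreExtractedText provider;
    private final Function<byte[], String> inlineExtractor;
    private int inlineExtractions;

    IndexTextResolver(PreExtractedText provider, Function<byte[], String> inlineExtractor) {
        this.provider = provider;
        this.inlineExtractor = inlineExtractor;
    }

    /** Prefers the pre-extracted text; extracts inline only when none is available. */
    String textFor(String blobId, byte[] content) {
        String text = provider.getText(blobId);
        if (text != null) {
            return text;
        }
        inlineExtractions++;
        return inlineExtractor.apply(content);
    }

    /** Number of binaries that still needed extraction in the indexing thread. */
    int inlineExtractions() {
        return inlineExtractions;
    }
}
```

Counting the inline extractions gives a simple way to check how much of the repository the pre-extraction run actually covered.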




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
