Chetan Mehrotra created OAK-2787:
------------------------------------

             Summary: Faster multi threaded indexing for binary content
                 Key: OAK-2787
                 URL: https://issues.apache.org/jira/browse/OAK-2787
             Project: Jackrabbit Oak
          Issue Type: Wish
          Components: lucene
            Reporter: Chetan Mehrotra


With Lucene based indexing the indexing process is single threaded. This hampers
the indexing of binary content, because on a multi processor system only a single
thread can be used to perform the indexing.

[~ianeboston] suggested a possible approach [1] involving two-phase indexing:
# In the first phase, detect the nodes to be indexed and start the full text
extraction of the binary content. After extraction, save the binary token stream
back to the node as hidden data. In this phase the node properties can still
be indexed, and a marker field would be added to indicate that the fulltext index
is still pending.
# Later, in the second phase, look for all such Lucene docs and update them with
the saved token stream.

This would allow the text extraction logic to be decoupled from the Lucene
indexing logic.
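
A minimal sketch of the two phases, to make the proposal concrete. It is not
Oak or Lucene code: the Doc class, the in-memory maps standing in for the index
and for the hidden node data, the whitespace tokenizer, and every method name
are hypothetical. For simplicity, phase one waits for all extraction tasks
before returning; a real implementation could let them finish in the background.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TwoPhaseIndexingSketch {

    /** Hypothetical stand-in for a Lucene document. */
    static final class Doc {
        final String path;
        final Map<String, String> fields = new ConcurrentHashMap<>();
        volatile boolean fulltextPending;                       // marker field
        volatile List<String> fulltextTokens = Collections.emptyList();
        Doc(String path) { this.path = path; }
    }

    // Stand-in for the index: node path -> document.
    final Map<String, Doc> index = new ConcurrentHashMap<>();
    // Stand-in for hidden data saved on the node: node path -> token stream.
    final Map<String, List<String>> hiddenTokens = new ConcurrentHashMap<>();

    /** Placeholder for real text extraction: lower-case whitespace tokens. */
    static List<String> extractTokens(byte[] binary) {
        String text = new String(binary, StandardCharsets.UTF_8).trim();
        if (text.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** Phase 1: index properties, set the marker, extract text on many threads. */
    void phaseOne(Map<String, byte[]> binaries, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            for (Map.Entry<String, byte[]> e : binaries.entrySet()) {
                Doc doc = new Doc(e.getKey());
                doc.fields.put("path", e.getKey()); // properties indexed right away
                doc.fulltextPending = true;         // full text still pending
                index.put(e.getKey(), doc);
                tasks.add(pool.submit(() -> {
                    hiddenTokens.put(e.getKey(), extractTokens(e.getValue()));
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // wait for extraction and surface any failure
            }
        } finally {
            pool.shutdown();
        }
    }

    /** Phase 2: update every pending document with its saved token stream. */
    int phaseTwo() {
        int updated = 0;
        for (Doc doc : index.values()) {
            if (!doc.fulltextPending) {
                continue;
            }
            List<String> tokens = hiddenTokens.remove(doc.path);
            if (tokens == null) {
                continue; // nothing saved yet; leave the marker for a later pass
            }
            doc.fulltextTokens = tokens;
            doc.fulltextPending = false;
            updated++;
        }
        return updated;
    }

    public static void main(String[] args) throws Exception {
        Map<String, byte[]> binaries = new HashMap<>();
        binaries.put("/content/a", "Hello Oak".getBytes(StandardCharsets.UTF_8));
        binaries.put("/content/b", "Lucene index".getBytes(StandardCharsets.UTF_8));

        TwoPhaseIndexingSketch sketch = new TwoPhaseIndexingSketch();
        sketch.phaseOne(binaries, Runtime.getRuntime().availableProcessors());
        System.out.println("docs updated in phase two: " + sketch.phaseTwo());
    }
}
```

In this sketch only the extraction step runs on the thread pool. The index
update in phase two stays on a single thread, which matches the idea of keeping
text extraction separate from the Lucene indexing logic.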

[1] http://markmail.org/thread/2w5o4bwqsosb6esu



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
