[
https://issues.apache.org/jira/browse/OAK-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davide Giannella updated OAK-2787:
----------------------------------
Fix Version/s: 1.16.0
> Faster multi threaded indexing / text extraction for binary content
> -------------------------------------------------------------------
>
> Key: OAK-2787
> URL: https://issues.apache.org/jira/browse/OAK-2787
> Project: Jackrabbit Oak
> Issue Type: Wish
> Components: lucene
> Reporter: Chetan Mehrotra
> Priority: Major
> Fix For: 1.14.0, 1.16.0
>
>
> With Lucene-based indexing, the indexing process is single threaded. This
> hampers the indexing of binary content, as on a multiprocessor system only
> a single thread can be used to perform the indexing.
> [~ianeboston] suggested a possible approach [1] involving two-phase indexing:
> # In the first phase, detect the nodes to be indexed and start the full text
> extraction of the binary content. After extraction, save the binary token
> stream back to the node as hidden data. In this phase the node properties
> can still be indexed, and a marker field would be added to indicate that the
> fulltext index is still pending.
> # Later, in the second phase, look for all such Lucene docs and update them
> with the saved token stream.
> This would allow the text extraction logic to be decoupled from the Lucene
> indexing logic.
> [1] http://markmail.org/thread/2w5o4bwqsosb6esu
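The two phases described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Oak or Lucene code: IndexDoc, hiddenTokenStore, extractTokens, phaseOne and phaseTwo are hypothetical names, the in-memory maps stand in for the Lucene index and for the hidden data on each node, and the whitespace tokenizer is a placeholder for real binary text extraction.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class TwoPhaseIndexingSketch {

    /** Hypothetical stand-in for a Lucene document; not an Oak or Lucene type. */
    static final class IndexDoc {
        final String path;
        volatile boolean fulltextPending = true;          // marker set in phase 1
        volatile List<String> fulltextTokens = List.of();  // filled in phase 2

        IndexDoc(String path) {
            this.path = path;
        }
    }

    // Assumption: plays the role of the hidden token stream saved on each node.
    final Map<String, List<String>> hiddenTokenStore = new ConcurrentHashMap<>();
    // Assumption: plays the role of the Lucene index, keyed by node path.
    final Map<String, IndexDoc> index = new ConcurrentHashMap<>();

    /** Placeholder extractor: treats the binary as UTF-8 text and splits on whitespace. */
    static List<String> extractTokens(byte[] binary) {
        String text = new String(binary, StandardCharsets.UTF_8)
                .toLowerCase(Locale.ROOT).trim();
        return text.isEmpty() ? List.of() : List.of(text.split("\\s+"));
    }

    /** Phase 1: index each node with a pending marker and extract text on a thread pool. */
    void phaseOne(Map<String, byte[]> binaries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            for (Map.Entry<String, byte[]> entry : binaries.entrySet()) {
                String path = entry.getKey();
                byte[] binary = entry.getValue();
                index.put(path, new IndexDoc(path)); // fulltext still pending
                tasks.add(pool.submit(() -> {
                    hiddenTokenStore.put(path, extractTokens(binary));
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // wait for extraction and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during extraction", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("text extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Phase 2: update every pending doc with its saved tokens; returns the number updated. */
    int phaseTwo() {
        int updated = 0;
        for (IndexDoc doc : index.values()) {
            List<String> tokens = hiddenTokenStore.get(doc.path);
            if (doc.fulltextPending && tokens != null) {
                doc.fulltextTokens = tokens;
                doc.fulltextPending = false;
                updated++;
            }
        }
        return updated;
    }
}
```

In this sketch only phaseOne uses multiple threads, so the expensive extraction work is spread across processors while phaseTwo stays a simple single-threaded pass over the pending docs.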
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)