[jira] [Commented] (OAK-2787) Faster multi threaded indexing for binary content

Thomas Mueller (JIRA) Tue, 21 Jun 2016 04:58:03 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341638#comment-15341638
 ]


Thomas Mueller commented on OAK-2787:
-------------------------------------

> Have a new CommitHook which looks for new binary property being added

This is a bit different to my suggestion. I guess both options have advantages 
and disadvantages.

In my case, the datastore would be fully responsible: text extraction would 
be part of adding a binary, and would therefore happen before the 
Session.save call. Move and copy operations would not need any special logic. 
With a CommitHook, if a binary reference is copied to another node, then the 
hidden property holding the extracted text needs to be copied as well, so 
that data is stored twice; but the index itself will usually have two copies 
of the same document anyway. With a shared datastore, copying a binary would 
not require extracting the text twice.
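
To make the datastore-side option concrete, here is a minimal sketch. It 
assumes a content-addressed store in which a binary's id is a hash of its 
bytes. ExtractingDataStore, addBinary, getExtractedText and the plain-text 
"extractor" are hypothetical names made up for illustration; none of this is 
Oak's actual API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a content-addressed datastore (not an Oak class).
class ExtractingDataStore {
    private final Map<String, byte[]> blobs = new HashMap<>();
    private final Map<String, String> extractedText = new HashMap<>();
    private int extractionCount = 0;

    /**
     * Stores the binary and returns its content id. Text is extracted here,
     * i.e. before any save, and only once per distinct content.
     */
    String addBinary(byte[] content) {
        String id = sha256(content);
        if (!blobs.containsKey(id)) {
            blobs.put(id, content.clone());
            extractedText.put(id, extractText(content));
            extractionCount++;
        }
        return id;
    }

    /** Returns the text extracted for the given content id, or null. */
    String getExtractedText(String id) {
        return extractedText.get(id);
    }

    /** Number of times the extractor actually ran. */
    int getExtractionCount() {
        return extractionCount;
    }

    // Placeholder extractor (assumption): treats the bytes as UTF-8 text.
    // A real extractor would parse the binary format instead.
    private static String extractText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }

    private static String sha256(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch a node only holds the content id, so copying or moving a node 
copies the id and nothing else. The extracted text stays keyed by that id, 
and adding the same bytes a second time does not run the extractor again.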

> Faster multi threaded indexing for binary content
> -------------------------------------------------
>
>                 Key: OAK-2787
>                 URL: https://issues.apache.org/jira/browse/OAK-2787
>             Project: Jackrabbit Oak
>          Issue Type: Wish
>          Components: lucene
>            Reporter: Chetan Mehrotra
>
> With Lucene based indexing the indexing process is single threaded. This 
> hampers the indexing of binary content, as on a multi processor system only 
> a single thread can be used to perform the indexing.
> [~ianeboston] suggested a possible approach [1] involving 2 phase indexing:
> # In the first phase, detect the nodes to be indexed and start the full text 
> extraction of the binary content. After extraction, save the binary token 
> stream back to the node as hidden data. In this phase the node properties 
> can still be indexed, and a marker field would be added to indicate that 
> the fulltext index is still pending.
> # Later, in the 2nd phase, look for all such Lucene docs and update them 
> with the saved token stream.
> This would allow the text extraction logic to be decoupled from the Lucene 
> indexing logic.
> [1] http://markmail.org/thread/2w5o4bwqsosb6esu
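
The two phases described in the issue could be sketched roughly as follows. 
This is an illustration under stated assumptions, not Oak or Lucene code: 
IndexDoc stands in for a Lucene document, savedText stands in for the hidden 
data stored on the node, and plain extracted text is saved instead of a 
token stream. All class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical index document; stands in for a Lucene document.
class IndexDoc {
    final String path;
    volatile String fulltext;                 // stays null until phase 2
    volatile boolean fulltextPending = true;  // the "marker field"

    IndexDoc(String path) {
        this.path = path;
    }
}

class TwoPhaseIndexer {
    final Map<String, IndexDoc> index = new ConcurrentHashMap<>();
    // Stands in for the hidden data saved back to each node.
    final Map<String, String> savedText = new ConcurrentHashMap<>();

    /** Phase 1: index each node right away and extract text on a thread pool. */
    void phaseOne(Map<String, byte[]> binariesByPath, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            for (Map.Entry<String, byte[]> e : binariesByPath.entrySet()) {
                // Properties would be indexed here; fulltext is marked pending.
                index.put(e.getKey(), new IndexDoc(e.getKey()));
                tasks.add(pool.submit(() -> {
                    savedText.put(e.getKey(), extractText(e.getValue()));
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // wait for all extractions to finish
            }
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException(ex);
        } finally {
            pool.shutdown();
        }
    }

    /** Phase 2: update pending docs with the saved text; returns the count. */
    int phaseTwo() {
        int updated = 0;
        for (IndexDoc doc : index.values()) {
            String text = savedText.get(doc.path);
            if (doc.fulltextPending && text != null) {
                doc.fulltext = text;
                doc.fulltextPending = false;
                updated++;
            }
        }
        return updated;
    }

    // Placeholder extractor (assumption): treats the bytes as UTF-8 text.
    private static String extractText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}
```

Only the extraction in phase 1 runs on several threads here. Phase 2 stays 
single threaded, which matches the idea of keeping the index update itself 
separate from text extraction.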



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
