Re: Reindex and external indexes - Possibility of stale index data

Thomas Mueller Mon, 13 Oct 2014 01:47:03 -0700

Hi,

As for external Lucene indexes, what about this:


* in the ":data" node, store an index creation time, in milliseconds since
  1970
* use that as a path prefix for the actual index files

So if the index is configured as follows:

  /oak:index/lucene { path: "/quickstart/repo/lucenIndex" }

Then internally, Oak Lucene would create a node

  /oak:index/lucene/:dataInProgress { time: 1413189793297 }

It would then use that timestamp as the prefix for the directory, and the
index is created in:

  /quickstart/repo/lucenIndex/1413189793297

When the index is built, the node ":dataInProgress" is renamed to ":data":

  /oak:index/lucene/:data { time: 1413189793297 }

To read, this directory would be used. When reindexing, two nodes and two
directories would temporarily exist:

  /oak:index/lucene/:data { time: 1413189793297 }
  /oak:index/lucene/:dataInProgress { time: 1413189822022 }

  /quickstart/repo/lucenIndex/1413189793297

  /quickstart/repo/lucenIndex/1413189822022

Once the new index is built, in one transaction, the old ":data" node is
removed and the node ":dataInProgress" is renamed to ":data". Then the old
directories are removed.

You can only reindex once per millisecond, but I guess this isn't a
problem.
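
To make the sequence concrete, here is a rough in-memory sketch of the
bookkeeping. It is only an illustration, not Oak code: the class and method
names (TimestampedIndexDirs, startReindex, finishReindex, readDirectory) are
made up, the map stands in for the child nodes of /oak:index/lucene, and
nothing is actually written to or deleted from disk.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of the ":data" / ":dataInProgress" scheme. */
public class TimestampedIndexDirs {
    static final String DATA = ":data";
    static final String IN_PROGRESS = ":dataInProgress";

    // The configured "path" property of the index definition.
    private final Path base;
    // Stand-in for the child nodes: node name -> value of its "time" property.
    private final Map<String, Long> nodes = new HashMap<>();

    public TimestampedIndexDirs(String configuredPath) {
        this.base = Paths.get(configuredPath);
    }

    /** Starts a (re)index: records the creation time, returns the directory to write to. */
    public Path startReindex(long nowMillis) {
        Long current = nodes.get(DATA);
        if (current != null && current == nowMillis) {
            // Same timestamp as the live index would mean the same directory.
            throw new IllegalStateException("only one reindex per millisecond");
        }
        nodes.put(IN_PROGRESS, nowMillis);
        return base.resolve(Long.toString(nowMillis));
    }

    /** Directory used for reading, or null if no index has been built yet. */
    public Path readDirectory() {
        Long time = nodes.get(DATA);
        return time == null ? null : base.resolve(Long.toString(time));
    }

    /**
     * Replaces ":data" with ":dataInProgress" (one transaction in the real
     * repository). Returns the old directory that can now be deleted, or
     * null if there was no previous index.
     */
    public Path finishReindex() {
        Long fresh = nodes.remove(IN_PROGRESS);
        if (fresh == null) {
            throw new IllegalStateException("no reindex in progress");
        }
        Long old = nodes.put(DATA, fresh);
        return old == null ? null : base.resolve(Long.toString(old));
    }
}
```

Readers keep using the directory returned by readDirectory() until
finishReindex() has run, so the old index stays available for the whole
reindex.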

Regards,
Thomas






On 13/10/14 10:29, "Alex Parvulescu" <[email protected]> wrote:

>Hi,
>
>
>> If we set reindex to true in any index definition then Oak would
>> remove the existing index content before performing the reindex. This
>> would work fine if the index content are stored within NodeStore
>> itself.
>
>It is important to also specify that this appears as a single commit
>thanks to the MVCC model (delete + set reindexed index), so there's no
>downtime to speak of: the original index is available during the reindex
>process.
>
>
>> However if the index are stored externally e.g. Solr or Lucene index
>> with persistence set to filesystem then I think currently we do not
>> remove the existing index data which might lead to index
>> containing stale data.
>
>Agreed, this is a problem when storing the index outside the repo. The
>interesting part here is that only content updates might be affected.
>A deleted node will not resurface, because the query engine reloads nodes
>to check whether they are readable by the current session (ACL checks)
>and skips over the nodes it can't read, if I remember correctly.
>
>Focusing on the Lucene index now, I went through the code a bit (no proper
>tests yet) and it looks like it might not be affected by this that much. A
>reindex call has the before state empty so Lucene will update all the
>documents it finds [0], so no stale content on updates here. Just missing
>deleted node events.
>So the remaining question is about identifying content that was deleted
>between the indexed state and the current head state. One simple solution
>is to run a 'remove all documents query' on the lucene index, but that has
>the downside of making the index unusable during the time the indexing
>process runs, so I don't see it as a really good option, only maybe as a
>fallback of sorts.
>
>
>> Should we provide any sort of callback for indexers when reindex is
>> requested?
>Thinking about this a bit, there's a simpler way of handling a reindex
>call. If you really need to know that the current index update is
>actually a reindex call, you can check if the before state is the empty
>one on the root index editor.
>
>best,
>alex
>
>[0]
>https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneIndexEditor.java#L109
>
>
>
>On Mon, Oct 13, 2014 at 7:33 AM, Chetan Mehrotra <[email protected]> wrote:
>
>> Hi,
>>
>> If we set reindex to true in any index definition then Oak would
>> remove the existing index content before performing the reindex. This
>> would work fine if the index content are stored within NodeStore
>> itself.
>>
>> However if the index are stored externally e.g. Solr or Lucene index
>> with persistence set to filesystem then I think currently we do not
>> remove the existing index data which might lead to index
>> containing stale data.
>>
>> Should we provide any sort of callback for indexers when reindex is
>> requested?
>>
>> Chetan Mehrotra
>>
