Had an offline chat with Thomas on that, and for now a creationTime-based approach can be used to allow the index logic to distinguish between a reindex and a fresh index.
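
As a rough illustration of the creationTime idea (a sketch only, written in Python rather than against the Oak API): assume the index definition is a plain dict, that a hypothetical "creationTime" property is rewritten whenever the index is created or reindexed, and that the index implementation remembers the value it last built against. All names below are made up for the example.

```python
import time

# Hypothetical model: the index definition node is a plain dict and
# "creationTime" is an assumed property name, not an Oak identifier.

def mark_reindex(index_definition, now_millis=None):
    """Stamp the definition with a new creation time (create or reindex)."""
    if now_millis is None:
        now_millis = int(time.time() * 1000)
    index_definition["creationTime"] = now_millis
    return now_millis


def classify(index_definition, last_seen):
    """Decide what kind of run this is from the stored creation time.

    last_seen is the creation time the index implementation recorded the
    last time it built its data, or None if it has never built any.
    """
    current = index_definition.get("creationTime")
    if last_seen is None:
        return "fresh"          # nothing built yet for this definition
    if current != last_seen:
        return "reindex"        # definition was re-stamped since last build
    return "incremental"        # same generation, only apply the diff


definition = {"type": "lucene"}
mark_reindex(definition, 1000)            # index definition created
first = classify(definition, None)        # implementation has no data yet
seen = definition["creationTime"]         # remembered after the first build
second = classify(definition, seen)       # ordinary incremental update
mark_reindex(definition, 2000)            # reindex requested
third = classify(definition, seen)        # stamp no longer matches
```

The point of the design is that the only shared state is one property on the definition node, so each index implementation can apply the same check without moving or renaming data nodes.
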
Thomas's proposal above, where the new index would be built side by side,
was more about avoiding the large-transaction problem. With Lucene this is
not a big issue, as only binary references constitute the state of the
unmerged branch. If such an issue is identified later, we can look into
implementing the above-mentioned mechanism.

Opened OAK-2229 for this.

Chetan Mehrotra


On Tue, Oct 21, 2014 at 3:09 PM, Thomas Mueller <[email protected]> wrote:
> Hi,
>
> Yes, that's my point. I wouldn't use MVCC for reindexing the Lucene index.
> Reindexing is very costly, and I wouldn't do it in one huge, and possibly
> hours long transaction.
>
> * You need to have access to the old and (for readers) the new data (to
>   re-create the index)
> * Eventually, you want to remove the old data (possibly piece by piece)
> * You may need to map the structure to a file system, which means separate
>   directories
>
> Regards,
> Thomas
>
>
> On 21/10/14 11:19, "Chetan Mehrotra" <[email protected]> wrote:
>
>>Thanks for the details Thomas!
>>
>>But the above model differs from the current model, which makes use of
>>MVCC. The reindex operation triggers removal of the :data node in the
>>branch, and IndexReader always looks for the :data node to open the
>>directory on trunk. So while a reindex is in progress, existing readers
>>make use of the node, which is not seen as removed in trunk.
>>
>>What I need is just a way to differentiate index state for a reindex
>>call, and that can be managed easily by storing the creation time in
>>the index definition node, which works easily with the existing logic.
>>
>>Chetan Mehrotra
>>
>>
>>On Tue, Oct 21, 2014 at 1:51 PM, Thomas Mueller <[email protected]> wrote:
>>> Hi,
>>>
>>> The node doesn't need to be moved, even after multiple reindex
>>> operations. Please note index creation is no different from reindex.
>>> In both cases, a new index data node is created.
>>>
>>> So, if an index definition is created:
>>>
>>>     /oak:index/lucene
>>>
>>> Then the index is being built:
>>>
>>>     /oak:index/lucene/:data_12345
>>>
>>> The index is done building (a):
>>>
>>>     /oak:index/lucene/:data_12345/@ready=true
>>>
>>> Reindexing is started (b):
>>>
>>>     /oak:index/lucene/@reindex=true
>>>     /oak:index/lucene/:data_12345/@ready=true
>>>
>>> While reindex is in progress:
>>>
>>>     /oak:index/lucene/@reindex=true
>>>     /oak:index/lucene/:data_12345/@ready=true
>>>     /oak:index/lucene/:data_14444
>>>
>>> When reindex is done (matches a):
>>>
>>>     /oak:index/lucene/:data_14444/@ready=true
>>>
>>> Reindex again is just restart from (b).
>>>
>>> Regards,
>>> Thomas
>>>
>>>
>>> On 21/10/14 10:00, "Chetan Mehrotra" <[email protected]> wrote:
>>>
>>>>On Tue, Oct 21, 2014 at 1:18 PM, Thomas Mueller <[email protected]>
>>>>wrote:
>>>>> What we need is a distinction between the old and the new index
>>>>> *data*.
>>>>
>>>>Yes, and that can be done by storing the index creation time.
>>>>
>>>>The approach you suggested, where two different nodes are used and the
>>>>nodes are later renamed, allows the logic to determine that it's a
>>>>reindex. Renaming the node would be fine in this case as the actual
>>>>data is stored on the filesystem, but if the node contains actual data
>>>>then such a move might be costly. E.g. in the copy-on-read case the
>>>>index data would be stored in the NodeStore and also on the file
>>>>system. Further, this is something which each such index
>>>>implementation would need to follow.
>>>>
>>>>Instead, if we just record the creation time in the index definition
>>>>node and then allow index impls to make use of that info to
>>>>distinguish between a reindex and an incremental index, then that
>>>>would serve the same purpose.
>>>>
>>>>
>>>>Chetan Mehrotra
>>>
>
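
For comparison, the side-by-side layout from Thomas's quoted message (states (a) and (b) and the :data_12345 / :data_14444 nodes) can be modelled roughly as below. This is only a sketch in Python under simplifying assumptions: the index definition is a dict, data nodes are child dicts with a "ready" flag, and the old data is dropped in a single step instead of piece by piece. The function names are invented for the example and are not Oak APIs.

```python
# Hypothetical model of /oak:index/lucene as a dict; child keys starting
# with ":data_" stand in for the data nodes from the quoted message.

DATA_PREFIX = ":data_"


def start_build(index, suffix):
    """Create a new, not yet ready data node next to any existing one."""
    index[DATA_PREFIX + suffix] = {"ready": False}


def finish_build(index, suffix):
    """Mark the new data node ready, then drop older data and the flag."""
    name = DATA_PREFIX + suffix
    index[name]["ready"] = True
    stale = [k for k in index if k.startswith(DATA_PREFIX) and k != name]
    for key in stale:
        del index[key]              # simplification: removed in one step
    index.pop("reindex", None)      # back to state (a)


def readable_data_node(index):
    """Return the name of the data node readers should open, if any."""
    for key, value in index.items():
        if key.startswith(DATA_PREFIX) and value["ready"]:
            return key
    return None


index = {}
start_build(index, "12345")
finish_build(index, "12345")            # state (a)
index["reindex"] = True                 # state (b): reindex requested
start_build(index, "14444")             # new data built side by side
during = readable_data_node(index)      # readers still use the old data
finish_build(index, "14444")            # matches state (a) again
after = readable_data_node(index)
```

In this model readers never see a half-built index, at the cost of every index implementation having to manage two data nodes, which is the overhead the creationTime approach avoids.
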
