Re: Should IndexWriter.flush return seqNo?
> Patrick maybe you had an interesting use case in mind?

I had one, but later on I found out that I don't necessarily need flush to achieve it, so it's not really a valid use case that definitely needs flush...

On Tue, Apr 25, 2023 at 7:26 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:

> I think Apache Solr could explore leveraging the returned sequence number
> for its transaction logs.
Re: Should IndexWriter.flush return seqNo?
I think Apache Solr could explore leveraging the returned sequence number for its transaction logs.

On Tue, 25 Apr 2023 at 18:36, Michael McCandless wrote:

> Actually there are some interesting use cases for sequence numbers: [...]
Re: Should IndexWriter.flush return seqNo?
On Sun, Apr 23, 2023 at 6:19 AM Uwe Schindler wrote:

> Having the sequence number public in API does not bring any benefit, as
> you cannot use it for anything.

Actually there are some interesting use cases for sequence numbers:

They enable the caller to know the effective order of operations of concurrent indexing events. This can be useful for applications that might sometimes update the same document at the same time across threads: they can implement optimistic concurrency, re-indexing the document if the order was not correct according to the application's external version tracking for out-of-order updates. OpenSearch has an array of locks to implement pessimistic concurrency (ensuring that the same id is never updated concurrently), but for cases where conflicts are rare, the optimistic implementation based on Lucene's sequence numbers is likely more efficient.

Another use case is precise indexing operation replay (e.g. from a Kinesis queue or transaction log or whatever) on recovering from a commit point: upon commit, you know precisely which indexing event was captured in the commit, and on recovering you can resume indexing from precisely the next indexing event. This doesn't matter for idempotent updates, but for other cases like append-only, it is useful and performant.

I also don't see why flush should return a sequence number -- it is not an externally visible event. Patrick, maybe you had an interesting use case in mind? Note that commit also writes (and fsyncs) the next segments_N file, to light up all the newly written/fsync'd segments for the next reader to open.

Mike McCandless

http://blog.mikemccandless.com
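The replay use case above can be sketched roughly as follows. This is only a toy model under stated assumptions: `ToyWriter`, `documentsAfterCrash` and `indexCrashAndReplay` are hypothetical names invented for illustration and are not Lucene classes or methods. The only behaviour borrowed from the discussion is that every indexing operation gets an increasing sequence number and that commit reports the sequence number of the last operation it captured. The sketch also keeps the event-to-seqNo mapping in memory; a real application would have to persist enough information to locate the resume point after a restart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy stand-in for a writer (hypothetical, NOT Lucene's IndexWriter). */
final class ToyWriter {
    private long lastSeqNo = 0;
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    /** Buffers a document and returns the sequence number assigned to it. */
    long addDocument(String doc) {
        pending.add(doc);
        return ++lastSeqNo;
    }

    /** Makes pending documents durable; returns the seqNo of the last operation captured. */
    long commit() {
        committed.addAll(pending);
        pending.clear();
        return lastSeqNo;
    }

    /** Simulates a crash and reopen: only committed documents survive. */
    List<String> documentsAfterCrash() {
        return new ArrayList<>(committed);
    }
}

final class SeqNoReplayDemo {
    /**
     * Indexes every event of an append-only log, committing once after
     * commitAfter events (0 = never), then crashes and replays only the
     * events whose sequence number is greater than the committed one.
     */
    static List<String> indexCrashAndReplay(List<String> eventLog, int commitAfter) {
        ToyWriter writer = new ToyWriter();
        long[] seqNoOfEvent = new long[eventLog.size()];
        long committedSeqNo = 0; // in this toy, 0 means "nothing committed yet"
        for (int i = 0; i < eventLog.size(); i++) {
            seqNoOfEvent[i] = writer.addDocument(eventLog.get(i));
            if (i + 1 == commitAfter) {
                committedSeqNo = writer.commit();
            }
        }
        // Crash: uncommitted documents are gone.
        List<String> index = writer.documentsAfterCrash();
        // Resume from precisely the next event after the commit point.
        for (int i = 0; i < eventLog.size(); i++) {
            if (seqNoOfEvent[i] > committedSeqNo) {
                index.add(eventLog.get(i));
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("e1", "e2", "e3", "e4", "e5");
        System.out.println(indexCrashAndReplay(log, 3)); // [e1, e2, e3, e4, e5]
    }
}
```

Because the replay starts exactly after the committed sequence number, no event is indexed twice and none is lost, which is what matters for the append-only case where updates are not idempotent.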
Re: Should IndexWriter.flush return seqNo?
> Yes that's true, I just have to add: You can still open an NRT reader
> directly from IndexWriter. But you don't need a sequence number there as
> it's hidden completely. So flushing is fine to allow users to get a new
> NRT reader with the state up to that point, but it does not need to
> return anything.

Uwe, sorry, I must correct you: flushing doesn't do that. It doesn't allow you to get an NRT reader or any other type of reader. It is the same as if you filled up the RAMBuffer with documents, that is all.

If you want an NRT reader you should be calling openIfChanged (and calling flush yourself is irrelevant/unnecessary). The two methods are completely separate, to me unrelated. That's why flush makes no sense in the API.
Re: Should IndexWriter.flush return seqNo?
Hi,

On 21.04.2023 at 16:16, Robert Muir wrote:

> This is not true: if i call IndexWriter.commit, then i can open an
> indexreader and see the documents.
>
> IndexWriter.flush doesn't do anything at all, really, just moves stuff
> from RAM to disk but not in a way that indexreader can see it or
> anything, right?

Yes that's true, I just have to add: You can still open an NRT reader directly from IndexWriter. But you don't need a sequence number there as it's hidden completely. So flushing is fine to allow users to get a new NRT reader with the state up to that point, but it does not need to return anything.

Having the sequence number public in API does not bring any benefit, as you cannot use it for anything.

> It doesn't make much sense that this method is public in the API,
> definitely adding sequence number makes no sense since nothing was
> committed here.

+1

-- 
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
Re: Should IndexWriter.flush return seqNo?
Hi Rob,

Thanks for explaining, that makes sense to me.

Patrick

On Fri, Apr 21, 2023 at 7:18 AM Robert Muir wrote:

> This is not true: if i call IndexWriter.commit, then i can open an
> indexreader and see the documents.
>
> IndexWriter.flush doesn't do anything at all, really, just moves stuff
> from RAM to disk but not in a way that indexreader can see it or
> anything, right?
>
> It doesn't make much sense that this method is public in the API,
> definitely adding sequence number makes no sense since nothing was
> committed here.
Re: Should IndexWriter.flush return seqNo?
This is not true: if I call IndexWriter.commit, then I can open an IndexReader and see the documents.

IndexWriter.flush doesn't do anything at all, really, just moves stuff from RAM to disk but not in a way that an IndexReader can see it or anything, right?

It doesn't make much sense that this method is public in the API, and adding a sequence number definitely makes no sense since nothing was committed here.

On Thu, Apr 20, 2023 at 1:28 AM Patrick Zhai wrote:
>
> Hi folks,
> I just realized that while "commit" returns the sequence number which
> represents the latest event that committed in the index, "flush" still
> returns nothing. Since they're essentially the same except fsync I wonder
> whether there's any specific reason to not do so?
>
> Best
> Patrick
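The difference between the two calls, as described in this reply, can be illustrated with a small toy model. Everything here is an assumption made for illustration: `ToyIndex` and its methods are hypothetical and are not Lucene classes. The model covers only a reader opened on the last commit point; it does not model NRT readers obtained from the writer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the visibility rules discussed above (NOT Lucene code). */
final class ToyIndex {
    private final List<String> ramBuffer = new ArrayList<>();         // documents still in memory
    private final List<String> flushedSegments = new ArrayList<>();   // on "disk", not part of any commit
    private final List<String> committedSegments = new ArrayList<>(); // referenced by the last commit

    void addDocument(String doc) {
        ramBuffer.add(doc);
    }

    /** Moves buffered documents to "disk" but publishes nothing, so there is nothing to return. */
    void flush() {
        flushedSegments.addAll(ramBuffer);
        ramBuffer.clear();
    }

    /** Flushes, then publishes everything written so far as the new commit point. */
    void commit() {
        flush();
        committedSegments.addAll(flushedSegments);
        flushedSegments.clear();
    }

    /** What a newly opened reader sees in this toy: the last commit only. */
    List<String> openReader() {
        return new ArrayList<>(committedSegments);
    }
}

final class FlushVsCommitDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("doc1");
        index.flush();
        System.out.println(index.openReader()); // [] : flush made nothing visible
        index.commit();
        System.out.println(index.openReader()); // [doc1]
    }
}
```

In this model flush changes where the documents live but not what any reader can observe, which is the argument made above for why a sequence number returned from flush would have nothing externally visible to refer to.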