+1 to providing a timeout. Or should we simply fix close so that it closes aggressively, regardless of what the replicas are doing?
It's not a great design for the primary to be so dependent on the replicas (but vice versa makes sense?). Maybe open a Jira issue or start a PR so we can discuss?

Thanks for uncovering this and proposing a fix!

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jun 29, 2022 at 7:36 PM Steven Schlansker <stevenschlans...@gmail.com> wrote:

> Hi Lucene fans,
>
> We use lucene-replicator to copy our indexes from a primary to replica nodes.
> Usually, startup and shutdown are fine. In particular we call PrimaryNode.close.
>
> But, in some edge cases - dropped connection? IOException? some process crashed? -
> we sometimes hang in PrimaryNode.waitForAllRemotesToClose, which never returns.
>
> I suspect we have a reference counting bug: in some exceptional case, we
> forget to release our CopyState.
> This definitely should be fixed, but in the meantime, it's very unhelpful
> for the primary node to never come down.
>
> I was considering submitting a PR to add a configurable timeout for the
> shutdown wait - and after the timeout expires,
> continue with closing even though some replicas did not terminate.
> They will possibly crash with an "IOException: directory closed" later, or
> maybe never come back at all.
>
> Does this sound like a welcome change? Is there a better way to avoid
> hanging here, other than to be bug-free?
> It's quite challenging to figure out where the CopyState wasn't released,
> as only a count is kept.
>
> Thanks!
>
> Steven Schlansker
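
To make the discussion concrete, here is a rough sketch of what a bounded shutdown wait, as proposed in the quoted message, might look like. It is an illustration under assumptions, not Lucene code: BoundedCloseWait and awaitNoPendingCopies are hypothetical names, and the outstanding copies are modeled as a plain AtomicInteger because the message says only a count is kept.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Lucene): waits for a counter to reach zero, up to a deadline. */
public final class BoundedCloseWait {

  private BoundedCloseWait() {}

  /**
   * Polls pendingCopies until it reaches zero or timeoutMillis elapses.
   *
   * @return true if the count reached zero; false if the wait timed out or was interrupted,
   *     in which case the caller would continue closing anyway
   */
  public static boolean awaitNoPendingCopies(AtomicInteger pendingCopies, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (pendingCopies.get() > 0) {
      // Compare via subtraction so the check stays correct if nanoTime wraps around.
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(10); // assumed poll interval, chosen arbitrarily for this sketch
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }
}
```

A close path could call awaitNoPendingCopies(count, timeout) and, when it returns false, log the leaked count and proceed with closing. As the quoted message notes, replicas that are still copying may then fail later with "IOException: directory closed".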