+1 to provide a timeout. Or should we simply fix close to close
aggressively, regardless of what the replicas are doing?

It's not a great design for the primary to be so dependent on the replicas
(but vice versa makes sense?).

Maybe open a Jira issue or start a PR so we can discuss?

Thanks for uncovering this and proposing a fix!

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jun 29, 2022 at 7:36 PM Steven Schlansker <
stevenschlans...@gmail.com> wrote:

> Hi Lucene fans,
>
> We use lucene-replicator to copy our indexes from a primary to replica
> nodes.
> Usually, startup and shutdown are fine. In particular we call
> PrimaryNode.close.
>
> But, in some edge cases - dropped connection? IOException? some process
> crashed? -
> we sometimes hang in PrimaryNode.waitForAllRemotesToClose, which never
> returns.
>
> I suspect we have a reference counting bug: in some exceptional case, we
> forget to release our CopyState.
> This definitely should be fixed, but in the meantime, it's very unhelpful
> for the primary node to never come down.
>
> I was considering submitting a PR to add a configurable timeout for the
> shutdown wait - and after the timeout expires,
> continue with closing even though some replicas did not terminate.
> They will possibly crash with an "IOException: directory closed" later, or
> maybe never come back at all.
>
> Does this sound like a welcome change? Is there a better way to avoid
> hanging here, other than to be bug-free?
> It's quite challenging to figure out where the CopyState wasn't released,
> as only a count is kept.
>
> Thanks!
>
> Steven Schlansker
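
For discussion, the timeout idea from the quoted message could look roughly
like the sketch below. It is not the lucene-replicator code: the class
BoundedCloseSketch and its methods acquire, release, and awaitAllReleased are
hypothetical names, and the counter only stands in for the CopyState reference
count mentioned above. The point is just the shape of a bounded wait that
reports whether every reference was released in time.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a primary that waits on outstanding references. */
public class BoundedCloseSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition allReleased = lock.newCondition();
    private int outstanding; // plays the role of the outstanding CopyState count

    /** A replica starts copying: one more reference is held. */
    public void acquire() {
        lock.lock();
        try {
            outstanding++;
        } finally {
            lock.unlock();
        }
    }

    /** A replica finished (or failed cleanly): drop its reference. */
    public void release() {
        lock.lock();
        try {
            if (outstanding > 0 && --outstanding == 0) {
                allReleased.signalAll();
            }
        } finally {
            lock.unlock();
        }
    }

    /**
     * Waits until every reference is released or the timeout expires.
     * Returns true if the count reached zero, false if we gave up waiting.
     */
    public boolean awaitAllReleased(long timeout, TimeUnit unit) throws InterruptedException {
        long nanos = unit.toNanos(timeout);
        lock.lock();
        try {
            while (outstanding > 0) {
                if (nanos <= 0L) {
                    return false; // timed out; caller may proceed with closing anyway
                }
                nanos = allReleased.awaitNanos(nanos); // returns the remaining wait time
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

A close method built on this would call awaitAllReleased with the configured
timeout, log a warning when it returns false, and then continue closing, so
that a leaked reference delays shutdown instead of blocking it forever.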
