Replicator PrimaryNode waits forever for remotes to close

Steven Schlansker Wed, 29 Jun 2022 16:36:12 -0700

Hi Lucene fans,

We use lucene-replicator to copy our indexes from a primary to replica nodes.
Usually, startup and shutdown are fine. For shutdown, we call PrimaryNode.close.


But, in some edge cases - dropped connection? IOException? some process crashed? - we sometimes hang in PrimaryNode.waitForAllRemotesToClose, which never returns.

I suspect we have a reference counting bug: in some exceptional case, we forget to release our CopyState.
This definitely should be fixed, but in the meantime, it's very unhelpful for the primary node to never come down.

I was considering submitting a PR to add a configurable timeout for the shutdown wait: after the timeout expires, close would continue even though some replicas did not terminate.
Those replicas will possibly crash with an "IOException: directory closed" later, or maybe never come back at all.
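
For concreteness, a bounded wait of this kind could look roughly like the sketch below. It is not Lucene code: the class BoundedCloseWaiter and its methods are hypothetical names made up for illustration, and the sketch assumes the outstanding remotes are tracked as a simple counter guarded by a monitor.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Lucene): counts outstanding references
 *  and lets the closer wait for them with an upper bound. */
class BoundedCloseWaiter {
    private int outstanding; // guarded by "this"

    synchronized void acquire() {
        outstanding++;
    }

    synchronized void release() {
        if (outstanding <= 0) {
            throw new IllegalStateException("release without matching acquire");
        }
        outstanding--;
        if (outstanding == 0) {
            notifyAll(); // wake up a closer blocked in awaitAllReleased
        }
    }

    /** Returns true if every reference was released before the timeout,
     *  false if the timeout expired with references still outstanding. */
    synchronized boolean awaitAllReleased(long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (outstanding > 0) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false; // give up; the caller decides how to proceed
            }
            TimeUnit.NANOSECONDS.timedWait(this, remaining);
        }
        return true;
    }
}
```

On a false return, close would log a warning and carry on with shutdown instead of blocking indefinitely.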

Does this sound like a welcome change? Is there a better way to avoid hanging here, other than to be bug-free?
It's quite challenging to figure out where the CopyState wasn't released, as only a count is kept.
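
One way to get more than a count, sketched here purely as an illustration, is to record a stack trace at each acquire and drop it on release, so that whatever is left at shutdown points at the code path that leaked. The class LeakTracker and its methods are hypothetical and are not part of lucene-replicator.

```java
import java.io.PrintStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical debugging aid (not part of Lucene): remembers where each
 *  outstanding reference was acquired. */
class LeakTracker {
    private final Map<Long, Throwable> outstanding = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    /** Records the caller's stack and returns a token to pass to release. */
    long acquire(String what) {
        long id = nextId.incrementAndGet();
        outstanding.put(id, new Throwable("unreleased: " + what));
        return id;
    }

    /** Returns false if the token was unknown or already released. */
    boolean release(long id) {
        return outstanding.remove(id) != null;
    }

    int outstandingCount() {
        return outstanding.size();
    }

    /** Prints the acquire-site stack trace of every reference still held. */
    void dumpOutstanding(PrintStream out) {
        for (Throwable site : outstanding.values()) {
            site.printStackTrace(out);
        }
    }
}
```

Capturing a stack trace on every acquire has a cost, so something like this would probably sit behind a debug flag.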

Thanks!

Steven Schlansker


