Re: Replicator PrimaryNode waits forever for remotes to close

Steven Schlansker Fri, 01 Jul 2022 13:11:45 -0700

> On Jun 30, 2022, at 10:40 AM, Michael McCandless <[email protected]> 
> wrote:
> 
> +1 to provide a timeout, or, to simply fix close to aggressively close 
> regardless of what the replicas are doing?

Yes, aggressively closing would be great for us - we already expect that the
primary can and will crash, so an aggressive close is no worse than that.
I proposed the timeout on the theory that There Must Be A Reason It Is This Way 
:) but if the simpler solution is acceptable that's great for us!
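
For illustration, here is a minimal sketch of the timeout variant. It assumes a
hypothetical RemoteTracker class standing in for the primary's bookkeeping; the
class and its method names are made up for this example and are not the actual
PrimaryNode code or API. The wait gives up once the deadline passes and reports
whether every CopyState was released.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the primary's bookkeeping; NOT Lucene's PrimaryNode. */
class RemoteTracker {
  private int openCopyStates; // guarded by "this"

  /** Called when a replica starts copying (assumed hook). */
  synchronized void acquire() {
    openCopyStates++;
  }

  /** Called when a replica finishes copying (assumed hook). */
  synchronized void release() {
    if (openCopyStates <= 0) {
      throw new IllegalStateException("release without matching acquire");
    }
    openCopyStates--;
    notifyAll(); // wake up a pending awaitAllReleased
  }

  /**
   * Waits until every copy state is released or the timeout elapses.
   * Returns true if the count reached zero, false on timeout or interrupt.
   */
  synchronized boolean awaitAllReleased(long timeout, TimeUnit unit) {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (openCopyStates > 0) {
      long remaining = deadline - System.nanoTime();
      if (remaining <= 0) {
        return false; // timed out: caller proceeds with close anyway
      }
      try {
        TimeUnit.NANOSECONDS.timedWait(this, remaining);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve interrupt status
        return false;
      }
    }
    return true;
  }
}
```

A close method could call awaitAllReleased(30, TimeUnit.SECONDS), log a warning
if it returns false, and then continue shutting down. The aggressive-close
option is the same thing with the wait dropped entirely.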

> It's not a great design for primary to be so dependent on the replicas (but 
> vice versa makes sense?).

In our case, we use stateless HTTP to do the replication instead of the 
stateful sockets the reference implementation uses.
This makes the reference counting for CopyState a little messy but has other 
benefits that for us outweigh the costs.
So for us, I think this might be the only place the primary depends on the 
replicas at all, and it'd be wonderful to break that dependency.
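
On the reference counting: because only a count is kept, a leaked CopyState
leaves no trace of who held it. One possible debugging aid is to record a
holder per acquire, so that outstanding references can be listed at shutdown.
The sketch below is a hypothetical helper (HolderTracker and its methods are
invented for this example and are not part of lucene-replicator):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical debugging aid; NOT part of lucene-replicator. */
class HolderTracker {
  private final AtomicLong nextId = new AtomicLong();
  // Each Throwable captures the stack trace of the acquiring call site.
  private final Map<Long, Throwable> holders = new ConcurrentHashMap<>();

  /** Records the caller and returns a token to pass to release(). */
  long acquire(String who) {
    long id = nextId.incrementAndGet();
    holders.put(id, new Throwable("acquired by " + who));
    return id;
  }

  /** Returns false if the token is unknown or was already released. */
  boolean release(long id) {
    return holders.remove(id) != null;
  }

  /** Number of references that have not been released yet. */
  int outstanding() {
    return holders.size();
  }

  /** One description per holder that never released its reference. */
  List<String> describeLeaks() {
    List<String> out = new ArrayList<>();
    for (Throwable t : holders.values()) {
      out.add(t.getMessage());
    }
    return out;
  }
}
```

With something like this, a hung shutdown could log describeLeaks() (or the
full stack traces) instead of only showing a non-zero count. Capturing a stack
trace on every acquire has a cost, so it would probably be a debug-only option.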

> Maybe open a Jira issue or start a PR so we can discuss?

I filed https://issues.apache.org/jira/browse/LUCENE-10638 for further 
discussion. Thanks!

> Thanks for uncovering this and proposing a fix!
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Wed, Jun 29, 2022 at 7:36 PM Steven Schlansker 
> <[email protected]> wrote:
> Hi Lucene fans,
> 
> We use lucene-replicator to copy our indexes from a primary to replica nodes.
> Usually, startup and shutdown are fine. In particular we call 
> PrimaryNode.close.
> 
> But, in some edge cases - dropped connection? IOException? some process 
> crashed? -
> we sometimes hang in PrimaryNode.waitForAllRemotesToClose, which never 
> returns.
> 
> I suspect we have a reference counting bug: in some exceptional case, we 
> forget to release our CopyState.
> This definitely should be fixed, but in the meantime, it's very unhelpful for 
> the primary node to never come down.
> 
> I was considering submitting a PR to add a configurable timeout for the 
> shutdown wait - and after the timeout expires,
> continue with closing even though some replicas did not terminate.
> They will possibly crash with an "IOException: directory closed" later, or 
> maybe never come back at all.
> 
> Does this sound like a welcome change? Is there a better way to avoid hanging 
> here, other than to be bug-free?
> It's quite challenging to figure out where the CopyState wasn't released, as 
> only a count is kept.
> 
> Thanks!
> 
> Steven Schlansker
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 

