Hi Lucene fans,

We use lucene-replicator to copy our indexes from a primary node to replica nodes. Usually, startup and shutdown are fine. In particular, on shutdown we call PrimaryNode.close.
But in some edge cases (a dropped connection? an IOException? a crashed process?) we sometimes hang in PrimaryNode.waitForAllRemotesToClose, which never returns. I suspect we have a reference counting bug: in some exceptional case, we forget to release our CopyState.

This definitely should be fixed, but in the meantime it's very unhelpful for the primary node to never come down. I am considering submitting a PR to add a configurable timeout for the shutdown wait: once the timeout expires, close would continue even though some replicas did not terminate. Those replicas will possibly crash with an "IOException: directory closed" later, or maybe never come back at all.

Does this sound like a welcome change? Is there a better way to avoid hanging here, other than being bug-free? It's quite challenging to figure out where the CopyState wasn't released, as only a count is kept.

Thanks!
Steven Schlansker
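
P.S. To make the proposal concrete, here is a minimal standalone sketch of the kind of bounded wait I have in mind. It deliberately does not use the Lucene classes: `BoundedShutdownWait`, `acquire`, `release`, `awaitAllReleased` and the counter are hypothetical stand-ins for the copy reference count, not the actual PrimaryNode code or a proposed final API.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Standalone illustration only; hypothetical names, not Lucene's PrimaryNode. */
final class BoundedShutdownWait {
  // Stand-in for the number of copies (CopyState references) still held.
  private final AtomicInteger pendingCopies = new AtomicInteger();

  /** Called when a replica starts copying. */
  void acquire() {
    pendingCopies.incrementAndGet();
  }

  /** Called when a replica finishes copying; wakes up any waiter. */
  synchronized void release() {
    pendingCopies.decrementAndGet();
    notifyAll();
  }

  int pending() {
    return pendingCopies.get();
  }

  /**
   * Waits until every copy has been released or the timeout elapses.
   *
   * @return true if all copies were released, false if the wait timed out
   */
  synchronized boolean awaitAllReleased(long timeout, TimeUnit unit)
      throws InterruptedException {
    final long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (pendingCopies.get() > 0) {
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false; // give up; the caller decides whether to close anyway
      }
      // Wait in short slices. wait(0) would block forever, so use at least 1 ms.
      long sliceMillis =
          Math.max(1L, Math.min(10L, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
      wait(sliceMillis);
    }
    return true;
  }
}
```

With something like this, close could log the number of still-pending copies when the wait returns false and then proceed with shutdown, rather than blocking indefinitely on a leaked reference.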