michael-trelinski commented on issue #6740: Zookeeper loss
URL: https://github.com/apache/incubator-druid/pull/6740#issuecomment-447543421

Hi Sascha,

Thank you for your interest in my PR. I would like to make a couple of comments (some are non sequiturs) about why I did what I did:

1. Fixing this in Curator (or Zookeeper) would affect a whole lot of systems (and systems of systems). I believe I have tested this well, but I would prefer not to interfere with the stability of every other system on the planet that uses Zookeeper. If Curator is interested in my workaround, they may have it. As it is now (pending PR approval), it would be available only by explicitly enabling a config file option, which sandboxes the change to Druid. I feel that this is the best way to roll out something this monumental.

2. After researching the problem (and also simulating why the RetryPolicy implementations didn't catch this problem completely), I determined that Zookeeper itself is never really sure whether a connection to a server is down or the service is down. In my opinion, Curator had the best foundation for handling this scenario.

3. Our reason for this fix had a very specific cause: the DNS responses for our Zookeeper servers changed IP addresses, and Java seemed to cache the old ones forever. Even with the BoundedExponentialBackoff retry policy timing out/giving up, Zookeeper would loop forever and keep a thread alive, which would prevent a clean JVM exit. We would prefer that the service die (and allow a daemon to restart it) rather than produce bad data. This may not fit everyone's use case.

4. I am excited for my first meaty Druid PR, and I hope that the community accepts this change.

Thank you.

Best,
Mike

> On Dec 14, 2018, at 5:27 PM, Sascha Coenen <[email protected]> wrote:
>
> Hi. I'm running Druid on Kubernetes, and the fact that the Curator library is not dealing with connection disruptions properly is the single remaining issue that prevents my setup from being resilient. So I love your motion, as it is much needed. I'm not a committer, so I cannot comment on whether the approach chosen is sound, but one way or another, something has to be done for sure.
>
> As a side note, out of sheer curiosity, I wonder whether there is any good reason to use Curator at all. I know that there are plans to move away from Zookeeper, but it is not Zookeeper that is causing all the issues; it is the Curator client library that is the culprit. I wonder why the Druid devs don't just throw out Curator and be done with it. That would be much easier than getting rid of Zookeeper, which they can still do, but as a migration step I would throw out Curator first.
>
> The most natural thing to do would have been to fix the following Curator issue:
> https://issues.apache.org/jira/browse/CURATOR-229
> But this issue was reported in 2015 and it still has not been fixed, although it is a severe bug that is impeding many people.
>
> Druid is also not using the most recent version of Curator, if I'm not mistaken. Version 4.0.1 was released in Feb 2018, while Druid master uses 4.0.0.
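
On item 3 above: the standard JVM setting for how long resolved addresses are cached is the `networkaddress.cache.ttl` security property. The sketch below is not part of the PR; it only assumes that the stale addresses came from the JVM's `InetAddress` cache. The class name `DnsCacheTtl` and the TTL values are hypothetical. Whether this alone helps depends on whether the client re-resolves hostnames at all, which is a separate question from the JVM cache.

```java
import java.security.Security;

public class DnsCacheTtl {

    // Sets the JVM-wide TTLs (in seconds) for successful and failed hostname lookups.
    // Call this early in startup: the JVM typically reads these properties once,
    // when its address cache is first initialized.
    static void configure(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        // Hypothetical values: re-resolve successful lookups after 60 s,
        // failed lookups after 10 s.
        configure(60, 10);
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

The same properties can also be set in the JRE's `java.security` file instead of in code, which avoids any dependence on startup ordering.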
