[GitHub] michael-trelinski commented on issue #6740: Zookeeper loss

GitBox Fri, 14 Dec 2018 23:02:04 -0800

michael-trelinski commented on issue #6740: Zookeeper loss
URL: https://github.com/apache/incubator-druid/pull/6740#issuecomment-447543421

   Hi Sascha,

   Thank you for your interest in my PR.

   I would like to make a few comments (some are non sequiturs) about why 
I did what I did:

   1.  Fixing this in Curator (or Zookeeper) would affect a whole lot of 
systems (and systems of systems).  I believe I have tested this well, but I 
would prefer not to interfere with the stability of every other system on the 
planet that uses Zookeeper.  If Curator is interested in my workaround, they 
may have it.  As it stands (pending PR approval), the change is enabled only 
by explicitly setting a config file option, which sandboxes it to Druid.  I 
feel that this is the best way to roll out something this monumental.

   2.  After researching the problem (and also simulating why the RetryPolicy 
implementations didn’t catch this problem completely), I determined that 
Zookeeper itself is never really sure when a connection to a server is down or 
the service is down.  In my opinion, Curator had the best foundation for 
handling this scenario.

   3.  Our reason for this fix had a very specific cause: our DNS response for 
our Zookeeper servers changed IP addresses, and Java seemed to cache them 
forever.  Even with the BoundedExponentialBackoff retry policy timing 
out/giving up, Zookeeper would loop forever and keep a thread alive, which 
would prevent a clean JVM exit.  We would prefer that the service die (and 
allow a daemon to restart it) rather than produce bad data.  This may not fit 
everyone’s use case.

   4.  I am excited for my first meaty Druid PR, and I hope that the community 
accepts this change.
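
   As an illustration of the fail-fast behavior described in point 3 (this is not the code from the PR), a minimal sketch in plain Java might look like the following. The class name `FailFastRetry`, its methods, and the parameter values are hypothetical; only `java.security.Security` and the `networkaddress.cache.ttl` property are standard JDK names. Whether a finite DNS cache TTL helps with how the Zookeeper client resolves its hosts is an assumption, not something established in this thread.

```java
import java.security.Security;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: bounded exponential backoff that gives up instead of looping forever. */
final class FailFastRetry {

    /** Sleep time for an attempt: baseMs * 2^attempt, capped at maxSleepMs. */
    static long backoffMs(long baseMs, long maxSleepMs, int attempt) {
        int shift = Math.min(attempt, 20); // cap the shift so small bases cannot overflow
        return Math.min(maxSleepMs, baseMs << shift);
    }

    /**
     * Calls the action up to maxAttempts times and returns true on the first success.
     * If every attempt fails (or the thread is interrupted), runs onGiveUp and returns false.
     */
    static boolean runOrGiveUp(BooleanSupplier action, int maxAttempts,
                               long baseMs, long maxSleepMs, Runnable onGiveUp) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (action.getAsBoolean()) {
                    return true;
                }
            } catch (RuntimeException e) {
                // Treat an exception as a failed attempt and fall through to the backoff.
            }
            if (attempt + 1 < maxAttempts) {
                try {
                    Thread.sleep(backoffMs(baseMs, maxSleepMs, attempt));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    break;
                }
            }
        }
        onGiveUp.run();
        return false;
    }

    public static void main(String[] args) {
        // Assumption: cache successful DNS lookups for 30 seconds rather than indefinitely.
        // This should be set before the first lookup in the JVM.
        Security.setProperty("networkaddress.cache.ttl", "30");

        // The action always fails here, standing in for a connection attempt that never succeeds.
        boolean connected = runOrGiveUp(() -> false, 3, 10, 40,
                () -> System.err.println("giving up; a real service might exit here"));
        System.out.println("connected=" + connected);
    }
}
```

   Passing `() -> System.exit(1)` as the give-up handler would let a supervising daemon restart the process, which matches the preference stated in point 3; the handler is kept as a parameter so that the behavior stays opt-in.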

   Thank you.

   Best,
   Mike

   > On Dec 14, 2018, at 5:27 PM, Sascha Coenen <[email protected]> 
wrote:
   > 
   > Hi. I'm running Druid on Kubernetes and the fact that the Curator library 
is not dealing with connection disruptions properly is the one single remaining 
issue that prevents my setup from being resilient. So I love your motion as it 
is much needed. I'm not a committer, so I cannot comment on whether the 
approach chosen is sound, but in one way or another, something has to be done 
for sure.
   > 
   > As a side note, out of sheer curiosity I wonder whether there is any good 
reason to use Curator at all. I know that there are plans to move away from 
Zookeeper, but it is not Zookeeper that is causing all the issues - it is the 
Curator client library that is the culprit. I wonder why the Druid Devs don't 
just throw out Curator and be done with it. Would be much easier than getting 
rid of Zookeeper, which they can still do, but as a migration path action, I 
would throw out Curator first.
   > 
   > The most natural thing to do would have been to fix the following Curator 
issue:
   > https://issues.apache.org/jira/browse/CURATOR-229
   > But this issue was reported in 2015 and still has not been fixed, 
although it is a severe bug that is impeding many people.
   > 
   > Druid is also not using the most recent version of Curator if I'm not 
mistaken. Version 4.0.1 was released in Feb 2018, while Druid master uses 
4.0.0.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
