+1 for permanent retry under the covers (without exposing a config that
would later need to be deprecated).

That said, I understand the reality that sometimes we have to work around an
unfixed issue in another project, so if you think it best to expose a config,
then I have no objections. Mainly I wanted to make sure you'd tried to get
upstream to fix it, as that is almost always the cleaner solution.

> The above fact implies some reluctance from the zookeeper community to
> fully solve the issue (maybe due to technical issues).

@Ted - I spent some time a few months ago poking through issues on the ZK
issue tracker, and it looked like there hadn't been much activity on the
project lately. So my guess is that it's less about problems with this
particular solution, and more that the solution has just enough moving parts
that no one with commit rights has had the time to review it. As a volunteer
maintainer on a number of projects, I certainly empathize with them, although
it would be nice to get some more committers onto the ZooKeeper project who
have the time to review some of these semi-abandoned PRs and either accept or
reject them.
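
For concreteness, here's a rough sketch (plain Java against the ZooKeeper
client API) of what "retry the creation of the ZooKeeper handle forever",
plus the periodic logging Ted suggested, could look like. The class name,
backoff, and log interval below are placeholders I'm making up for
illustration, not anything from the KIP or an actual patch:

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical helper: keep re-creating the ZooKeeper handle until it
// succeeds, logging at most once per LOG_INTERVAL_MS while it keeps failing.
public final class ZkRecreateSketch {

    private static final long RETRY_BACKOFF_MS = 1_000L;   // assumed value
    private static final long LOG_INTERVAL_MS = 60_000L;   // assumed value

    public static ZooKeeper recreateUntilSuccess(String connectString,
                                                 int sessionTimeoutMs,
                                                 Watcher watcher)
            throws InterruptedException {
        long lastLogMs = 0L;
        while (true) {
            try {
                // Can fail (e.g. unresolvable host) while DNS for replaced
                // ZK nodes hasn't propagated yet -- the ZOOKEEPER-2184 case.
                return new ZooKeeper(connectString, sessionTimeoutMs, watcher);
            } catch (Exception e) {
                long now = System.currentTimeMillis();
                if (now - lastLogMs >= LOG_INTERVAL_MS) {
                    System.err.println(
                        "Still unable to re-create the ZooKeeper client: " + e);
                    lastLogMs = now;
                }
                Thread.sleep(RETRY_BACKOFF_MS);
            }
        }
    }
}

The appeal of Jun's option, as I read it, is that nothing here needs a new
user-facing config; the only knobs are internal constants.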



On Thu, Nov 2, 2017 at 3:00 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Stephane:
> bq. hasn't acted in over a year
>
> The above fact implies some reluctance from the zookeeper community to
> fully solve the issue (maybe due to technical issues).
> Anyway, we should plan on not relying on the fix to go through in the near
> future.
>
> As for Jun's latest suggestion, I think we should add periodic logging
> indicating the retry.
>
> A KIP is not needed if we go that route.
>
> Cheers
>
> On Thu, Nov 2, 2017 at 2:54 PM, Stephane Maarek <
> steph...@simplemachines.com.au> wrote:
>
> > Hi Jun
> >
> > I think this is a better option. Would that change require a KIP then, as
> > it's not a change in the public API?
> >
> > @ted it was marked as a blocker for 3.4.11 but they pushed it back. It
> > seems that the owner of the PR hasn't acted in over a year and I think
> > someone needs to take ownership of that. Additionally, this would be a
> > change in Kafka's zookeeper client dependency, so there would be no need
> > to update your zookeeper quorum to benefit from the change
> >
> > Thanks
> > Stéphane
> >
> >
> > On 3 Nov. 2017 8:45 am, "Jun Rao" <j...@confluent.io> wrote:
> >
> > Stephane, Jeff,
> >
> > Another option is to not expose the reconnect timeout config and just
> > retry the creation of Zookeeper forever. This is an improvement over the
> > current situation, and if ZOOKEEPER-2184 is fixed in the future, we won't
> > need to deprecate the config.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Nov 2, 2017 at 9:02 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > ZOOKEEPER-2184 is scheduled for 3.4.12, whose release date is unknown.
> > >
> > > I think adding the session re-creation on the Kafka side should benefit
> > > Kafka users, especially those who don't plan to move to 3.4.12+ in the
> > > near future.
> > >
> > > On Wed, Nov 1, 2017 at 6:34 PM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Hi, Stephane,
> > > >
> > > > 3) The difference is that currently, there is no retry when re-creating
> > > > the Zookeeper object when a ZK session expires. So, if the re-creation
> > > > of Zookeeper fails, the broker just logs the error and the Zookeeper
> > > > object will never be created again. With this KIP, we will keep
> > > > retrying the creation of Zookeeper until success.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <
> > > > steph...@simplemachines.com.au> wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 1) The reason I'm asking about it is I wonder if it's not worth
> > > > > focusing the development efforts on taking ownership of the existing
> > > > > PR (https://github.com/apache/zookeeper/pull/150) to fix
> > > > > ZOOKEEPER-2184, rebasing it and having it merged into the ZK codebase
> > > > > shortly. I feel this KIP might introduce a setting that could be
> > > > > deprecated shortly and confuse the end user a bit further with one
> > > > > more knob to turn.
> > > > >
> > > > > 3) I'm not sure if I fully understand, sorry for the beginner's
> > > > > question: if the default timeout is infinite, then it won't change
> > > > > anything about how Kafka works today, will it? (Unless I'm missing
> > > > > something, sorry.) If it's not set to infinite, then don't we
> > > > > introduce the risk of a whole cluster shutting down at once?
> > > > >
> > > > > Thanks,
> > > > > Stephane
> > > > >
> > > > > On 31/10/17, 1:00 pm, "Jun Rao" <j...@confluent.io> wrote:
> > > > >
> > > > >     Hi, Stephane,
> > > > >
> > > > >     Thanks for the reply.
> > > > >
> > > > >     1) Fixing the issue in ZK will be ideal. Not sure when it will
> > > > >     happen though. Once it's fixed, we can probably deprecate this
> > > > >     config.
> > > > >
> > > > >     2) That could be useful. Is there a Java API to do that at
> > > > >     runtime? Also, invalidating the DNS cache doesn't always fix the
> > > > >     issue of an unresolved host. In some of the cases, human
> > > > >     intervention is needed.
> > > > >
> > > > >     3) The default timeout is infinite though.
> > > > >
> > > > >     Jun
> > > > >
> > > > >
> > > > >     On Sat, Oct 28, 2017 at 11:48 PM, Stephane Maarek <
> > > > >     steph...@simplemachines.com.au> wrote:
> > > > >
> > > > >     > Hi Jun,
> > > > >     >
> > > > >     > I think this is very helpful. Restarting Kafka brokers in case
> > > > >     > of a zookeeper host change is not a well-known operation.
> > > > >     >
> > > > >     > A few questions:
> > > > >     > 1) would it not be worth fixing the problem at the source? This
> > > > >     > has been stuck for a while though, maybe a little push would
> > > > >     > help:
> > > > >     > https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZOOKEEPER-2184
> > > > >     >
> > > > >     > 2) upon recreating the zookeeper object, is it not possible to
> > > > >     > invalidate the DNS cache so that it resolves the new hostname?
> > > > >     >
> > > > >     > 3) could the cluster be down in this situation: one migrates an
> > > > >     > entire zookeeper cluster to new machines (one by one). The
> > > > >     > quorum is still alive without downtime, but now every broker in
> > > > >     > the cluster can't resolve zookeeper at the same time. They
> > > > >     > would all shut down at the same time once the new time-out
> > > > >     > expires.
> > > > >     >
> > > > >     > Thanks !
> > > > >     > Stéphane
> > > > >     >
> > > > >     > On 28 Oct. 2017 9:42 am, "Jun Rao" <j...@confluent.io> wrote:
> > > > >     >
> > > > >     > > Hi, Everyone,
> > > > >     > >
> > > > >     > > We created "KIP-217: Expose a timeout to allow an expired
> ZK
> > > > > session to
> > > > >     > be
> > > > >     > > re-created".
> > > > >     > >
> > > > >     > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > >     > > 217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+
> > > > > to+be+re-created
> > > > >     > >
> > > > >     > > Please take a look and provide your feedback.
> > > > >     > >
> > > > >     > > Thanks,
> > > > >     > >
> > > > >     > > Jun
> > > > >     > >
> > > > >     >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>



-- 

*Jeff Widman*
jeffwidman.com <http://www.jeffwidman.com/> | 740-WIDMAN-J (943-6265)
<><
