Hi Colin,

That’s fair, though I am unsure whether a delay + metric + log message
would really serve our purpose. In almost all cases, no action would be
required from the operator, and a signal that is not actionable in 99% of
cases may not be very useful, in my opinion.

Additionally, if we add in a delay, we would need to reason about the
behavior when the same topic is recreated while a stray partition has been
queued for deletion.
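
To make the concern concrete, here is a rough sketch of what a delayed
deletion queue would have to handle. This is only an illustration under
my own assumptions, not anything from the KIP: the class and method
names are hypothetical, and partitions are keyed by name only, since
topic IDs are not available before KIP-516.

```python
from dataclasses import dataclass, field


@dataclass
class DelayedStrayDeleter:
    """Hypothetical delayed-deletion queue, keyed by partition name only."""

    delay_ms: int
    # partition name -> time (ms) at which the deletion becomes due
    pending: dict = field(default_factory=dict)

    def mark_stray(self, partition: str, now_ms: int) -> None:
        # Keep the earliest deadline if the partition is already queued.
        self.pending.setdefault(partition, now_ms + self.delay_ms)

    def on_assignment(self, partition: str) -> None:
        # The topic was recreated and assigned back to this broker while
        # the old deletion was still queued. Without a topic ID we cannot
        # tell the two generations apart, so the only safe option in this
        # sketch is to cancel the pending deletion.
        self.pending.pop(partition, None)

    def due(self, now_ms: int) -> list:
        # Return (and dequeue) the partitions whose delay has elapsed.
        ready = sorted(p for p, t in self.pending.items() if t <= now_ms)
        for p in ready:
            del self.pending[p]
        return ready
```

Without the cancellation in on_assignment, a queued deletion for the old
"foo-0" would fire against the newly created "foo-0" once the delay
expired.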

I would be in support of adding a configuration to disable stray partition
deletion. This way, if users find abnormal behavior when testing /
upgrading development environments, they could choose to disable the
feature altogether.
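
As a minimal sketch of how such a switch could gate the detection step
(the function and the deletion_enabled flag are hypothetical names of
mine, not an existing broker configuration), stray partitions are simply
the ones the broker hosts but has not been assigned:

```python
def find_stray_partitions(hosted, assigned, deletion_enabled=True):
    """Return the hosted partitions that are not in the assigned set.

    deletion_enabled stands in for a hypothetical broker configuration;
    when it is False, nothing is ever reported as stray.
    """
    if not deletion_enabled:
        return set()
    return set(hosted) - set(assigned)
```

With the flag turned off the broker would behave exactly as it does
today and leave any extra logs on disk.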

Let me know what you think. It would be good to hear what others think as
well.

Thanks,
Dhruvil

On Thu, Jan 16, 2020 at 3:24 AM Colin McCabe <cmcc...@apache.org> wrote:

> On Wed, Jan 15, 2020, at 03:54, Dhruvil Shah wrote:
> > Hi Colin,
> >
> > We could add a configuration to disable stray partition deletion if
> > needed, but I wasn't sure if an operator would really want to disable it.
> > Perhaps if the implementation were buggy, the configuration could be used
> > to disable the feature until a bug fix is made. Is that the kind of use
> > case you were thinking of?
> >
> > I was thinking that there would not be any delay between detection and
> > deletion of stray logs. We would schedule an async task to do the actual
> > deletion though.
>
> Based on my experience in HDFS, immediately deleting data that looks out
> of place can cause severe issues when a bug occurs.  See
> https://issues.apache.org/jira/browse/HDFS-6186 for details.  So I really
> do think there should be a delay, and a metric + log message in the
> meantime to alert the operators to what is about to happen.
>
> best,
> Colin
>
> >
> > Thanks,
> > Dhruvil
> >
> > On Tue, Jan 14, 2020 at 11:04 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Hi Dhruvil,
> > >
> > > Thanks for the KIP.  I think there should be some way to turn this off,
> > > in case that becomes necessary.  I'm also curious how long we intend to
> > > wait between detecting the duplication and deleting the extra logs.  The
> > > KIP says "scheduled for deletion" but doesn't give a time frame -- is it
> > > assumed to be immediate?
> > >
> > > best,
> > > Colin
> > >
> > >
> > > On Tue, Jan 14, 2020, at 05:56, Dhruvil Shah wrote:
> > > > If there are no more questions or concerns, I will start a vote thread
> > > > tomorrow.
> > > >
> > > > Thanks,
> > > > Dhruvil
> > > >
> > > > On Mon, Jan 13, 2020 at 6:59 PM Dhruvil Shah <dhru...@confluent.io> wrote:
> > > >
> > > > > Hi Nikhil,
> > > > >
> > > > > Thanks for looking at the KIP. The kind of race condition you mention
> > > > > is not possible as stray partition detection is done synchronously
> > > > > while handling the LeaderAndIsrRequest. In other words, we atomically
> > > > > evaluate the partitions the broker must host and the extra partitions
> > > > > it is hosting and schedule deletions based on that.
> > > > >
> > > > > One possible shortcoming of the KIP is that we do not have the ability
> > > > > to detect a stray partition if the topic has been recreated since. We
> > > > > will have the ability to disambiguate between different generations of
> > > > > a partition with KIP-516.
> > > > >
> > > > > Thanks,
> > > > > Dhruvil
> > > > >
> > > > > On Sat, Jan 11, 2020 at 11:40 AM Nikhil Bhatia <nik...@confluent.io> wrote:
> > > > >
> > > > >> Thanks Dhruvil, the proposal looks reasonable to me.
> > > > >>
> > > > >> Is there a potential for a race between a new topic being assigned to
> > > > >> the same node that is still performing a cleanup of the stray
> > > > >> partition? Topic ID will definitely solve this issue.
> > > > >>
> > > > >> Thanks
> > > > >> Nikhil
> > > > >>
> > > > >> On 2020/01/06 04:30:20, Dhruvil Shah <d...@confluent.io> wrote:
> > > > > >> > Here is the link to the KIP:
> > > > > >> >
> > > > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-550%3A+Mechanism+to+Delete+Stray+Partitions+on+Broker
> > > > > >> >
> > > > > >> > On Mon, Jan 6, 2020 at 9:59 AM Dhruvil Shah <dh...@confluent.io> wrote:
> > > > > >> >
> > > > > >> > > Hi all, I would like to kick off discussion for KIP-550 which proposes a
> > > > > >> > > mechanism to detect and delete stray partitions on a broker. Suggestions
> > > > > >> > > and feedback are welcome.
> > > > > >> > >
> > > > > >> > > - Dhruvil
> > > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>
