Mayuresh,

Thanks for the comments.
The requirement is that segments older than maxCompactionLagMs must be
picked up for compaction.
maxCompactionLagMs is an upper bound, which implies that picking up
segments for compaction earlier doesn't violate the policy.
We use the creation time of a segment as an estimate of its records'
arrival time, so these records are compacted no later than
maxCompactionLagMs after arrival.

On the other hand, compaction is an expensive operation, so we don't want
to compact the log partition whenever a new segment is sealed.
Therefore, we want to pick up a segment for compaction when the segment is
close to the mandatory max compaction lag (which is why we use segment
creation time as the estimate).
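To make the estimate concrete, here is a minimal sketch of the eligibility check in Python (illustrative only, not the broker code; the function and parameter names are hypothetical):

```python
import time

DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds


def needs_compaction(segment_create_time_ms, max_compaction_lag_ms, now_ms=None):
    """Use the segment's creation time as an estimate of its records'
    arrival time. Compacting once the creation time exceeds the max lag
    is conservative: the first record cannot have arrived before the
    segment was created, so the upper bound is never violated."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - segment_create_time_ms >= max_compaction_lag_ms


# A segment created 8 days ago with a 7-day max lag must be compacted;
# one created 6 days ago can still wait.
now = 100 * DAY_MS
print(needs_compaction(now - 8 * DAY_MS, 7 * DAY_MS, now))  # True
print(needs_compaction(now - 6 * DAY_MS, 7 * DAY_MS, now))  # False
```

Because the check only triggers compaction earlier than the true first-record timestamp would, it may compact sooner than strictly necessary, but never later than the lag allows.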


Xiongqi (Wesley) Wu


On Mon, Oct 15, 2018 at 5:54 PM Mayuresh Gharat <gharatmayures...@gmail.com>
wrote:

> Hi Wesley,
>
> Thanks for the KIP and sorry for being late to the party.
> I wanted to understand the scenario you mentioned in Proposed Changes:
>
> -
> >
> > Estimate the earliest message timestamp of an un-compacted log segment.
> > We only need to estimate the earliest message timestamp for un-compacted log
> > segments to ensure timely compaction because the deletion requests that
> > belong to compacted segments have already been processed.
> >
> >    1.
> >
> >    for the first (earliest) log segment:  The estimated earliest
> >    timestamp is set to the timestamp of the first message if timestamp is
> >    present in the message. Otherwise, the estimated earliest timestamp
> is set
> >    to "segment.largestTimestamp - maxSegmentMs”
> >     (segment.largestTimestamp is lastModified time of the log segment or
> max
>    timestamp we see for the log segment.). In the latter case, the actual
> >    timestamp of the first message might be later than the estimation,
> but it
> >    is safe to pick up the log for compaction earlier.
> >
> > When we say "actual timestamp of the first message might be later than
> the
> estimation, but it is safe to pick up the log for compaction earlier.",
> doesn't that violate the assumption that we will consider a segment for
> compaction only if the creation time of the segment has crossed the "now -
> maxCompactionLagMs" ?
>
> Thanks,
>
> Mayuresh
>
> On Mon, Sep 3, 2018 at 7:28 PM Brett Rann <br...@zendesk.com.invalid>
> wrote:
>
> > Might also be worth moving to a vote thread? Discussion seems to have
> gone
> > as far as it can.
> >
> > > On 4 Sep 2018, at 12:08, xiongqi wu <xiongq...@gmail.com> wrote:
> > >
> > > Brett,
> > >
> > > Yes, I will post PR tomorrow.
> > >
> > > Xiongqi (Wesley) Wu
> > >
> > >
> > > On Sun, Sep 2, 2018 at 6:28 PM Brett Rann <br...@zendesk.com.invalid>
> > wrote:
> > >
> > > > +1 (non-binding) from me on the interface. I'd like to see someone
> > familiar
> > > > with
> > > > the code comment on the approach, and note there's a couple of
> > different
> > > > approaches: what's documented in the KIP, and what Xiaohe Dong was
> > working
> > > > on
> > > > here:
> > > >
> > > >
> >
> https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> > > >
> > > > If you have code working already Xiongqi Wu could you share a PR? I'd
> > be
> > > > happy
> > > > to start testing.
> > > >
> > > > On Tue, Aug 28, 2018 at 5:57 AM xiongqi wu <xiongq...@gmail.com>
> > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Do you have any additional comments on this KIP?
> > > > >
> > > > >
> > > > > On Thu, Aug 16, 2018 at 9:17 PM, xiongqi wu <xiongq...@gmail.com>
> > wrote:
> > > > >
> > > > > > on 2)
> > > > > > The offset map is built starting from the dirty segments.
> > > > > > The compaction starts from the beginning of the log partition.
> > > > > > That's how it ensures the deletion of tombstone keys.
> > > > > > I will double check tomorrow.
> > > > > >
> > > > > > Xiongqi (Wesley) Wu
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 16, 2018 at 6:46 PM Brett Rann
> > <br...@zendesk.com.invalid>
> > > > > > wrote:
> > > > > >
> > > > > >> To just clarify a bit on 1. whether there's an external
> storage/DB
> > > > isn't
> > > > > >> relevant here.
> > > > > >> Compacted topics allow a tombstone record to be sent (a null
> value
> > > > for a
> > > > > >> key) which
> > > > > >> currently will result in old values for that key being deleted
> if
> > some
> > > > > >> conditions are met.
> > > > > >> There are existing controls to make sure the old values will
> stay
> > > > around
> > > > > >> for a minimum
> > > > > >> time at least, but no dedicated control to ensure the tombstone
> > will
> > > > > >> delete
> > > > > >> within a
> > > > > >> maximum time.
> > > > > >>
> > > > > >> One popular reason that maximum time for deletion is desirable
> > right
> > > > now
> > > > > >> is
> > > > > >> GDPR with
> > > > > >> PII. But we're not proposing any GDPR awareness in kafka, just
> > being
> > > > > able
> > > > > >> to guarantee
> > > > > >> a max time where a tombstoned key will be removed from the
> > compacted
> > > > > >> topic.
> > > > > >>
> > > > > >> on 2)
> > > > > >> huh, i thought it kept track of the first dirty segment and
> didn't
> > > > > >> recompact older "clean" ones.
> > > > > >> But I didn't look at code or test for that.
> > > > > >>
> > > > > >> On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <
> xiongq...@gmail.com>
> > > > > wrote:
> > > > > >>
> > > > > >> > 1. The owner of the data (in this sense, Kafka is not the owner of
> > > > > >> > the data) should keep track of the lifecycle of the data in some
> > > > > >> > external storage/DB. The owner determines when to delete the data
> > > > > >> > and sends the delete request to Kafka. Kafka doesn't know about the
> > > > > >> > content of the data; it only provides a means for deletion.
> > > > > >> >
> > > > > >> > 2. Each time compaction runs, it starts from the first segment (no
> > > > > >> > matter whether it is compacted or not). The time estimation here is
> > > > > >> > only used to determine whether we should run compaction on this log
> > > > > >> > partition. So we only need to estimate uncompacted segments.
> > > > > >> >
> > > > > >> > On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <
> lindon...@gmail.com>
> > > > > wrote:
> > > > > >> >
> > > > > >> > > Hey Xiongqi,
> > > > > >> > >
> > > > > >> > > Thanks for the update. I have two questions for the latest
> > KIP.
> > > > > >> > >
> > > > > >> > > 1) The motivation section says that one use case is to
> delete
> > PII
> > > > > >> > (Personal
> > > > > >> > > Identifiable information) data within 7 days while keeping
> > non-PII
> > > > > >> > > indefinitely in compacted format. I suppose the use-case
> > depends
> > > > on
> > > > > >> the
> > > > > >> > > application to determine when to delete those PII data.
> Could
> > you
> > > > > >> explain
> > > > > >> > > how can application reliably determine the set of keys that
> > should
> > > > > be
> > > > > >> > > deleted? Is the application required to always read messages from the
> > topic
> > > > > >> after
> > > > > >> > > every restart and determine the keys to be deleted by
> looking
> > at
> > > > > >> message
> > > > > >> > > timestamp, or is application supposed to persist the key->
> > > > timestamp
> > > > > >> > > information in a separate persistent storage system?
> > > > > >> > >
> > > > > >> > > 2) It is mentioned in the KIP that "we only need to estimate
> > > > > earliest
> > > > > >> > > message timestamp for un-compacted log segments because the
> > > > deletion
> > > > > >> > > requests that belong to compacted segments have already been
> > > > > >> processed".
> > > > > >> > > Not sure if it is correct. If a segment is compacted before
> > user
> > > > > sends
> > > > > >> > > message to delete a key in this segment, it seems that we
> > still
> > > > need
> > > > > >> to
> > > > > >> > > ensure that the segment will be compacted again within the
> > given
> > > > > time
> > > > > >> > after
> > > > > >> > > the deletion is requested, right?
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Dong
> > > > > >> > >
> > > > > >> > > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <
> > xiongq...@gmail.com
> > > > >
> > > > > >> > wrote:
> > > > > >> > >
> > > > > >> > > > Hi Xiaohe,
> > > > > >> > > >
> > > > > >> > > > Quick note:
> > > > > >> > > > 1) Use minimum of segment.ms and max.compaction.lag.ms
> > > > > >> > > >
> > > > > >> > > > 2) I am not sure if I get your second question. First, we have
> > > > > >> > > > jitter when we roll the active segment. Second, on each
> > > > > >> > > > compaction, we compact up to what the offset map allows. These
> > > > > >> > > > will not lead to a compaction storm over time. In addition, I
> > > > > >> > > > expect max.compaction.lag.ms to be set on the order of days.
> > > > > >> > > >
> > > > > >> > > > 3) I don't have access to the confluent community slack
> for
> > > > now. I
> > > > > >> am
> > > > > >> > > > reachable via Google Hangouts.
> > > > > >> > > > To avoid the double effort, here is my plan:
> > > > > >> > > > a) Collect more feedback and feature requirements on the KIP.
> > > > > >> > > > b) Wait until this KIP is approved.
> > > > > >> > > > c) I will address any additional requirements in the
> > > > > >> > > > implementation. (My current implementation only complies with
> > > > > >> > > > whatever is described in the KIP now.)
> > > > > >> > > > d) I can share the code with you and the community to see if
> > > > > >> > > > you want to add anything.
> > > > > >> > > > e) Submission through a committer.
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <
> > > > > >> dannyriv...@gmail.com>
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi Xiongqi
> > > > > >> > > > >
> > > > > >> > > > > Thanks for thinking about implementing this as well. :)
> > > > > >> > > > >
> > > > > >> > > > > I was thinking about using `segment.ms` to trigger the
> > > > segment
> > > > > >> roll.
> > > > > >> > > > > Also, its value can be the largest time bias for the
> > record
> > > > > >> deletion.
> > > > > >> > > For
> > > > > >> > > > > example, if the `segment.ms` is 1 day and `
> > max.compaction.ms`
> > > > > is
> > > > > >> 30
> > > > > >> > > > days,
> > > > > >> > > > > the compaction may happen around 31 days.
> > > > > >> > > > >
> > > > > >> > > > > For my curiosity, is there a way we can do some
> > performance
> > > > test
> > > > > >> for
> > > > > >> > > this
> > > > > >> > > > > and any tools you can recommend. As you know,
> previously,
> > it
> > > > is
> > > > > >> > cleaned
> > > > > >> > > > up
> > > > > >> > > > > by respecting dirty ratio, but now it may happen anytime
> > if
> > > > max
> > > > > >> lag
> > > > > >> > has
> > > > > >> > > > > passed for each message. I wonder what would happen if
> > clients
> > > > > >> send
> > > > > >> > > huge
> > > > > >> > > > > amount of tombstone records at the same time.
> > > > > >> > > > >
> > > > > >> > > > > I am looking forward to have a quick chat with you to
> > avoid
> > > > > double
> > > > > >> > > effort
> > > > > >> > > > > on this. I am in confluent community slack during the
> work
> > > > time.
> > > > > >> My
> > > > > >> > > name
> > > > > >> > > > is
> > > > > >> > > > > Xiaohe Dong. :)
> > > > > >> > > > >
> > > > > >> > > > > Rgds
> > > > > >> > > > > Xiaohe Dong
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > On 2018/08/16 01:22:22, xiongqi wu <xiongq...@gmail.com
> >
> > > > wrote:
> > > > > >> > > > > > Brett,
> > > > > >> > > > > >
> > > > > >> > > > > > Thank you for your comments.
> > > > > >> > > > > > Since we already have an immediate-compaction setting
> > > > > >> > > > > > (min dirty ratio = 0), I decided to use "0" as the disabled
> > > > > >> > > > > > state. I am OK to go with the -1 (disable), 0 (immediate)
> > > > > >> > > > > > options instead.
> > > > > >> > > > > >
> > > > > >> > > > > > For the implementation, there are a few differences
> > between
> > > > > mine
> > > > > >> > and
> > > > > >> > > > > > "Xiaohe Dong"'s :
> > > > > >> > > > > > 1) I used the estimated creation time of a log segment
> > > > instead
> > > > > >> of
> > > > > >> > > > largest
> > > > > >> > > > > > timestamp of a log to determine the compaction
> > eligibility,
> > > > > >> > because a
> > > > > >> > > > log
> > > > > >> > > > > > segment might stay as an active segment up to "max
> > > > compaction
> > > > > >> lag".
> > > > > >> > > > (see
> > > > > >> > > > > > the KIP for detail).
> > > > > >> > > > > > 2) I measure how many bytes we must clean to
> > follow the
> > > > > >> "max
> > > > > >> > > > > > compaction lag" rule, and use that to determine the
> > order of
> > > > > >> > > > compaction.
> > > > > >> > > > > > 3) force active segment to roll to follow the "max
> > > > compaction
> > > > > >> lag"
> > > > > >> > > > > >
> > > > > >> > > > > > I can share my code so we can coordinate.
> > > > > >> > > > > >
> > > > > >> > > > > > I haven't thought about a new API to force a compaction.
> > > > > >> > > > > > What is the use case for this one?
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann
> > > > > >> > > <br...@zendesk.com.invalid
> > > > > >> > > > >
> > > > > >> > > > > > wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > > We've been looking into this too.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Mailing list:
> > > > > >> > > > > > > https://lists.apache.org/thread.html/
> > > > > >> > > ed7f6a6589f94e8c2a705553f364ef
> > > > > >> > > > > > > 599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org
> %3E
> > > > > >> > > > > > > jira wish:
> > > > https://issues.apache.org/jira/browse/KAFKA-7137
> > > > > >> > > > > > > confluent slack discussion:
> > > > > >> > > > > > >
> > https://confluentcommunity.slack.com/archives/C49R61XMM/
> > > > > >> > > > > p1530760121000039
> > > > > >> > > > > > >
> > > > > >> > > > > > > A person on my team has started on code so you might
> > want
> > > > to
> > > > > >> > > > > coordinate:
> > > > > >> > > > > > >
> > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-
> > > > > >> > > > > > > cleaner-compaction-max-lifetime-2.0
> > > > > >> > > > > > >
> > > > > >> > > > > > > He's been working with Jason Gustafson and James
> Chen
> > > > around
> > > > > >> the
> > > > > >> > > > > changes.
> > > > > >> > > > > > > You can ping him on confluent slack as Xiaohe Dong.
> > > > > >> > > > > > >
> > > > > >> > > > > > > It's great to know others are thinking on it as
> well.
> > > > > >> > > > > > >
> > > > > >> > > > > > > You've added the requirement to force a segment roll
> > which
> > > > > we
> > > > > >> > > hadn't
> > > > > >> > > > > gotten
> > > > > >> > > > > > > to yet, which is great. I was content with it not
> > > > including
> > > > > >> the
> > > > > >> > > > active
> > > > > >> > > > > > > segment.
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Adding topic level configuration "
> > max.compaction.lag.ms
> > > > ",
> > > > > >> and
> > > > > >> > > > > > > corresponding broker configuration "
> > > > > >> > log.cleaner.max.compaction.la
> > > > > >> > > > g.ms
> > > > > >> > > > > ",
> > > > > >> > > > > > > which is set to 0 (disabled) by default.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Glancing at some other settings, the convention seems to
> > > > > >> > > > > > > me to be -1 for disabled (or infinite, which is more
> > > > > >> > > > > > > meaningful here). 0 to me implies instant, a little
> > > > > >> > > > > > > quicker than 1.
> > > > > >> > > > > > >
> > > > > >> > > > > > > We've been trying to think about a way to trigger
> > > > compaction
> > > > > >> as
> > > > > >> > > well
> > > > > >> > > > > > > through an API call, which would need to be flagged
> > > > > somewhere
> > > > > >> (ZK
> > > > > >> > > > > admin/
> > > > > >> > > > > > > space?) but we're struggling to think how that would
> > be
> > > > > >> > coordinated
> > > > > >> > > > > across
> > > > > >> > > > > > > brokers and partitions. Have you given any thought
> to
> > > > that?
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <
> > > > > >> xiongq...@gmail.com>
> > > > > >> > > > > wrote:
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Eno, Dong,
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > I have updated the KIP. We decide not to address
> the
> > > > issue
> > > > > >> that
> > > > > >> > > we
> > > > > >> > > > > might
> > > > > >> > > > > > > > have for both compaction and time retention
> enabled
> > > > topics
> > > > > >> (see
> > > > > >> > > the
> > > > > >> > > > > > > > rejected alternative item 2). This KIP will only
> > ensure
> > > > > log
> > > > > >> can
> > > > > >> > > be
> > > > > >> > > > > > > > compacted after a specified time-interval.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > As suggested by Dong, we will also enforce "
> > > > > >> > > max.compaction.lag.ms"
> > > > > >> > > > > is
> > > > > >> > > > > > > not
> > > > > >> > > > > > > > less than "min.compaction.lag.ms".
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> > > > > >> > > > > Time-based
> > > > > >> > > > > > > log
> > > > > >> > > > > > > > compaction policy
> > > > > >> > > > > > > > <
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354
> > > > > >> > > > > Time-based
> > > > > >> > > > > > > log compaction policy>
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu <
> > > > > >> > xiongq...@gmail.com
> > > > > >> > > >
> > > > > >> > > > > wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Per discussion with Dong, he made a very good
> > point
> > > > that
> > > > > >> if
> > > > > >> > > > > compaction
> > > > > >> > > > > > > > > and time based retention are both enabled on a
> > topic,
> > > > > the
> > > > > >> > > > > compaction
> > > > > >> > > > > > > > might
> > > > > >> > > > > > > > > prevent records from being deleted on time. The
> > reason
> > > > > is
> > > > > >> > when
> > > > > >> > > > > > > compacting
> > > > > >> > > > > > > > > multiple segments into one single segment, the newly
> > > > > >> > > > > > > > > created segment will have the same lastModified
> > > > > >> > > > > > > > > timestamp as the latest original segment. We lose the
> > > > > >> > > > > > > > > timestamps of all original segments except the last
> > > > > >> > > > > > > > > one. As a result, records might not be deleted as they
> > > > > >> > > > > > > > > should be through time-based retention.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > With the current KIP proposal, if we want to
> > ensure
> > > > > timely
> > > > > >> > > > > deletion, we
> > > > > >> > > > > > > > > have the following configurations:
> > > > > >> > > > > > > > > 1) enable time based log compaction only :
> > deletion is
> > > > > >> done
> > > > > >> > > > though
> > > > > >> > > > > > > > > overriding the same key
> > > > > >> > > > > > > > > 2) enable time based log retention only:
> deletion
> > is
> > > > > done
> > > > > >> > > though
> > > > > >> > > > > > > > > time-based retention
> > > > > >> > > > > > > > > 3) enable both log compaction and time based
> > > > retention:
> > > > > >> > > Deletion
> > > > > >> > > > > is not
> > > > > >> > > > > > > > > guaranteed.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Not sure if we have use case 3 and also want
> > deletion
> > > > to
> > > > > >> > happen
> > > > > >> > > > on
> > > > > >> > > > > > > time.
> > > > > >> > > > > > > > > There are several options to address deletion
> > issue
> > > > when
> > > > > >> > enable
> > > > > >> > > > > both
> > > > > >> > > > > > > > > compaction and retention:
> > > > > >> > > > > > > > > A) During log compaction, looking into record
> > > > timestamp
> > > > > to
> > > > > >> > > delete
> > > > > >> > > > > > > expired
> > > > > >> > > > > > > > > records. This can be done in compaction logic
> > itself
> > > > or
> > > > > >> use
> > > > > >> > > > > > > > > AdminClient.deleteRecords() . But this assumes
> we
> > have
> > > > > >> record
> > > > > >> > > > > > > timestamp.
> > > > > >> > > > > > > > > B) retain the lastModifed time of original
> > segments
> > > > > during
> > > > > >> > log
> > > > > >> > > > > > > > compaction.
> > > > > >> > > > > > > > > This requires extra meta data to record the
> > > > information
> > > > > or
> > > > > >> > not
> > > > > >> > > > > grouping
> > > > > >> > > > > > > > > multiple segments into one during compaction.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > If we have use case 3 in general, I would prefer
> > > > > solution
> > > > > >> A
> > > > > >> > and
> > > > > >> > > > > rely on
> > > > > >> > > > > > > > > record timestamp.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Two questions:
> > > > > >> > > > > > > > > Do we have use case 3? Is it nice to have or
> must
> > > > have?
> > > > > >> > > > > > > > > If we have use case 3 and want to go with
> > solution A,
> > > > > >> should
> > > > > >> > we
> > > > > >> > > > > > > introduce
> > > > > >> > > > > > > > > a new configuration to enforce deletion by
> > timestamp?
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu <
> > > > > >> > > xiongq...@gmail.com
> > > > > >> > > > >
> > > > > >> > > > > > > wrote:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > >> Dong,
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Thanks for the comment.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> There are two retention policy: log compaction
> > and
> > > > time
> > > > > >> > based
> > > > > >> > > > > > > retention.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Log compaction:
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> we have use cases to keep infinite retention
> of a
> > > > topic
> > > > > >> > (only
> > > > > >> > > > > > > > >> compaction). GDPR cares about deletion of PII
> > > > (personal
> > > > > >> > > > > identifiable
> > > > > >> > > > > > > > >> information) data.
> > > > > >> > > > > > > > >> Since Kafka doesn't know what records contain
> > PII, it
> > > > > >> relies
> > > > > >> > > on
> > > > > >> > > > > upper
> > > > > >> > > > > > > > >> layer to delete those records.
> > > > > >> > > > > > > > >> For those infinite retention use cases, Kafka needs
> > > > > >> > > > > > > > >> to provide a way to
> > > > > >> > > > > > > > >> enforce compaction on time. This is what we try
> > to
> > > > > >> address
> > > > > >> > in
> > > > > >> > > > this
> > > > > >> > > > > > > KIP.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Time based retention,
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> There are also use cases that users of Kafka
> > might
> > > > want
> > > > > >> to
> > > > > >> > > > expire
> > > > > >> > > > > all
> > > > > >> > > > > > > > >> their data.
> > > > > >> > > > > > > > >> In those cases, they can use time based
> > retention of
> > > > > >> their
> > > > > >> > > > topics.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Regarding your first question, if a user wants
> to
> > > > > delete
> > > > > >> a
> > > > > >> > key
> > > > > >> > > > in
> > > > > >> > > > > the
> > > > > >> > > > > > > > >> log compaction topic, the user has to send a
> > deletion
> > > > > >> using
> > > > > >> > > the
> > > > > >> > > > > same
> > > > > >> > > > > > > > key.
> > > > > >> > > > > > > > >> Kafka only makes sure the deletion will happen
> > under
> > > > a
> > > > > >> > certain
> > > > > >> > > > > time
> > > > > >> > > > > > > > >> periods (like 2 days/7 days).
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Regarding your second question. In most cases,
> we
> > > > might
> > > > > >> want
> > > > > >> > > to
> > > > > >> > > > > delete
> > > > > >> > > > > > > > >> all duplicated keys at the same time.
> > > > > >> > > > > > > > >> Compaction might be more efficient since we
> need
> > to
> > > > > scan
> > > > > >> the
> > > > > >> > > log
> > > > > >> > > > > and
> > > > > >> > > > > > > > find
> > > > > >> > > > > > > > >> all duplicates. However, the expected use case
> > is to
> > > > > set
> > > > > >> the
> > > > > >> > > > time
> > > > > >> > > > > > > based
> > > > > >> > > > > > > > >> compaction interval on the order of days, and
> be
> > > > larger
> > > > > >> than
> > > > > >> > > > 'min
> > > > > >> > > > > > > > >> compaction lag". We don't want log compaction
> to
> > > > happen
> > > > > >> > > > frequently
> > > > > >> > > > > > > since
> > > > > >> > > > > > > > >> it is expensive. The purpose is to help low
> > > > production
> > > > > >> rate
> > > > > >> > > > topic
> > > > > >> > > > > to
> > > > > >> > > > > > > get
> > > > > >> > > > > > > > >> compacted on time. For the topic with "normal"
> > > > > >> > > > > > > > >> incoming message
> > > > > >> > > > > > > > >> rate, the "min dirty ratio" might have
> triggered
> > the
> > > > > >> > > compaction
> > > > > >> > > > > before
> > > > > >> > > > > > > > this
> > > > > >> > > > > > > > >> time based compaction policy takes effect.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> Eno,
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> For your question, like I mentioned we have
> long
> > time
> > > > > >> > > retention
> > > > > >> > > > > use
> > > > > >> > > > > > > case
> > > > > >> > > > > > > > >> for log compacted topic, but we want to provide
> > > > ability
> > > > > >> to
> > > > > >> > > > delete
> > > > > >> > > > > > > > certain
> > > > > >> > > > > > > > >> PII records on time.
> > > > > >> > > > > > > > >> Kafka itself doesn't know whether a record
> > contains
> > > > > >> > sensitive
> > > > > >> > > > > > > > information
> > > > > >> > > > > > > > >> and relies on the user for deletion.
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <
> > > > > >> > > lindon...@gmail.com>
> > > > > >> > > > > > > wrote:
> > > > > >> > > > > > > > >>
> > > > > >> > > > > > > > >>> Hey Xiongqi,
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>> Thanks for the KIP. I have two questions
> > regarding
> > > > the
> > > > > >> > > use-case
> > > > > >> > > > > for
> > > > > >> > > > > > > > >>> meeting
> > > > > >> > > > > > > > >>> GDPR requirement.
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>> 1) If I recall correctly, one of the GDPR
> > > > requirement
> > > > > is
> > > > > >> > that
> > > > > >> > > > we
> > > > > >> > > > > can
> > > > > >> > > > > > > > not
> > > > > >> > > > > > > > >>> keep messages longer than e.g. 30 days in
> > storage
> > > > > (e.g.
> > > > > >> > > Kafka).
> > > > > >> > > > > Say
> > > > > >> > > > > > > > there
> > > > > >> > > > > > > > >>> exists a partition p0 which contains message1
> > with
> > > > > key1
> > > > > >> and
> > > > > >> > > > > message2
> > > > > >> > > > > > > > with
> > > > > >> > > > > > > > >>> key2. And then user keeps producing messages
> > with
> > > > > >> key=key2
> > > > > >> > to
> > > > > >> > > > > this
> > > > > >> > > > > > > > >>> partition. Since message1 with key1 is never
> > > > > overridden,
> > > > > >> > > sooner
> > > > > >> > > > > or
> > > > > >> > > > > > > > later
> > > > > >> > > > > > > > >>> we
> > > > > >> > > > > > > > >>> will want to delete message1 and keep the
> latest
> > > > > message
> > > > > >> > with
> > > > > >> > > > > > > key=key2.
> > > > > >> > > > > > > > >>> But
> > > > > >> > > > > > > > >>> currently it looks like log compact logic in
> > Kafka
> > > > > will
> > > > > >> > > always
> > > > > >> > > > > put
> > > > > >> > > > > > > > these
> > > > > >> > > > > > > > >>> messages in the same segment. Will this be an
> > issue?
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>> 2) The current KIP intends to provide the
> > capability
> > > > > to
> > > > > >> > > delete
> > > > > >> > > > a
> > > > > >> > > > > > > given
> > > > > >> > > > > > > > >>> message in log compacted topic. Does such
> > use-case
> > > > > also
> > > > > >> > > require
> > > > > >> > > > > Kafka
> > > > > >> > > > > > > > to
> > > > > >> > > > > > > > >>> keep the messages produced before the given
> > message?
> > > > > If
> > > > > >> > yes,
> > > > > >> > > > > then we
> > > > > >> > > > > > > > can
> > > > > >> > > > > > > > >>> probably just use AdminClient.deleteRecords()
> or
> > > > > >> time-based
> > > > > >> > > log
> > > > > >> > > > > > > > retention
> > > > > >> > > > > > > > >>> to meet the use-case requirement. If no, do
> you
> > know
> > > > > >> what
> > > > > >> > is
> > > > > >> > > > the
> > > > > >> > > > > > > GDPR's
> > > > > >> > > > > > > > >>> requirement on time-to-deletion after user
> > > > explicitly
> > > > > >> > > requests
> > > > > >> > > > > the
> > > > > >> > > > > > > > >>> deletion
> > > > > >> > > > > > > > >>> (e.g. 1 hour, 1 day, 7 day)?
> > > > > >> > > > > > > > >>>
> > > > > >> > > > > > > > >>> Thanks,
> > > > > >> > > > > > > > >>> Dong
>
> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu <xiongq...@gmail.com> wrote:
>
> > Hi Eno,
> >
> > The GDPR request we are getting here at LinkedIn is: if we get a
> > request to delete a record through a null key on a log compacted
> > topic, we want to delete the record via compaction within a given time
> > period, like 2 days (whatever is required by the policy).
> >
> > There might be other issues (such as orphan log segments under certain
> > conditions) that lead to GDPR problems, but those are more like
> > something we need to fix anyway, regardless of GDPR.
> >
> > -- Xiongqi (Wesley) Wu
> >
> > On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska <eno.there...@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > Thanks for the KIP. I'd like to see a more precise definition of
> > > what part of GDPR you are targeting, as well as some sort of
> > > verification that this KIP actually addresses the problem. Right now
> > > I find this a bit vague:
> > >
> > > "Ability to delete a log message through compaction in a timely
> > > manner has become an important requirement in some use cases (e.g.,
> > > GDPR)"
> > >
> > > Is there any guarantee that after this KIP the GDPR problem is
> > > solved, or do we need to do something else as well, e.g., more KIPs?
> > >
> > > Thanks,
> > > Eno
> > >
> > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu <xiongq...@gmail.com>
> > > wrote:
> > >
> > > > Hi Kafka,
> > > >
> > > > This KIP tries to address the GDPR concern of fulfilling deletion
> > > > requests on time through time-based log compaction on a
> > > > compaction-enabled topic:
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> > > >
> > > > Any feedback will be appreciated.
> > > >
> > > > Xiongqi (Wesley) Wu
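KIP-354 proposes exposing this policy as the per-topic `max.compaction.lag.ms` setting. A hedged sketch of enabling it for the 2-day window Wesley mentions (the topic name and broker address are assumptions, and this requires a broker version that ships the feature):

```shell
# Sketch: bound how long a record can remain un-compacted on a compacted
# topic. Assumes a broker with KIP-354 support and a cluster reachable at
# localhost:9092; "user-records" is a hypothetical topic name.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name user-records \
  --alter --add-config 'cleanup.policy=compact,max.compaction.lag.ms=172800000'
# 172800000 ms = 2 days: a segment becomes eligible for compaction no
# later than this interval after its estimated earliest message time.
```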
>
> --
> Brett Rann
> Senior DevOps Engineer
> Zendesk International Ltd
> 395 Collins Street, Melbourne VIC 3000 Australia
> Mobile: +61 (0) 418 826 017
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>
