Re: [DISCUSS] KIP-280: Enhanced log compaction

Luís Cabral Mon, 09 Apr 2018 02:25:30 -0700

Hi,

bq. About naming: the broker config has `cleaner` in it, while the topic config 
does not. Is might be more consistent if either both have `cleaner` or none of 
them? (Personally, I would prefer to strip `cleaner` for the broker config.)

Do you mean the "log.cleaner." prefix attached to the global config?
If so, then this seems to be the naming approach used in this project, so I 
would rather stick to it (changing this will likely lead to a bigger --off 
topic-- discussion).

bq. The sentence "toggle the compaction strategy to this approach" does not 
make clear that is should be the default -- even if you follow a common Kafka 
pattern, the KIP should make it explicit (new people might not be familiar with 
the pattern and cannot infer from the name itself that it is supposed to be a 
global default setting that can be overwritten on a per-topic bases with the 
second config).

Sorry, I'm having some problems understanding this part.
If it is about the meaning of the properties (global vs local), then isn't this 
explained in the Kafka Docs? 
  https://kafka.apache.org/documentation

Otherwise, could you kindly elaborate?

bq. It might also be good, to elaborate why you suggest "long" for the 
compaction value is the KIP itself.

I would prefer that the definition simply be "long", to keep the code contract 
cleaner, and leave the clients to infer the usages according to their scenarios.
But if its important, then I can add it (let me know).

bq. One more though: the KIP basically allows, that a record with larger offset 
is deleted while a record with smaller offset is preserved (if the record with 
smaller offset has a larger "compaction value" than the record with the larger 
offset). I don't see a issue with this atm, just wanted to point it out, as it 
seems to be an important change in behavior (compaction does not strictly "move 
forward" any longer if you wish).

This is the whole purpose of the KIP, though. Is it not clear that this is the 
intention?

bq. Think you forgot to actually add this case. :)

Its added there, let me know if you have trouble getting to it.
Quoting: "When both records being compacted contain a matching "compaction 
value", then the record with the highest offset will be kept;"

bq. Do you effectively mean, that the value has exactly 8 bytes?

Not exactly... At the moment, the code in the pull request expects that the 
header value supports "byte[] => String => Long" as a valid conversion, so 
having a direct "long-as-byte[]" is not considered to be valid.
I am not sure which practice is preferred in Kafka, but given the base approach 
of "byte[]" as the header value, it seemed saner to support a String value 
rather than a Long.
Sadly, since Kafka needs to comprehend and process the actual value, I could 
not find a way to support both approaches.

This last part is tricky, and where I expected more debates to arise, though in 
the end it can be boiled down to a "Long vs String" thing, so I hope it doesn't 
become a blocker...

Thanks again for having a look and for the valuable feedback!
Cheers

On Sunday, April 8, 2018, 10:41:39 PM GMT+2, Matthias J. Sax 
<matth...@confluent.io> wrote: 

Thanks for clarification. This make sense to me.

About naming: the broker config has `cleaner` in it, while the topic
config does not. Is might be more consistent if either both have
`cleaner` or none of them? (Personally, I would prefer to strip
`cleaner` for the broker config.)

The sentence "toggle the compaction strategy to this approach" does not
make clear that is should be the default -- even if you follow a common
Kafka pattern, the KIP should make it explicit (new people might not be
familiar with the pattern and cannot infer from the name itself that it
is supposed to be a global default setting that can be overwritten on a
per-topic bases with the second config).

For the timestamp compaction, long is good enough. The idea about my
comment was really to add "timestamp" as reserved value for the
parameter and this should use the record's metadata timestamp (instead
of a header key named 'timestamp'). Before you update the KIP, other
might want to share their opinion about this.

It might also be good, to elaborate why you suggest "long" for the
compaction value is the KIP itself.

One more though: the KIP basically allows, that a record with larger
offset is deleted while a record with smaller offset is preserved (if
the record with smaller offset has a larger "compaction value" than the
record with the larger offset). I don't see a issue with this atm, just
wanted to point it out, as it seems to be an important change in
behavior (compaction does not strictly "move forward" any longer if you
wish).

>> bq. With the header approach it is not ensured that each record uses a 
>> unique "compaction value" (in contrast to offsets). Thus, what should the 
>> behaviour be, if two messages have the same "compaction value" in the 
>> header? (For timestamps, there is the same issue, and one idea was to use 
>> the offset as tie-breaker)
>> 
>> Sorry, I forgot to mention that in the KIP. In the pull request used with 
>> the KIP you can see that it is indeed using the offset as a tie-breaker in 
>> case the header values are the same.
>> I’ll make this clear by adding it as part of the proposed changes.

Think you forgot to actually add this case. :)

One more question:

> cast-able to "long"

Do you effectively mean, that the value has exactly 8 bytes?

-Matthias

On 4/8/18 3:44 AM, Luís Cabral wrote:
> Hi Matthias,
> 
> 
> bq. Why do we need two new configs? Why is the topic config 
> `compaction.strategy` not sufficient?
> 
> As I understand these configurations, one allows you to configure the default 
> for all topics while the other allows you to configure a single topic 
> directly.
> If this is incorrect, or if having a global toggle is not desired, then I 
> have no issues with having only the topic-relevant configuration.
> 
> 
> bq. For Kafka Streams we did think about a timestamp base compaction at some 
> point (internal brain storming)---we never thought this through in details, 
> but it might be a good idea to discuss it in this KIP and maybe piggy-back it 
> if we want it (as a second pre-defined strategy "timestamp" next to "offset"?)
> 
> The reason why I went for a “long” value here was mainly to support the 2 
> most common versioning patterns around: incremental numerals and timestamp 
> (long representing milliseconds since 0h, January 1, 1970 GMT).
> Is this not enough to represent the strategy you guys had in mind? I would 
> love to hear more about those discussions so this KIP can fulfil some more 
> requirements that I am not aware of at the moment.
> 
> 
> bq. With the header approach it is not ensured that each record uses a unique 
> "compaction value" (in contrast to offsets). Thus, what should the behaviour 
> be, if two messages have the same "compaction value" in the header? (For 
> timestamps, there is the same issue, and one idea was to use the offset as 
> tie-breaker)
> 
> Sorry, I forgot to mention that in the KIP. In the pull request used with the 
> KIP you can see that it is indeed using the offset as a tie-breaker in case 
> the header values are the same.
> I’ll make this clear by adding it as part of the proposed changes.
> 
> 
> bq. What should the behaviour be, if a message does not encode the 
> "compaction key" in the header?
> 
> The intention is that if both records being compared don’t have this value, 
> then the offset is used instead. However, if only one of these records 
> doesn’t have it, then whichever record has a “compaction key” is kept (as the 
> other is considered to be anomalous).
> I’ll also add this to the proposed changes in the KIP to highlight these 
> fall-back behaviours.
> 
> 
> Thank you for the feedback and looking forward for more replies!
> 
> Cheers
> 
> 
> From: Matthias J. Sax
> Sent: 08 April 2018 05:29
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-280: Enhanced log compaction
> 
> Luís,
> 
> thanks a lot for this KIP. Very interesting idea.
> 
> Couple of questions:
> 
>  - Why do we need two new configs? Why is the topic config
> `compaction.strategy` not sufficient?
> 
>  - For Kafka Streams we did think about a timestamp base compaction at
> some point (internal brain storming)---we never thought this through in
> details, but it might be a good idea to discuss it in this KIP and maybe
> piggy-back it if we want it (as a second pre-defined strategy
> "timestamp" next to "offset"?)
> 
>  - With the header approach it is not ensured that each record uses a
> unique "compaction value" (in contrast to offsets). Thus, what should
> the behavior be, if two messages have the same "compaction value" in the
> header? (For timestamps, there is the same issue, and one idea was to
> use the offset as tie-breaker)
> 
>  - What should the behavior be, if a message does not encode the
> "compaction key" in the header?
> 
> 
> -Matthias
> 
> 
> On 4/5/18 11:59 PM, Luís Cabral wrote:
>>  
>> Thank you very much for taking the time to read it.
>>
>> bq. In the 'Proposed Changes' section, can you expand 'OCC' ?
>> I've made the 'OCC' into a link pointing to the appropriate Wiki page 
>> explaining what it is. This is not a particularly important part of the 
>> change, it is just to reference the similarity between this proposal and the 
>> version control offered by OCC.
>>
>> bq. Is it possible to enumerate the keys ?
>> Do you mean hard-coding the header key used, rather than using a free-text 
>> solution? If I were to hard-code header with key "version", for example, 
>> then this may conflict with other clients that already use this header for 
>> something else, making it cumbersome for them to try and use this strategy, 
>> should they want it.
>> If I misunderstood your points, then please correct me. I appreciate the 
>> feedback!    On Thursday, April 5, 2018, 5:13:47 PM GMT+2, Ted Yu 
>> <yuzhih...@gmail.com> wrote:  
>>  
>>  In the 'Proposed Changes' section, can you expand 'OCC' ?
>>
>> bq. Specifically changing this to anything other than "*offset*"
>>
>> Is it possible to enumerate the keys ? In the future, more metadata would
>> be defined in record header - it is better to avoid collision.
>>
>> Cheers
>>
>> On Thu, Apr 5, 2018 at 2:05 AM, Luís Cabral <luis_cab...@yahoo.com.invalid>
>> wrote:
>>
>>>
>>> This is embarassingly hard to fix... going again...
>>> ----
>>> KIP-280:  https://cwiki.apache.org/confluence/display/
>>> KAFKA/KIP-280%3A+Enhanced+log+compaction
>>> -----
>>> Pull-4822:  https://github.com/apache/kafka/pull/4822
>>>
>>>
>>>     On Thursday, April 5, 2018, 11:03:22 AM GMT+2, Luís Cabral
>>> <luis_cab...@yahoo.com.INVALID> wrote:
>>>
>>>   Fixing the links:KIP-280:  https://cwiki.apache.org/confluence/display/
>>> KAFKA/KIP-280%3A+Enhanced+log+compactionPull-4822:  https://
>>> github.com/apache/kafka/pull/4822
>>>
>>>
>>> On 2018/04/0508:44:00, Luís Cabral <l...@yahoo.com.INVALID> wrote:
>>>> Helloall,>
>>>> Starting adiscussion for this feature.>
>>>> KIP-280  :  https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>> 280%3A+Enhanced+log+compactionPull-4822:  https://github.com/apache/
>>> kafka/pull/4822>
>>>
>>>> KindRegards,Luís>
>>>
>>>  
> 
> 
>

Re: [DISCUSS] KIP-280: Enhanced log compaction

Reply via email to