Re: [DISCUSS] KIP-1159: Large message reference based Serializer

Omnia Ibrahim Thu, 24 Apr 2025 09:42:10 -0700

Hi Luke, thanks for having the time to look into the KIP
> 2. It's not clear how the "retry count" comes into play in the KIP. It
> needs more explanation.
My initial thinking is the retry configs are a must for all blob stores, so we 
can provide them, and validate them for free for all blob stores so not every 
implementation will go through verifying them.


> 3. What does "LargeMessageFormatter" do in the process?
> I thought all we want to do is to replace the "large value data" into a
> path, and consumers will read the path via blob store class.
> All these should be done in serializer/deserializer, so why do we need the
> formatter?

I wanted to bit of more info than just the path to download, for example I want 
to add stuff like the class path for the original blob store for example if 
consumer is setup with the unmatched blob store to the one used during 
publishing. 
I have updated the KIP to simplify this by having this always as a simple json 
of path and publisher class which can be represented as PayloadReferenceValue. 
WDYT?

> 4. In the BlobStore, it looks like we presume users will use object stores,
> which is not good.
> Could we make it more generic? Javadoc, method names, …
This is a good point, I have updated the method names and Javadoc. I also 
thinking of renaming the class name to PayloadStoreinstead of BlobStore as blob 
store still tide to object store as well. To set some context here, I am 
proposing this after working with some community form Apache Cassandra who are 
working on Cassandra CEP 
<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar#CEP44:KafkaintegrationforCassandraCDCusingSidecar-LargeBlob>-44
 to handle large CDC and the initial thinking was let’s publish any large cdc 
to an object store instead of Kafka this why the naming was suggesting “blob 
store” only. 

> 5. It would be good to have some explanation for the purpose of each new
> interface/class, and clear javadoc for each method.

Updated the KIP with javadocs

> 6. About the BlobIdGenerator, why do we need it, could you explain more?
> Again, I thought we only need to replace value to a path, and add the
> "large-message" header, so that when consumer reads this message, it'll
> read the path from value and get the original data via BlobStore. Why do we
> need this ID generator? I think users should do the object naming when
> putting the object by themselves, not via another interface. WDYT?
In some cases generating the ID might need some smart work for example to avoid 
s3 throttling the recommended way on their doc it to create sub paths under the 
original bucket, to decide this we might hash the data to find a suitable 
sub-path. 
Here is an example of how I would generate an path for s3 file 
```
public String id(byte[] data) {
   String subFolder = topic + "-" + Utils.toPositive(Utils.murmur2(data)) % 
distributionFactor // distributionFactor is a config for the Id generator and 
it represent the max number of sub-folders under the bucket
   return subFolder + “/“ + UUID.randomUUID().toString()
}
```
Hope this example clarify a bit. However I do agree here it might not need a 
class. I have move it to be part of the store class. 

Please let me know WDYT of the final shape of the KIP now

Thanks
Omnia

> On 24 Apr 2025, at 13:31, Luke Chen <show...@gmail.com> wrote:
> 
> Hi Omnia,
> 
> Thanks for proposing this feature that many users expected.
> 
> Some comments:
> 1. It's quite interesting to see the idea of chained
> serializer/deserializer used here. I like it.
> 
> 2. It's not clear how the "retry count" comes into play in the KIP. It
> needs more explanation.
> 
> 3. What does "LargeMessageFormatter" do in the process?
> I thought all we want to do is to replace the "large value data" into a
> path, and consumers will read the path via blob store class.
> All these should be done in serializer/deserializer, so why do we need the
> formatter?
> 
> 4. In the BlobStore, it looks like we presume users will use object stores,
> which is not good.
> Could we make it more generic? Javadoc, method names, ...
> 
> 5. It would be good to have some explanation for the purpose of each new
> interface/class, and clear javadoc for each method.
> 
> 6. About the BlobIdGenerator, why do we need it, could you explain more?
> Again, I thought we only need to replace value to a path, and add the
> "large-message" header, so that when consumer reads this message, it'll
> read the path from value and get the original data via BlobStore. Why do we
> need this ID generator? I think users should do the object naming when
> putting the object by themselves, not via another interface. WDYT?
> 
> Thanks.
> Luke
> 
> 
> 
> 
> 
> On Thu, Apr 10, 2025 at 9:31 PM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
> wrote:
> 
>> Hi there I would like to start discussions on
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1159%3A+Large+message+reference+based+Serializer
>> 
>> Thanks
>> Omnia
>> 
>>

Re: [DISCUSS] KIP-1159: Large message reference based Serializer

Reply via email to