Hi Luke,

>> 3. What does "LargeMessageFormatter" do in the process?
>> I thought all we want to do is to replace the "large value data" with a
>> path, and consumers will read the path via the blob store class.
>> All these should be done in the serializer/deserializer, so why do we
>> need the formatter?
>
> I wanted a bit more info than just the path to download; for example I
> want to add stuff like the class path of the original blob store, in case
> the consumer is set up with a blob store that doesn't match the one used
> during publishing.
> I have updated the KIP to simplify this by always having this as a simple
> JSON of path and publisher class, which can be represented as
> PayloadReferenceValue. WDYT?
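As a rough illustration of the "simple JSON of path and publisher class" idea quoted above, a PayloadReferenceValue could look like the following minimal sketch. The field names and JSON shape here are assumptions for illustration only; the KIP defines the actual type.

```java
import java.util.Objects;

// Hypothetical sketch of a PayloadReferenceValue: a record value carrying only
// the download path plus the store class used to publish the large payload.
// Class, field, and JSON key names are illustrative, not the KIP's API.
public class PayloadReferenceValue {
    private final String path;           // where the large payload was stored
    private final String publisherClass; // store implementation used to publish

    public PayloadReferenceValue(String path, String publisherClass) {
        this.path = Objects.requireNonNull(path);
        this.publisherClass = Objects.requireNonNull(publisherClass);
    }

    // Render the simple JSON shape described in the discussion.
    public String toJson() {
        return "{\"path\":\"" + path + "\",\"publisherClass\":\"" + publisherClass + "\"}";
    }
}
```

A consumer-side deserializer could parse this JSON, check that its configured store class matches `publisherClass`, and then fetch the payload from `path`.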
I thought of another case where having the freedom to form the reference
might be a nice feature: disaster recovery (DR). Imagine someone publishes
large messages to S3 with references in a Kafka topic, and then they want
DR. This is achievable if they have a mirrored Kafka topic containing the
references, but if S3 is unreachable from the DR backup location then the
references they have are of little use. However, if the message formatter
is customisable, a developer can implement a composite store that publishes
to two store locations and publishes both references to Kafka as one
message, so the consumer store can download from whichever bucket is
available. I think keeping the door open to such a use case might be a good
feature, although it is also a bit questionable given the latency it adds,
as we would be publishing to N stores.

Regards
Omnia

> On 24 Apr 2025, at 17:40, Omnia Ibrahim <o.g.h.ibra...@gmail.com> wrote:
>
> Hi Luke, thanks for taking the time to look into the KIP.
>
>> 2. It's not clear how the "retry count" comes into play in the KIP. It
>> needs more explanation.
>
> My initial thinking is that the retry configs are a must for all blob
> stores, so we can provide them and validate them for free for all blob
> stores, so that not every implementation has to verify them.
>
>> 3. What does "LargeMessageFormatter" do in the process?
>> I thought all we want to do is to replace the "large value data" with a
>> path, and consumers will read the path via the blob store class.
>> All these should be done in the serializer/deserializer, so why do we
>> need the formatter?
>
> I wanted a bit more info than just the path to download; for example I
> want to add stuff like the class path of the original blob store, in case
> the consumer is set up with a blob store that doesn't match the one used
> during publishing.
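The DR idea sketched above (publish the payload to two stores and carry both references in a single Kafka message) could be approximated as follows. This is a sketch under assumptions: `CompositePayloadStore` and its JSON output are hypothetical names, not anything defined in the KIP.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical composite store for the DR scenario: write the payload to every
// configured store and pack all resulting references into one JSON value, so
// the consumer can resolve whichever reference is reachable from its region.
public class CompositePayloadStore {
    // Each function uploads the payload and returns its reference path.
    private final List<Function<byte[], String>> stores;

    public CompositePayloadStore(List<Function<byte[], String>> stores) {
        this.stores = stores;
    }

    // Upload to every store; return all references as one JSON array value.
    public String write(byte[] payload) {
        StringBuilder json = new StringBuilder("{\"references\":[");
        for (int i = 0; i < stores.size(); i++) {
            if (i > 0) json.append(',');
            json.append('"').append(stores.get(i).apply(payload)).append('"');
        }
        return json.append("]}").toString();
    }
}
```

This also makes the latency concern concrete: `write` blocks on N uploads, so the publish path slows down linearly with the number of configured stores unless the uploads are parallelised.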
> I have updated the KIP to simplify this by always having this as a simple
> JSON of path and publisher class, which can be represented as
> PayloadReferenceValue. WDYT?
>
>> 4. In the BlobStore, it looks like we presume users will use object
>> stores, which is not good.
>> Could we make it more generic? Javadoc, method names, ...
>
> This is a good point; I have updated the method names and Javadoc. I am
> also thinking of renaming the class to PayloadStore instead of BlobStore,
> as "blob store" is still tied to object stores as well. To set some
> context here, I am proposing this after working with some community from
> Apache Cassandra who are working on Cassandra CEP-44
> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-44%3A+Kafka+integration+for+Cassandra+CDC+using+Sidecar#CEP44:KafkaintegrationforCassandraCDCusingSidecar-LargeBlob>
> to handle large CDC, and the initial thinking was to publish any large CDC
> to an object store instead of Kafka; this is why the naming suggested
> "blob store" only.
>
>> 5. It would be good to have some explanation for the purpose of each new
>> interface/class, and clear javadoc for each method.
>
> Updated the KIP with javadocs.
>
>> 6. About the BlobIdGenerator, why do we need it, could you explain more?
>> Again, I thought we only need to replace the value with a path and add
>> the "large-message" header, so that when the consumer reads this message,
>> it'll read the path from the value and get the original data via
>> BlobStore. Why do we need this ID generator? I think users should do the
>> object naming when putting the object by themselves, not via another
>> interface. WDYT?
>
> In some cases generating the ID might need some smart work; for example,
> to avoid S3 throttling, the recommended approach in their docs is to
> create sub-paths under the original bucket, and to decide on a sub-path we
> might hash the data to find a suitable one.
> Here is an example of how I would generate a path for an S3 file:
> ```
> public String id(byte[] data) {
>     // distributionFactor is a config for the id generator; it represents
>     // the max number of sub-folders under the bucket
>     String subFolder = topic + "-" + Utils.toPositive(Utils.murmur2(data)) % distributionFactor;
>     return subFolder + "/" + UUID.randomUUID();
> }
> ```
> Hope this example clarifies things a bit. However, I do agree it might not
> need its own class; I have moved it to be part of the store class.
>
> Please let me know WDYT of the final shape of the KIP now.
>
> Thanks
> Omnia
>
>> On 24 Apr 2025, at 13:31, Luke Chen <show...@gmail.com> wrote:
>>
>> Hi Omnia,
>>
>> Thanks for proposing this feature that many users expected.
>>
>> Some comments:
>> 1. It's quite interesting to see the idea of chained
>> serializer/deserializer used here. I like it.
>>
>> 2. It's not clear how the "retry count" comes into play in the KIP. It
>> needs more explanation.
>>
>> 3. What does "LargeMessageFormatter" do in the process?
>> I thought all we want to do is to replace the "large value data" with a
>> path, and consumers will read the path via the blob store class.
>> All these should be done in the serializer/deserializer, so why do we
>> need the formatter?
>>
>> 4. In the BlobStore, it looks like we presume users will use object
>> stores, which is not good.
>> Could we make it more generic? Javadoc, method names, ...
>>
>> 5. It would be good to have some explanation for the purpose of each new
>> interface/class, and clear javadoc for each method.
>>
>> 6. About the BlobIdGenerator, why do we need it, could you explain more?
>> Again, I thought we only need to replace the value with a path and add
>> the "large-message" header, so that when the consumer reads this message,
>> it'll read the path from the value and get the original data via
>> BlobStore. Why do we need this ID generator?
>> I think users should do the object naming when putting the object by
>> themselves, not via another interface. WDYT?
>>
>> Thanks.
>> Luke
>>
>> On Thu, Apr 10, 2025 at 9:31 PM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
>> wrote:
>>
>>> Hi there, I would like to start discussions on
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1159%3A+Large+message+reference+based+Serializer
>>>
>>> Thanks
>>> Omnia
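For readers following the thread, the producer-side flow discussed in points 3 and 6 above (swap an oversized value for a reference, signalling the swap so the consumer knows to dereference it) can be sketched roughly as below. The class name, threshold, and store callback are made-up illustrations; in the KIP this logic would live in a chained Serializer, with the signal carried as a "large-message" record header rather than a boolean.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical sketch of reference-based large-message handling: values under
// the threshold pass through untouched; larger ones are uploaded to the store
// and replaced by the reference bytes that will be published to Kafka.
public class LargeValueInterceptor {
    private final int maxInlineBytes;
    private final Function<byte[], String> store; // uploads payload, returns its path

    public LargeValueInterceptor(int maxInlineBytes, Function<byte[], String> store) {
        this.maxInlineBytes = maxInlineBytes;
        this.store = store;
    }

    // Returns the bytes to actually publish to Kafka.
    public byte[] maybeOffload(byte[] value) {
        if (value.length <= maxInlineBytes) {
            return value; // small enough: publish inline, no header needed
        }
        String path = store.apply(value); // e.g. upload to S3/GCS/HDFS
        // The real serializer would also set a "large-message" header here
        // so the consumer-side deserializer knows to fetch via the store.
        return path.getBytes(StandardCharsets.UTF_8);
    }
}
```

The consumer-side mirror image would check the header, parse the reference from the value, and download the original payload from the store before deserializing it.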