Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/23056 )
Change subject: KUDU-1261 introduce Flatbuffers into thirdparty
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/23056/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23056/4//COMMIT_MSG@15
PS4, Line 15: serializing and de-serializing
            : of array cells' data
> question - What was the motivation behind starting to think about an
> alternative tool (and not protobuf) for serdes-ing of array cells' data?
The idea is to avoid double ser/des-ing of the data sent over the wire in the RowOperationsPB.indirect_data field. Ideally, it would be great to avoid ser/des-ing the data at all and to be able to load the received data into memory directly (a.k.a. zero-copy) and access it right away in the run-time data structures. You can get a better sense of the motivation for not using Protobuf by looking at the perf benchmark results below.

For more context on various serialization/deserialization protocols, you might take a look at this dated, but still very informative post: https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html

You can also find more information on the Apache Arrow project on its official website and in many blogs available on the Internet.

> What about data types other than 'array'? Do those also need to be serdes-ed
> using alternative tool?
In the scope of this project, no nested types other than one-dimensional arrays of scalar types are supported. For the next iteration, I'm thinking about using Apache Arrow to represent arbitrarily nested types. However, that's a much larger project since it also involves (a) updating the internal run-time data representation that's currently used in Kudu's runtime and (b) in Dremel's terminology, introducing the so-called "definition level" in addition to the "repetition level" for the data stored on and read from disk.
This work on supporting one-dimensional arrays has just introduced some elements of the "repetition level", to some extent, by adding a new array data block type: https://gerrit.cloudera.org/#/c/22058/


http://gerrit.cloudera.org:8080/#/c/23056/4//COMMIT_MSG@15
PS4, Line 15: serializing and de-serializing
            : of array cells' data
> I see the patch is merged. Has this and my other comments been addressed?
I'm sorry: for some reason I completely missed addressing your comments. I recall starting a write-up to respond to one of them, but it seems I didn't even save the draft comment. I tend not to save a draft comment before it's complete because it might be sent as-is, incomplete, if I switch back and forth and then send the rest of the review. And maybe that was exactly when my laptop was forcibly restarted overnight due to the company's security policies.

Thank you for the reminder -- I'm posting responses to your questions with this post.


http://gerrit.cloudera.org:8080/#/c/23056/4//COMMIT_MSG@20
PS4, Line 20: Arrow IPC format for
            : serdes-ing data of nested type cells'
> question - Any gain in using Apache IPC for simple types?
I guess you meant the Apache Arrow IPC format, also known as the Arrow IPC format. If so, and if by "gain" you meant the benefits of the Arrow IPC format compared with Flatbuffers: yes, the Arrow IPC format becomes a better fit once the internal data representation in the runtime is switched from ColumnBlock and other related structures to the corresponding Arrow structures. With that, it's possible to use the Arrow framework for packing and unpacking the data into the IPC format when sending/receiving the data over the network, and in the case of Arrow the cost of ser/des is effectively zero since virtually no serialization/de-serialization is performed. The transferred data can be loaded directly into memory and accessed via the corresponding shim/facade structures, and all of that is done automatically by the Arrow framework.
And that's so regardless of data types.


http://gerrit.cloudera.org:8080/#/c/23056/4//COMMIT_MSG@27
PS4, Line 27: Flatbuffers
> I am not sure but do you think that can lead to a bigger binary size?
I guess you meant "bigger" compared to Protobuf messages? In my tests, I found Flatbuffers serialized messages to be even smaller than Protobuf serialized messages for similar arrays of integers, by a factor of around 0.7. You can see it for yourself by running the corresponding benchmark scenarios in serdes-test.cc.

As for comparisons with other mixed types, more information is readily available in public sources on the Internet. Below is one reference for your convenience: https://flatbuffers.dev/benchmarks/

Overall, Protobuf serialized messages might be smaller than similar Flatbuffers messages because Protobuf uses variable-length encoding for integers. But as for the "extra metadata" you referred to, it seems Flatbuffers adds even less than Protobuf, at least for arrays as represented in the corresponding arrays.proto and arrays.fbs IDL files in this patch.


http://gerrit.cloudera.org:8080/#/c/23056/4/src/kudu/benchmarks/serdes/serdes-test.cc
File src/kudu/benchmarks/serdes/serdes-test.cc:

http://gerrit.cloudera.org:8080/#/c/23056/4/src/kudu/benchmarks/serdes/serdes-test.cc@327
PS4, Line 327: Flatbuffers
> nit: Protobuf
Whoops, sorry -- I missed this. I'm planning to address it in a follow-up patch, along with Abhishek's feedback on unifying and parameterizing some of the test scenarios in this file.
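
On the binary-size question (PS4, Line 27): the point about variable-length integer encoding can be illustrated with a rough back-of-the-envelope sketch. This is not the benchmark from serdes-test.cc and it uses neither library; the function names are hypothetical. It only counts payload bytes for an array of non-negative 32-bit integers stored either as base-128 varints (the scheme Protobuf uses for integers) or as fixed 4-byte scalars (how Flatbuffers lays out 32-bit values). Field tags, length prefixes, vtables, and alignment padding are all ignored, so the numbers show the trend only, not real message sizes.

```python
def varint_size(value: int) -> int:
    """Bytes needed to store a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch covers non-negative values only")
    size = 1
    while value >= 0x80:  # each varint byte carries 7 bits of payload
        value >>= 7
        size += 1
    return size


def packed_varint_payload(values) -> int:
    """Payload bytes for an array stored as consecutive varints."""
    return sum(varint_size(v) for v in values)


def fixed32_payload(values) -> int:
    """Payload bytes for the same array stored as fixed 4-byte scalars."""
    return 4 * len(values)


# Hypothetical inputs chosen to show both ends of the trade-off.
small = list(range(100))                 # each value fits in one varint byte
large = [2**31 + i for i in range(100)]  # each value needs five varint bytes

for name, values in (("small", small), ("large", large)):
    print(name, packed_varint_payload(values), fixed32_payload(values))
```

With these inputs, the varint payload is 100 bytes against 400 bytes of fixed-width data for the small values, and 500 bytes against 400 for the large ones. Which format ends up smaller therefore depends on the value distribution as well as on per-message overhead, which this sketch does not model.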
-- 
To view, visit http://gerrit.cloudera.org:8080/23056
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I89c697b8d80cbbd2af4233d16806a230cedaa81a
Gerrit-Change-Number: 23056
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 05 Sep 2025 18:07:05 +0000
Gerrit-HasComments: Yes
