Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23056 )

Change subject: KUDU-1261 introduce Flatbuffers into thirdparty
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/23056/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23056/4//COMMIT_MSG@15
PS4, Line 15: serializing and de-serializing
            : of array cells' data
> question - What was the motivation behind starting to think about an 
> alternative tool (and not protobuf) for serdes-ing of array cells' data?

The idea is to avoid double ser/des-ing of the data sent over the wire in the 
RowOperationsPB.indirect_data field.  Ideally, it would be great to avoid 
ser/des-ing of the data at all and being able to load the received data into 
the memory directly (a.k.a. zero-copy) and access it right away in the run-time 
data structures.

You could get more sense of the motivation for not using Protobuf by looking at 
the the perf benchmark results below.

For getting more context on various serialization/deserialization protocols, 
you might take a look at this outdated, but still very informative post: 
https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html  You could 
also find more information on the Apache Arrow project from its official 
Website and in many other blogs available on the Internet as of now.

> What about data types other than 'array'? Do those also need to be serdes-ed 
> using alternative tool?

In the scope of this project, no other nested types beyond one-dimensional 
arrays of scalar types are supported.

For next iteration, I'm thinking about using Apache Arrow for representation of 
arbitrarily nested types.  However, that's a much larger project since it also 
involves (a) updating the internal run-time data representation that's 
currently used in the Kudu's runtime (b) in the Dremel's terminology, 
introducing so-called "dimension level" in addition to the "repetition level" 
for the data stored and read from the disk.  This work on supporting 
one-dimensional arrays has just introduced some elements of the "repetition 
level" up to some extent with adding a new array data block type: 
https://gerrit.cloudera.org/#/c/22058/


http://gerrit.cloudera.org:8080/#/c/23056/4//COMMIT_MSG@15
PS4, Line 15: serializing and de-serializing
            : of array cells' data
> I see the patch is merged. Has this and my other comments been addressed?
I'm sorry: for some reason I completely missed addressing your comments.  I 
recall starting my write-up to respond to one of these, but it seems I haven't 
even saved the draft comment.  I tend not to save a particular draft comment 
before it's complete because it might be sent as-is incomplete if switching 
back-and-forth and then sending the rest of the review.  And maybe, that's was 
exactly when my laptop was forcefully restarted due to the company's security 
policies overnight.

Thank you for the reminder -- I'm posting responses to your questions with this 
post.


http://gerrit.cloudera.org:8080/#/c/23056/4//COMMIT_MSG@20
PS4, Line 20: Arrow IPC format for
            : serdes-ing data of nested type cells'
> question - Any gain in using Apache IPC for simple types?

I guess you meant Apache Arrow IPC format, also known as Arrow IPC format.

If so and if by "gain" you meant various benefits comparing Arrow IPC format 
with Flatbuffers: yes, the Arrow IPC format becomes a better fit once switching 
internal data representation in the runtime from ColumnBlock and other related 
structures into corresponding Arrow structures because with that it's possible 
to use the Arrow framework for packing and unpacking the data into the IPC 
format when sending/receiving the data over the network, and in case of Arrow 
the cost of ser/des is zero since virtually no serialization/de-serialization 
is performed.  The transferred data can be directly loaded into memory and 
accessed via corresponding shims/facades structures, and all is done 
automatically by the Arrow framework.  And that's so regardless of data types.


http://gerrit.cloudera.org:8080/#/c/23056/4//COMMIT_MSG@27
PS4, Line 27: Flatbuffers
> I am not sure but do you think that can lead to a bigger binary size?

I guess you meant "bigger" compared to Protobuf messages?

In my tests I found Flatbuffers serialized messages even smaller than ProtoBuf 
serialized messages for similar arrays of integers, with the factor of around 
0.7.  You could see it yourself if running corresponding benchmark scenarios in 
serdes-test.cc

As for comparisons for other mixed types, you could find more information on 
that readily available in the public sources in the Internet.  Below is one 
reference for your convenience:
  https://flatbuffers.dev/benchmarks/

Overall, ProtoBuf serialized messages might be smaller than similar Flatbuffers 
messages because of using variable-length encoding for integers, but as for the 
"extra metadata" you referred to, it seems Flatbuffers add even less than 
ProtoBuf at least for arrays as represented in the corresponding arrays.proto 
and arrays.fbs IDL files in this patch.


http://gerrit.cloudera.org:8080/#/c/23056/4/src/kudu/benchmarks/serdes/serdes-test.cc
File src/kudu/benchmarks/serdes/serdes-test.cc:

http://gerrit.cloudera.org:8080/#/c/23056/4/src/kudu/benchmarks/serdes/serdes-test.cc@327
PS4, Line 327: Flatbuffers
> nit: Protobuf
Whoops, sorry -- I've missed this.  I'm planning to address this in a follow-up 
patch, along with addressing Abhishek's feedback on unifying and parameterizing 
some of the test scenarios in this file.



--
To view, visit http://gerrit.cloudera.org:8080/23056
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I89c697b8d80cbbd2af4233d16806a230cedaa81a
Gerrit-Change-Number: 23056
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 05 Sep 2025 18:07:05 +0000
Gerrit-HasComments: Yes

Reply via email to