Hi, Hudi community!

I'm concerned about the current performance of Flink stream processing
with Hudi, and I'd like to propose an optimization. I've already done some
work in this direction:
https://github.com/apache/hudi/pull/12054
https://github.com/apache/hudi/pull/12104
https://github.com/apache/hudi/pull/12113
https://github.com/apache/hudi/pull/12120
(merged to master branch)
But, from my point of view, the next step should address Kryo serde
between Flink operators, and changes of this kind require discussion in
advance.

To support my proposal, I prepared a proof of concept and did performance
profiling. I need illustrations to show the results, so I used a PR in
my fork for that:
https://github.com/geserdugarov/hudi-open-source/pull/7
If you don't mind, please take a look.

In the proof of concept, I pass
Tuple2<Tuple2<BinaryStringData, BinaryStringData>, RowData>, which is
((recordKey, partition), RowData), between operators instead of
HoodieRecord. As a result, the amount of data passed between operators
decreased from 19.4 GB to 13.1 GB (32.5%), and total processing time
decreased from 247 s to 208 s (15.8%), which is really significant.
However, I convert from RowData to Avro GenericRecord twice, tried only
the simple bucket index, etc. There are many details like these to think
through.
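To make the shape of the data concrete, here is a minimal sketch of the keyed-payload structure described above, using plain-Java stand-ins (the actual proof of concept uses Flink's Tuple2, BinaryStringData, and RowData; the KeyedPayloadSketch and toKeyedPayload names here are hypothetical, for illustration only):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class KeyedPayloadSketch {

    // Stand-in for Tuple2<Tuple2<BinaryStringData, BinaryStringData>, RowData>:
    // ((recordKey, partition), rowData) -- only the fields the write path needs,
    // instead of a full HoodieRecord going through Kryo serde.
    static Map.Entry<Map.Entry<String, String>, Object> toKeyedPayload(
            String recordKey, String partition, Object rowData) {
        return new SimpleEntry<>(new SimpleEntry<>(recordKey, partition), rowData);
    }

    public static void main(String[] args) {
        Map.Entry<Map.Entry<String, String>, Object> rec =
                toKeyedPayload("id-001", "2024-10-01", "rowDataPlaceholder");
        // The inner pair carries the routing information (key and partition);
        // the value carries the row payload untouched.
        System.out.println(rec.getKey().getKey());   // id-001
        System.out.println(rec.getKey().getValue()); // 2024-10-01
    }
}
```

The point of the sketch is only to show what is (and is not) carried between operators; serializer choice and operator wiring are out of scope here.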

If you think we can try to make progress in this direction, please let me
know. And if you support this suggestion, I'm not sure whether I should
create a new RFC and start working on the design details of this
optimization, or use RFC-46 (Optimize Record Payload handling), since
RFC-46 is already really overloaded:
https://issues.apache.org/jira/browse/HUDI-3217

  --
  Best regards,
  Geser Dugarov
