[ 
https://issues.apache.org/jira/browse/HUDI-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Bukhner updated HUDI-8934:
-------------------------------
    Description: 
[WIP]
Inspired by RFC-84 (HUDI-8920): there is an opinion that Avro is not the best 
choice for Hudi. It requires extra ser/de operations, and not only between 
Flink operators (those will be fixed by RFC-84).

I decided to benchmark a POC version with Flink's native RowData writer for 
Hudi. It was simple enough, because Hudi already has a native RowData-to-Parquet 
writer used by append mode. I reused this writer and found two bottlenecks:

1. Hudi performs *a lot of Avro ser/de operations* in the writer runtime.
*add profiler proof*

2. Hudi stores Avro records as List<HoodieRecord>, which causes *huge GC 
pressure* in the writer runtime: in my benchmarks, garbage collection takes 
about 30% of the total Hudi writer runtime.
*add profiler proof*

As a result I reduced write time from *** to ***:
*add flink ui proofs*

I have a POC version we are already testing in our cloud environment.
h3. My config:
PC: 32 CPU, 128 GiB
Data: 60 million TPC-H lineitem table records, read from a Kafka topic with 8 
partitions with parallelism 8, to verify this opinion
Flink: 1.20, single JM + single TM
Write: Hadoop HDFS 3.3.1, 9-node cluster
Read: Kafka 2.8, 3-node cluster, 8 partitions

  was:
[WIP]
Inspired by RFC-84 [HUDI-8920]: there is an opinion that Avro is not the best 
choice for Hudi. It requires extra ser/de operations, and not only between 
Flink operators (those will be fixed by RFC-84).

I decided to benchmark a POC version with Flink's native RowData writer for 
Hudi, using 60 million TPC-H lineitem table records read from a Kafka topic 
with 8 partitions with parallelism 8, to verify this opinion. It was simple 
enough, because Hudi already has a native RowData-to-Parquet writer used by 
append mode. I reused this writer and found two bottlenecks:

1. Hudi performs *a lot of Avro ser/de operations* in the writer runtime.
*add profiler proof*

2. Hudi stores Avro records as List<HoodieRecords>, which causes *huge GC 
pressure* in the writer runtime: in my benchmarks, garbage collection takes 
about 30% of the total Hudi writer runtime.
*add profiler proof*

As a result I reduced write time from *** to ***:
*add flink ui proofs*

I have a POC version we are already testing in our cloud environment.


> [RFC-87] Avro elimination for Flink writer
> ------------------------------------------
>
>                 Key: HUDI-8934
>                 URL: https://issues.apache.org/jira/browse/HUDI-8934
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: flink, performance
>            Reporter: Mark Bukhner
>            Assignee: Mark Bukhner
>            Priority: Major
>
> [WIP]
> Inspired by RFC-84 (HUDI-8920): there is an opinion that Avro is not the best 
> choice for Hudi. It requires extra ser/de operations, and not only between 
> Flink operators (those will be fixed by RFC-84).
> I decided to benchmark a POC version with Flink's native RowData writer for 
> Hudi. It was simple enough, because Hudi already has a native 
> RowData-to-Parquet writer used by append mode. I reused this writer and found 
> two bottlenecks:
> 1. Hudi performs *a lot of Avro ser/de operations* in the writer runtime.
> *add profiler proof*
> 2. Hudi stores Avro records as List<HoodieRecord>, which causes *huge GC 
> pressure* in the writer runtime: in my benchmarks, garbage collection takes 
> about 30% of the total Hudi writer runtime.
> *add profiler proof*
> As a result I reduced write time from *** to ***:
> *add flink ui proofs*
> I have a POC version we are already testing in our cloud environment.
> h3. My config:
> PC: 32 CPU, 128 GiB
> Data: 60 million TPC-H lineitem table records, read from a Kafka topic with 8 
> partitions with parallelism 8, to verify this opinion
> Flink: 1.20, single JM + single TM
> Write: Hadoop HDFS 3.3.1, 9-node cluster
> Read: Kafka 2.8, 3-node cluster, 8 partitions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)