[
https://issues.apache.org/jira/browse/HUDI-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Bukhner updated HUDI-8934:
-------------------------------
Description:
[WIP]
Inspired by RFC-84 [HUDI-8920]: there is an opinion that Avro is not the best
choice for Hudi. It requires extra ser/de operations, and not only between
Flink operators (that part will be fixed by RFC-84).
To verify this opinion, I benchmarked a POC version with a native Flink
RowData writer for Hudi, ingesting 60 million TPC-H lineitem table records
read from a Kafka topic with 8 partitions at parallelism 8. It was simple
enough because Hudi already has a native RowData-to-Parquet writer used in
append mode. I reused this writer, and two bottlenecks were found:
1. Hudi performs *a lot of Avro ser/de operations* in the writer runtime (see
the first sketch after this list).
*add profiler proof*
2. Hudi buffers Avro records as List<HoodieRecord>, which causes *huge GC
pressure* in the writer runtime; in my benchmarks garbage collection takes
about 30% of the total Hudi writer runtime (see the second sketch below).
*add profiler proof*
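To make bottleneck 1 concrete, here is a minimal sketch of the two write
paths (the writeAvroToParquet / writeRowDataToParquet sinks and the schema
are hypothetical placeholders, not Hudi's actual internals):
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;

// Minimal sketch of the two write paths; the sink methods below are
// hypothetical placeholders, not Hudi's actual internals.
public class WritePathSketch {

  // Toy two-column schema standing in for the TPC-H lineitem schema.
  static final Schema AVRO_SCHEMA = SchemaBuilder.record("lineitem").fields()
      .requiredLong("l_orderkey")
      .requiredString("l_shipmode")
      .endRecord();

  // Current path: every RowData is first converted into an Avro
  // GenericRecord before it reaches the Parquet writer -- one extra
  // ser/de pass (plus a String copy) per record.
  static void avroPath(RowData row) {
    GenericRecord avro = new GenericData.Record(AVRO_SCHEMA);
    avro.put("l_orderkey", row.getLong(0));
    avro.put("l_shipmode", row.getString(1).toString()); // StringData -> String copy
    writeAvroToParquet(avro);
  }

  // Proposed path: hand the RowData to a RowData-native Parquet writer
  // directly and skip the intermediate Avro object entirely.
  static void rowDataPath(RowData row) {
    writeRowDataToParquet(row);
  }

  static void writeAvroToParquet(GenericRecord record) { /* hypothetical sink */ }
  static void writeRowDataToParquet(RowData row) { /* hypothetical sink */ }

  public static void main(String[] args) {
    RowData row = GenericRowData.of(1L, StringData.fromString("AIR"));
    avroPath(row);    // extra allocation + conversion per record
    rowDataPath(row); // no intermediate object
  }
}
{code}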
As a result, I reduced the write time from *** to ***:
*add flink ui proofs*
I have a POC version we are already testing in our cloud environment.
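And a simplified sketch of the buffering pattern behind bottleneck 2 (the
batch size and sink names are again hypothetical; Hudi's real buffering
logic is more involved):
{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.table.data.RowData;

// Simplified sketch of bottleneck 2; flushToParquet and
// writeRowDataToParquet are hypothetical placeholders.
public class BufferingSketch {

  // Current pattern: each incoming row is wrapped in a heap-allocated
  // record object (a HoodieRecord holding an Avro payload) and buffered
  // until flush. At tens of millions of records this creates a large
  // volume of short-lived objects -- the source of the ~30% GC share above.
  private final List<Object> buffer = new ArrayList<>(); // stands in for List<HoodieRecord>

  void bufferThenFlush(Object hoodieRecord) {
    buffer.add(hoodieRecord);
    if (buffer.size() >= 100_000) { // hypothetical batch size
      flushToParquet(buffer);
      buffer.clear();
    }
  }

  // Proposed pattern: push each RowData straight into the RowData-native
  // Parquet writer, which encodes into its own columnar buffers, so no
  // per-record wrapper objects accumulate on the heap between flushes.
  void streamWrite(RowData row) {
    writeRowDataToParquet(row);
  }

  void flushToParquet(List<Object> batch) { /* hypothetical sink */ }
  void writeRowDataToParquet(RowData row) { /* hypothetical sink */ }
}
{code}
The point of the RowData-native path is that the Parquet writer's own column
buffers replace millions of short-lived per-record wrapper objects.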
> [RFC-87] Avro elimination for Flink writer
> ------------------------------------------
>
> Key: HUDI-8934
> URL: https://issues.apache.org/jira/browse/HUDI-8934
> Project: Apache Hudi
> Issue Type: New Feature
> Components: flink, performance
> Reporter: Mark Bukhner
> Assignee: Mark Bukhner
> Priority: Major
>
> [WIP]
> Inspired by RFC-84 [HUDI-8920]: there is an opinion that Avro is not the
> best choice for Hudi. It requires extra ser/de operations, and not only
> between Flink operators (that part will be fixed by RFC-84).
> To verify this opinion, I benchmarked a POC version with a native Flink
> RowData writer for Hudi, ingesting 60 million TPC-H lineitem table records
> read from a Kafka topic with 8 partitions at parallelism 8. It was simple
> enough because Hudi already has a native RowData-to-Parquet writer used in
> append mode. I reused this writer, and two bottlenecks were found:
> 1. Hudi performs *a lot of Avro ser/de operations* in the writer runtime.
> *add profiler proof*
> 2. Hudi buffers Avro records as List<HoodieRecord>, which causes *huge GC
> pressure* in the writer runtime; in my benchmarks garbage collection takes
> about 30% of the total Hudi writer runtime.
> *add profiler proof*
> As a result, I reduced the write time from *** to ***:
> *add flink ui proofs*
> I have a POC version we are already testing in our cloud environment.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)