[ 
https://issues.apache.org/jira/browse/HUDI-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Bukhner updated HUDI-8934:
-------------------------------
    Summary: [RFC-87] Avro elimination for Flink writer  (was: [RFC-85] Avro 
elimination for Flink writer)

> [RFC-87] Avro elimination for Flink writer
> ------------------------------------------
>
>                 Key: HUDI-8934
>                 URL: https://issues.apache.org/jira/browse/HUDI-8934
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: flink, performance
>            Reporter: Mark Bukhner
>            Assignee: Mark Bukhner
>            Priority: Major
>
> [WIP]
> Inspired by RFC-84 - 
> [[#HUDI-8920]|https://issues.apache.org/jira/browse/HUDI-8920]: there is an 
> opinion that Avro is not the best choice for Hudi. It requires extra ser/de 
> operations, and not only between Flink operators (that part will be fixed by 
> RFC-84).
> I decided to benchmark a POC version with a native Flink RowData writer for 
> Hudi to verify this opinion. This was simple enough, because Hudi already has 
> a native RowData-to-Parquet writer for append mode. I reused this writer, and 
> two bottlenecks were confirmed:
> 1. Hudi performs *a lot of Avro ser/de operations* in writer runtime.
> *add profiler proof*
> 2. Hudi stores Avro records as List<HoodieRecords>, which causes *huge GC 
> pressure* in the writer runtime; in my benchmarks, garbage collection accounts 
> for about 30% of the total Hudi writer runtime.
> *add profiler proof*
> Data used for the benchmark: the first 60 million rows of the TPC-H lineitem 
> table, read from a Kafka topic with 8 partitions, with parallelism 8.
> It shows that ...
> I have a POC version that we are already testing in our cloud environment.
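
For readers who want to sanity-check a GC-share figure like the roughly 30% quoted above, here is a minimal sketch of one way to estimate it from inside a JVM. It is not the author's profiler setup, which the description does not show. The class name GcShare, both helper methods, and the toy buffering workload are hypothetical and stand in for a real writer. Only the standard java.lang.management API is used. The number is approximate: collectors report accumulated elapsed time, and with concurrent collectors that time can overlap application work.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hudi or Flink.
class GcShare {

    // Sums the accumulated collection time (ms) reported by all collectors.
    // getCollectionTime() returns -1 when a collector does not support it.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Runs the workload and returns reported GC time divided by wall-clock time.
    static double gcShare(Runnable workload) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        workload.run();
        long wallMillis = Math.max(1, (System.nanoTime() - start) / 1_000_000);
        long gcMillis = totalGcMillis() - gcBefore;
        return (double) gcMillis / wallMillis;
    }

    public static void main(String[] args) {
        // Toy workload: buffer many short-lived row objects in a list,
        // loosely mimicking a writer that collects records before flushing.
        double share = gcShare(() -> {
            List<Object[]> buffer = new ArrayList<>();
            for (int i = 0; i < 2_000_000; i++) {
                buffer.add(new Object[] {i, "row-" + i});
                if (buffer.size() == 100_000) {
                    buffer.clear(); // simulated flush
                }
            }
        });
        System.out.printf("GC share of wall time: %.1f%%%n", share * 100);
    }
}
```

In a real Flink job the same counters are usually easier to read from the task manager's JVM metrics or a profiler; the sketch only shows where such a percentage can come from.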



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
