[
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298461#comment-17298461
]
Prashant Wason commented on HUDI-797:
-------------------------------------
This change did not work and I do not have any alternative.
> Improve performance of rewriting AVRO records in
> HoodieAvroUtils::rewriteRecord
> -------------------------------------------------------------------------------
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Data is ingested into a [HUDI|https://hudi.apache.org/] dataset as AVRO
> encoded records. These records have a
> [schema|https://avro.apache.org/docs/current/spec.html] which is determined
> by the dataset user and provided to HUDI during the writing process (as
> part of HUDIWriteConfig). The records are finally saved in
> [parquet|https://parquet.apache.org/] files which include the schema (in
> parquet format) in the footer of individual files.
>
> HUDI's design requires adding some metadata fields to all incoming records
> to aid in book-keeping and indexing. To achieve this, the incoming schema
> is extended with the HUDI metadata fields; the result is called the HUDI
> schema for the dataset. Each incoming record is then re-written to
> translate it from the incoming schema into the HUDI schema. Re-writing a
> single record is reasonably fast, as it simply looks up each field of the
> incoming record and adds it to a new record, but this takes place for each
> and every incoming record. When ingesting large datasets (billions of
> records) or a large number of datasets, even small improvements in this
> CPU-bound conversion translate into notable gains in compute efficiency.
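The per-record rewrite described above can be sketched as follows. This is an illustrative Python model, not Hudi's actual Java implementation (HoodieAvroUtils::rewriteRecord operates on Avro GenericRecords); the schemas and the subset of metadata field names shown are examples.

```python
# Illustrative sketch of translating a record from the incoming schema
# into the HUDI schema (incoming schema + prepended metadata fields).
# Records are modeled as plain dicts keyed by field name.

# A subset of HUDI metadata fields, shown for illustration.
HOODIE_META_FIELDS = ["_hoodie_commit_time", "_hoodie_record_key"]

def rewrite_record(record, hudi_schema):
    """Copy every field of `record` into a new record laid out per
    `hudi_schema`. Fields present in the HUDI schema but absent from the
    incoming record (the metadata fields) start as None and are filled in
    later by the writer. Note the full field-by-field copy: this lookup
    and copy happens once per ingested record, which is why it is the
    hot path this issue targets."""
    return {field: record.get(field) for field in hudi_schema}

# Example: an incoming schema extended into the HUDI schema.
incoming_schema = ["id", "ts", "value"]
hudi_schema = HOODIE_META_FIELDS + incoming_schema

rec = {"id": "r1", "ts": 100, "value": 3.14}
rewritten = rewrite_record(rec, hudi_schema)
```

Because this copy runs for every record, avoiding repeated per-field schema lookups (e.g. by caching field positions per schema pair) is where the proposed optimization would pay off.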
--
This message was sent by Atlassian Jira
(v8.3.4#803005)