Prashant Wason created HUDI-797:
-----------------------------------
Summary: Improve performance of rewriting AVRO records in
HoodieAvroUtils::rewriteRecord
Key: HUDI-797
URL: https://issues.apache.org/jira/browse/HUDI-797
Project: Apache Hudi (incubating)
Issue Type: Improvement
Reporter: Prashant Wason
Data is ingested into a [HUDI|https://hudi.apache.org/] dataset as AVRO-encoded
records. These records have a
[schema|https://avro.apache.org/docs/current/spec.html] which is determined by
the dataset user and provided to HUDI during the writing process (as part of
HoodieWriteConfig). The records are finally saved in
[parquet|https://parquet.apache.org/] files, which include the schema (in
parquet format) in the footer of each file.
HUDI's design requires adding some metadata fields to all incoming records to
aid in book-keeping and indexing. To achieve this, the incoming schema is
extended with the HUDI metadata fields, producing what is called the HUDI
schema for the dataset. Each incoming record is then re-written to translate it
from the incoming schema into the HUDI schema. Re-writing a single record to
the new schema is reasonably fast, as it looks up each field in the incoming
record and adds it to a new record, but this work is repeated for each and
every incoming record, so the cost adds up.
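To make the per-record cost concrete, the rewrite step described above can be sketched roughly as follows. This is a simplified, dependency-free model (plain maps standing in for Avro GenericRecords, and hypothetical metadata field names), not the actual HoodieAvroUtils implementation: for every record, every field of the target (HUDI) schema is looked up by name and copied into a fresh record.

```java
import java.util.*;

public class RewriteSketch {
    // Hypothetical HUDI metadata field names, for illustration only.
    static final List<String> METADATA_FIELDS =
        Arrays.asList("_hoodie_commit_time", "_hoodie_record_key");

    // Naive per-record rewrite: for each field of the HUDI schema, look the
    // field up by name in the incoming record and copy it into a new record.
    // Metadata fields (absent from the incoming record) start out empty.
    static Map<String, Object> rewriteRecord(Map<String, Object> incoming,
                                             List<String> hudiSchemaFields) {
        Map<String, Object> rewritten = new LinkedHashMap<>();
        for (String field : hudiSchemaFields) {
            rewritten.put(field, incoming.getOrDefault(field, ""));
        }
        return rewritten;
    }

    public static void main(String[] args) {
        // HUDI schema = metadata fields + the incoming record's fields.
        List<String> hudiSchema = new ArrayList<>(METADATA_FIELDS);
        hudiSchema.addAll(Arrays.asList("id", "name"));

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 42);
        record.put("name", "alice");

        // This loop body runs once per ingested record, i.e. billions of
        // times for a large dataset, which is why the per-field lookup cost
        // matters.
        Map<String, Object> out = rewriteRecord(record, hudiSchema);
        System.out.println(out);
    }
}
```

Because the field-by-field copy is executed once per record, any constant-factor saving inside it (for example, resolving field positions once per schema instead of once per record) is multiplied by the total record count.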
When ingesting large datasets (billions of records) or a large number of
datasets, even small improvements in this CPU-bound conversion can translate
into notable gains in compute efficiency.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)