Prashant Wason created HUDI-797:
-----------------------------------

             Summary: Improve performance of rewriting AVRO records in 
HoodieAvroUtils::rewriteRecord
                 Key: HUDI-797
                 URL: https://issues.apache.org/jira/browse/HUDI-797
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
            Reporter: Prashant Wason


Data is ingested into a [HUDI|https://hudi.apache.org/] dataset as AVRO-encoded 
records. These records have a 
[schema|https://avro.apache.org/docs/current/spec.html] which is determined by 
the dataset user and provided to HUDI during the writing process (as part of 
HoodieWriteConfig). The records are finally saved in 
[parquet|https://parquet.apache.org/] files, which include the schema (in 
parquet format) in the footer of each individual file.

 

HUDI design requires the addition of some metadata fields to all incoming 
records to aid in book-keeping and indexing. To achieve this, the incoming 
schema is extended with the HUDI metadata fields; the result is called the HUDI 
schema for the dataset. Each incoming record is then re-written to translate it 
from the incoming schema into the HUDI schema. Re-writing a record to the new 
schema is reasonably fast, as it looks up each field in the incoming record and 
adds it to a new record. But since this takes place for each and every incoming 
record, the cumulative cost of the rewrite is significant.
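
The per-record rewrite described above can be sketched as follows. This is a 
minimal, simplified stand-in for HoodieAvroUtils::rewriteRecord that uses plain 
Java maps instead of Avro GenericRecords; the metadata field names are Hudi's 
real ones, but the rewrite logic here is an illustrative assumption, not the 
actual implementation:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RewriteSketch {
    // Hudi's metadata fields, prepended to every incoming schema.
    static final List<String> METADATA_FIELDS = List.of(
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name");

    // Translate an incoming record into the HUDI schema: every field of the
    // target schema is looked up in the source record; fields absent from the
    // incoming record (the metadata fields) default to null and are
    // populated later in the write path.
    static Map<String, Object> rewriteRecord(Map<String, Object> incoming,
                                             List<String> hudiSchemaFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : hudiSchemaFields) {
            out.put(field, incoming.getOrDefault(field, null));
        }
        return out;
    }

    public static void main(String[] args) {
        // HUDI schema = metadata fields + the user's incoming schema.
        List<String> hudiSchema = new ArrayList<>(METADATA_FIELDS);
        hudiSchema.addAll(List.of("id", "name"));

        Map<String, Object> incoming = Map.of("id", 1, "name", "alice");
        Map<String, Object> rewritten = rewriteRecord(incoming, hudiSchema);

        System.out.println(rewritten.size());                     // 7
        System.out.println(rewritten.get("name"));                // alice
        System.out.println(rewritten.get("_hoodie_commit_time")); // null
    }
}
```

Because this field-by-field copy runs once per ingested record, any constant 
work repeated inside the loop (for example, re-resolving field positions from 
the schema on every record) multiplies across billions of records.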

When ingesting large datasets (billions of records) or a large number of 
datasets, even small improvements in this CPU-bound conversion can translate 
into notable gains in compute efficiency. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)