[
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084378#comment-17084378
]
Prashant Wason commented on HUDI-797:
-------------------------------------
The HoodieAvroUtils::rewriteRecord function has two parts:
# Conversion: Convert the record to the newSchema
# Validation: Validate that the converted record is valid as per the newSchema
(uses the GenericData.get().validate function from org.apache.avro)
I tested the time taken by the above two parts with a large schema (50 fields)
while converting 100 million records.
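The two-part structure can be sketched as below. This is a simplified stand-in, not the actual Hudi implementation: a "schema" here is just an ordered map of field name to expected Java type, standing in for an Avro Schema, and the validation loop stands in for GenericData.get().validate.

```java
import java.util.*;

public class RewriteSketch {
    // Simplified stand-in for HoodieAvroUtils::rewriteRecord (illustrative only):
    // a "schema" is an ordered map of field name -> expected Java type.
    static Map<String, Object> rewriteRecord(Map<String, Object> record,
                                             LinkedHashMap<String, Class<?>> newSchema) {
        // Part 1: conversion - copy every field of the new schema from the old record
        Map<String, Object> rewritten = new LinkedHashMap<>();
        for (String field : newSchema.keySet()) {
            rewritten.put(field, record.get(field)); // absent fields become null
        }
        // Part 2: validation - check each value against its expected type
        // (the real code delegates this to GenericData.get().validate)
        for (Map.Entry<String, Class<?>> e : newSchema.entrySet()) {
            Object v = rewritten.get(e.getKey());
            if (v != null && !e.getValue().isInstance(v)) {
                throw new IllegalStateException("Field " + e.getKey() + " has wrong type");
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Class<?>> schema = new LinkedHashMap<>();
        schema.put("_hoodie_record_key", String.class); // added metadata field
        schema.put("name", String.class);
        schema.put("age", Integer.class);

        Map<String, Object> incoming = new HashMap<>();
        incoming.put("name", "alice");
        incoming.put("age", 30);

        Map<String, Object> out = rewriteRecord(incoming, schema);
        System.out.println(out);
    }
}
```
Note that both phases walk every field of the schema per record, which is why each contributes a roughly linear per-record cost.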
Current: Test takes total 74217 msec (1.484 usec / record)
No validation: Test takes total 37168 msec (0.743 usec / record)
This shows that 50% of the time is spent in the conversion part and 50% in the
validation part. This is totally dependent on the schema, though - large,
complicated schemas (unions, records, etc.) take longer to verify due to the
increased number of fields to validate and the complexity of validating various
field types (primitive field types are trivial to check using the instanceof
operator).
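A micro-benchmark of this shape can reproduce the comparison. The sketch below is illustrative only: it uses a plain 50-field map in place of an Avro record, and its "validation" is a simple per-field instanceof check, so its absolute timings will not match the figures above.

```java
import java.util.*;

public class RewriteBenchSketch {
    // Hypothetical micro-benchmark: time n rewrites with and without the
    // validation step. The record layout (50 Integer fields) is illustrative,
    // not the original test schema.
    static long timeRewritesUsec(int n, boolean validate) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (int f = 0; f < 50; f++) record.put("field" + f, f); // 50-field record
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            Map<String, Object> copy = new LinkedHashMap<>(record); // "conversion"
            if (validate) { // "validation": type-check every field
                for (Object v : copy.values()) {
                    if (!(v instanceof Integer)) throw new IllegalStateException();
                }
            }
        }
        return (System.nanoTime() - start) / 1_000; // total elapsed usec
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long withValidation = timeRewritesUsec(n, true);
        long withoutValidation = timeRewritesUsec(n, false);
        System.out.printf("with validation: %d usec, without: %d usec%n",
                withValidation, withoutValidation);
    }
}
```
Comparing the two timings isolates the cost of the validation pass, which is the same subtraction the measurement above performs.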
> Improve performance of rewriting AVRO records in
> HoodieAvroUtils::rewriteRecord
> -------------------------------------------------------------------------------
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
>
> Data is ingested into a [HUDI|https://hudi.apache.org/] dataset as AVRO
> encoded records. These records have a
> [schema|https://avro.apache.org/docs/current/spec.html] which is determined
> by the dataset user and provided to HUDI during the writing process (as part
> of HUDIWriteConfig). The records are finally saved in
> [parquet|https://parquet.apache.org/] files which include the schema (in
> parquet format) in the footer of individual files.
>
> HUDI design requires the addition of some metadata fields to all incoming
> records to aid in book-keeping and indexing. To achieve this, the incoming
> schema needs to be modified by adding the HUDI metadata fields; the result is
> called the HUDI schema for the dataset. Each incoming record is then
> re-written to translate it from the incoming schema into the HUDI schema.
> Re-writing an incoming record to a new schema is reasonably fast, as it looks
> up all fields in the incoming record and adds them to a new record, but this
> takes place for each and every incoming record. When ingesting large datasets
> (billions of records) or a large number of datasets, even small improvements
> in the CPU-bound conversion can translate into a notable improvement in
> compute efficiency.
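The schema modification described in the issue can be sketched as follows. This is a simplified stand-in, assuming a schema is just an ordered list of field names rather than an Avro Schema; the Hudi metadata field names are the standard `_hoodie_*` columns.

```java
import java.util.*;

public class HudiSchemaSketch {
    // Illustrative only: the real code builds an Avro Schema; here a schema
    // is just an ordered list of field names.
    static List<String> addMetadataFields(List<String> incomingSchema) {
        // Standard HUDI metadata fields, prepended for book-keeping and indexing
        List<String> metaFields = Arrays.asList(
                "_hoodie_commit_time", "_hoodie_commit_seqno",
                "_hoodie_record_key", "_hoodie_partition_path",
                "_hoodie_file_name");
        List<String> hudiSchema = new ArrayList<>(metaFields);
        hudiSchema.addAll(incomingSchema); // user fields follow the metadata fields
        return hudiSchema;
    }

    public static void main(String[] args) {
        System.out.println(addMetadataFields(Arrays.asList("name", "age")));
    }
}
```
Every incoming record is then rewritten from the incoming schema into this widened schema, which is where the per-record conversion and validation cost discussed above is paid.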
--
This message was sent by Atlassian Jira
(v8.3.4#803005)