[
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084380#comment-17084380
]
Prashant Wason commented on HUDI-797:
-------------------------------------
The conversion step uses a HashMap to find each field in the incoming record
by its field name. HashMap lookups are slow relative to array index lookups.
One potential improvement is to look up fields by their integer position in
the AVRO schema.
How will this work:
----------------------
Assume the incoming schema has 2 fields. On parsing this schema, the Schema
object will have 2 fields at positions 0 and 1. When we rewrite the schema to
create the HUDI schema
([HoodieAvroUtils::addMetadataFields()|https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java#L103])
we add some extra fields, so the resulting schema object may no longer have
the 2 incoming fields at positions 0 and 1. This is the reason we need to look
up the fields by their name.
Suppose that during addMetadataFields() we ensure the incoming schema fields
retain their positions (by appending the HUDI metadata fields at the end).
Then we can look up fields by their position and assign them by their position
too.
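The scheme above can be sketched in plain Java. Avro's GenericRecord already exposes both get(String) and get(int), so once the incoming fields keep positions 0..N-1 in the HUDI schema, the rewrite loop can copy by index. The sketch below is illustrative only; the record and schema types are stand-in arrays and maps, not Avro or Hudi classes, and the field names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: rewriting a record from an incoming schema into a
// wider "HUDI" schema that appends metadata fields at the end.
public class PositionalRewrite {
    // Name-based rewrite (current approach): one HashMap lookup per field
    // per record. Metadata fields absent from the incoming record stay null.
    static Object[] rewriteByName(Map<String, Object> incoming,
                                  String[] hudiFieldNames) {
        Object[] out = new Object[hudiFieldNames.length];
        for (int i = 0; i < hudiFieldNames.length; i++) {
            out[i] = incoming.get(hudiFieldNames[i]);
        }
        return out;
    }

    // Position-based rewrite (proposed approach): valid only if the incoming
    // fields retain positions 0..N-1 in the HUDI schema, i.e. the metadata
    // fields were appended at the end. No hashing, just an array copy.
    static Object[] rewriteByPosition(Object[] incoming, int hudiFieldCount) {
        Object[] out = new Object[hudiFieldCount];
        for (int i = 0; i < incoming.length; i++) {
            out[i] = incoming[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical 2-field incoming record plus one appended metadata field.
        Map<String, Object> byName = new HashMap<>();
        byName.put("id", 42L);
        byName.put("name", "row-1");
        String[] hudiFields = {"id", "name", "_hoodie_commit_time"};
        System.out.println(java.util.Arrays.toString(
                rewriteByName(byName, hudiFields)));

        Object[] byPos = {42L, "row-1"};
        System.out.println(java.util.Arrays.toString(
                rewriteByPosition(byPos, 3)));
    }
}
```

Both variants produce the same record; the positional variant just replaces a per-field hash lookup with an array index, which is what makes the per-record rewrite cheaper.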
> Improve performance of rewriting AVRO records in
> HoodieAvroUtils::rewriteRecord
> -------------------------------------------------------------------------------
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
>
> Data is ingested into a [HUDI|https://hudi.apache.org/] dataset as AVRO
> encoded records. These records have a
> [schema|https://avro.apache.org/docs/current/spec.html] which is determined
> by the dataset user and provided to HUDI during the writing process (as part
> of HUDIWriteConfig). The records are finally saved in
> [parquet|https://parquet.apache.org/] files, which include the schema (in
> parquet format) in the footer of individual files.
>
> HUDI's design requires the addition of some metadata fields to all incoming
> records to aid in book-keeping and indexing. To achieve this, the incoming
> schema is modified by adding the HUDI metadata fields; the result is called
> the HUDI schema for the dataset. Each incoming record is then re-written to
> translate it from the incoming schema into the HUDI schema. Re-writing a
> record to the new schema is reasonably fast, as it looks up all fields in the
> incoming record and adds them to a new record, but this takes place for each
> and every incoming record.
> When ingesting large datasets (billions of records) or a large number of
> datasets, even small improvements in this CPU-bound conversion can translate
> into notable improvements in compute efficiency.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)