[
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298461#comment-17298461
]
Prashant Wason commented on HUDI-797:
-------------------------------------
This change did not work and I do not have any alternative.
> Improve performance of rewriting AVRO records in
> HoodieAvroUtils::rewriteRecord
> -------------------------------------------------------------------------------
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Data is ingested into a [HUDI|https://hudi.apache.org/] dataset as AVRO
> encoded records. These records have a
> [schema|https://avro.apache.org/docs/current/spec.html] which is determined
> by the dataset user and provided to HUDI during the writing process (as
> part of HUDIWriteConfig). The records are finally saved in
> [parquet|https://parquet.apache.org/] files which include the schema (in
> parquet format) in the footer of individual files.
>
> HUDI's design requires adding some metadata fields to all incoming records
> to aid in book-keeping and indexing. To achieve this, the incoming schema
> is extended with the HUDI metadata fields; the result is called the HUDI
> schema for the dataset. Each incoming record is then re-written to
> translate it from the incoming schema into the HUDI schema. Re-writing a
> single record is reasonably fast, as it simply looks up each field of the
> incoming record and adds it to a new record, but this takes place for each
> and every incoming record. When ingesting large datasets (billions of
> records) or a large number of datasets, even small improvements in this
> CPU-bound conversion translate into notable gains in compute efficiency.
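The per-record rewrite described above can be sketched as follows. This is an illustrative Python model, not Hudi's actual Java implementation (HoodieAvroUtils::rewriteRecord operates on Avro GenericRecords); the schemas and the subset of metadata field names shown are examples.

```python
# Illustrative sketch of translating a record from the incoming schema
# into the HUDI schema (incoming schema + prepended metadata fields).
# Records are modeled as plain dicts keyed by field name.

# A subset of HUDI metadata fields, shown for illustration.
HOODIE_META_FIELDS = ["_hoodie_commit_time", "_hoodie_record_key"]

def rewrite_record(record, hudi_schema):
    """Copy every field of `record` into a new record laid out per
    `hudi_schema`. Fields present in the HUDI schema but absent from the
    incoming record (the metadata fields) start as None and are filled in
    later by the writer. Note the full field-by-field copy: this lookup
    and copy happens once per ingested record, which is why it is the
    hot path this issue targets."""
    return {field: record.get(field) for field in hudi_schema}

# Example: an incoming schema extended into the HUDI schema.
incoming_schema = ["id", "ts", "value"]
hudi_schema = HOODIE_META_FIELDS + incoming_schema

rec = {"id": "r1", "ts": 100, "value": 3.14}
rewritten = rewrite_record(rec, hudi_schema)
```

Because this copy runs for every record, avoiding repeated per-field schema lookups (e.g. by caching field positions per schema pair) is where the proposed optimization would pay off.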
--
This message was sent by Atlassian Jira
(v8.3.4#803005)