[
https://issues.apache.org/jira/browse/HUDI-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Guo updated HUDI-3217:
----------------------------
Description:
These are the gaps that we need to fill for the new record merging API
* [P0] HUDI-6702 Extend the merge API to support all merging operations (inserts,
updates, and deletes, including customized getInsertValue)
** {{Option<Pair<HoodieRecord, Schema>> merge(Option<HoodieRecord> older, Schema oldSchema, Option<HoodieRecord> newer, Schema newSchema, TypedProperties props)}}
* [P0] HUDI-6765 Add a merge mode to allow differentiation of dedup logic
** Add a new merge-mode argument (pre-combine, or update) to the merge API for
customized dedup (or merging of log records?), instead of relying on
OperationModeAwareness
* [P0?] HUDI-6767 Simplify compatibility of HoodieRecord conversion
** HoodieRecordCompatibilityInterface provides adaptation among representation
types (Avro, Row, etc.)
** Guarantee one type end-to-end: Avro, or Row for Spark (RowData for Flink).
Avro log blocks need conversion from Avro to Row for Spark
* [P0] HUDI-6768 Revisit the HoodieRecord design and how it affects e2e row writing
** HoodieRecord does not merely wrap the engine-specific data structure; it also
contains Java objects to store the record key, location, etc.
** For end-to-end row writing, could we use the engine-specific type
InternalRow instead of HoodieRecord<InternalRow>, appending the key, location,
etc. as row fields, to better leverage Spark's optimizations on DataFrames of
InternalRow?
* [P0] Bug fixes
** HUDI-5807 HoodieSparkParquetReader is not appending partition-path values
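To make the extended merge semantics above concrete, here is a minimal, hypothetical sketch of how one signature can cover inserts, updates, and deletes. `KV` stands in for HoodieRecord and `Properties` for TypedProperties; neither is Hudi's actual class, the schema arguments are omitted, and the last-writer-wins policy is only an assumption for illustration.

```java
import java.util.Optional;
import java.util.Properties;

// Placeholder record type standing in for HoodieRecord (not a Hudi class).
final class KV {
    final String key;
    final String value;
    final boolean isDelete; // tombstone flag standing in for a delete record
    KV(String key, String value, boolean isDelete) {
        this.key = key;
        this.value = value;
        this.isDelete = isDelete;
    }
}

final class SketchMerger {
    // Shape mirrors Option<Pair<HoodieRecord, Schema>> merge(older, oldSchema,
    // newer, newSchema, props): an empty `older` is a pure insert (covering the
    // customized getInsertValue case), and an empty result signals a delete.
    static Optional<KV> merge(Optional<KV> older, Optional<KV> newer, Properties props) {
        if (newer.isEmpty()) {
            return older; // nothing incoming: keep whatever was there
        }
        if (newer.get().isDelete) {
            return Optional.empty(); // incoming delete drops the record
        }
        // Insert (older empty) and update both resolve to the newer record
        // under the last-writer-wins policy this sketch assumes.
        return newer;
    }
}
```

The point of the single signature is that all three operations are distinguished purely by the presence or absence of the two inputs and the result, so no separate getInsertValue entry point is needed.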
These are nice-to-haves that are not on the critical path:
* [P1] Make the merge logic engine-agnostic
** Different engines currently implement the merging logic against the
engine-specific data structure (Spark's InternalRow, Flink's RowData, etc.) in
different HoodieRecordMerger implementation classes. Providing a getField API
on HoodieRecord could allow engine-agnostic merge logic.
* [P1] HUDI-5249, HUDI-5282 Implement the MDT payload using the new merge API
** Only necessary if we use Parquet as both the base and log file format in the MDT
* [P1] HUDI-3354 Migrate existing engine-specific readers to use HoodieRecord
** As we will implement new file-group readers and writers, we do not need to
fix the existing readers now
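The getField idea above can be sketched as follows. `EngineRecord` and `MapBackedRecord` are illustrative placeholders, not Hudi classes: the assumption is only that if every engine's record exposed a field accessor, a single merger could serve Spark, Flink, etc. without touching InternalRow or RowData directly.

```java
import java.util.Map;

// Hypothetical engine-agnostic accessor: each engine would adapt its native
// record type (InternalRow, RowData, ...) behind this interface.
interface EngineRecord {
    Object getField(String name);
}

// One engine might back records with a Map; another with an array or a
// columnar row. The merger below never needs to know which.
final class MapBackedRecord implements EngineRecord {
    private final Map<String, Object> fields;
    MapBackedRecord(Map<String, Object> fields) { this.fields = fields; }
    public Object getField(String name) { return fields.get(name); }
}

final class AgnosticMerger {
    // Engine-agnostic pre-combine: keep the record with the larger ordering
    // field, using only the getField accessor.
    static EngineRecord preCombine(EngineRecord a, EngineRecord b, String orderingField) {
        long av = ((Number) a.getField(orderingField)).longValue();
        long bv = ((Number) b.getField(orderingField)).longValue();
        return av >= bv ? a : b;
    }
}
```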
— OLD PLAN —
Currently, Hudi is biased toward the assumption of a particular payload
representation (Avro). Long-term, we would like to steer away from this and
keep the record payload completely opaque, so that:
# We can keep the record payload representation engine-specific
# We avoid unnecessary serde loops (engine-specific > Avro > engine-specific >
binary)
h2. *Proposal*
*Phase 2: Revisiting Record Handling*
{_}T-shirt{_}: 2-2.5 weeks
{_}Goal{_}: Avoid tight coupling with a particular record representation on the
read path (currently Avro)
* Revisit the RecordPayload APIs
** Deprecate the {{getInsertValue}} and {{combineAndGetUpdateValue}} APIs,
replacing them with new “opaque” APIs (not returning Avro payloads)
** Rebase the RecordPayload hierarchy to be engine-specific:
*** A common engine-specific base abstracting shared functionality (Spark,
Flink, Java)
*** Each feature-specific semantic will have to be implemented for all engines
** Introduce new APIs
*** To access keys (record, partition)
*** To convert record to Avro (for BWC)
* Revisit RecordPayload handling
** In WriteHandles
*** The API will accept an opaque RecordPayload (no Avro conversion)
*** Can do (opaque) record merging if necessary
*** Passes the RecordPayload as-is to the FileWriter
** In FileWriters
*** Will accept the RecordPayload interface
*** Should be engine-specific (to handle the internal record representation)
** In RecordReaders
*** The API will provide an opaque RecordPayload (no Avro conversion)
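The opaque-payload direction in the plan above can be sketched like this. `OpaquePayload<T>` and `LwwPayload` are hypothetical names for illustration only: the key idea is that the engine-native type `T` (e.g. InternalRow for Spark, RowData for Flink) never round-trips through Avro, keys are accessible without deserialization, and merging replaces `combineAndGetUpdateValue`.

```java
import java.util.Optional;

// Hypothetical replacement for the deprecated Avro-coupled payload API.
interface OpaquePayload<T> {
    // Replaces combineAndGetUpdateValue: the merge stays in the engine-native
    // representation; an empty result signals a delete.
    Optional<OpaquePayload<T>> combine(OpaquePayload<T> older);
    String recordKey(); // proposed new API: access keys without deserializing
    T unwrap();         // engine-native value, no Avro conversion
}

// Toy payload over String as the "engine-native" type, with
// last-writer-wins semantics keyed on an event time (an assumption here,
// not a statement about Hudi's default payload).
final class LwwPayload implements OpaquePayload<String> {
    private final String key;
    private final String value;
    private final long eventTime;
    LwwPayload(String key, String value, long eventTime) {
        this.key = key;
        this.value = value;
        this.eventTime = eventTime;
    }
    public Optional<OpaquePayload<String>> combine(OpaquePayload<String> older) {
        LwwPayload o = (LwwPayload) older;
        return Optional.of(eventTime >= o.eventTime ? this : o);
    }
    public String recordKey() { return key; }
    public String unwrap() { return value; }
}
```

Under this shape, WriteHandles and FileWriters pass the payload through untouched, and only an explicit BWC hook would ever materialize Avro.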
> RFC-46: Optimize Record Payload handling
> ----------------------------------------
>
> Key: HUDI-3217
> URL: https://issues.apache.org/jira/browse/HUDI-3217
> Project: Apache Hudi
> Issue Type: Epic
> Components: storage-management, writer-core
> Reporter: Alexey Kudinkin
> Assignee: Ethan Guo
> Priority: Critical
> Labels: hudi-umbrellas, pull-request-available
> Fix For: 1.0.0
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)