[
https://issues.apache.org/jira/browse/HUDI-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Guo updated HUDI-3217:
----------------------------
Description:
These are the gaps that we need to fill for the new record merging API
* [P0] HUDI-6702 Extend the merge API to support all merging operations (inserts,
updates, and deletes, including customized getInsertValue)
** {{Option<Pair<HoodieRecord, Schema>> merge(Option<HoodieRecord> older, Schema oldSchema, Option<HoodieRecord> newer, Schema newSchema, TypedProperties props)}}
* [P0] HUDI-6765 Add a merge mode to allow differentiation of dedup logic
** Add a new merge-mode argument (pre-combine, or update) to the merge API for
customized dedup (or merging of log records?), instead of relying on
OperationModeAwareness
* [P0?] HUDI-6767 Simplify compatibility of HoodieRecord conversion
** HoodieRecordCompatibilityInterface provides adaptation among representation
types (Avro, Row, etc.)
** Guarantee one type end-to-end: Avro, or Row for Spark (RowData for Flink).
Avro log blocks need conversion from Avro to Row for Spark
* [P0] HUDI-6768 Revisit the HoodieRecord design and how it affects e2e row writing
** HoodieRecord does not merely wrap the engine-specific data structure; it also
contains Java objects to store the record key, location, etc.
** For end-to-end row writing, could we use the engine-specific type
InternalRow instead of HoodieRecord<InternalRow>, appending the key, location,
etc. as row fields, to better leverage Spark's optimizations on DataFrames of
InternalRow?
* [P0] Bug fixes
** HUDI-5807 HoodieSparkParquetReader is not appending partition-path values
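To make the extended merge semantics above concrete, here is a minimal, hypothetical sketch of how one signature can cover inserts, updates, and deletes. `KV` stands in for HoodieRecord and `Properties` for TypedProperties; neither is Hudi's actual class, the schema arguments are omitted, and the last-writer-wins policy is only an assumption for illustration.

```java
import java.util.Optional;
import java.util.Properties;

// Placeholder record type standing in for HoodieRecord (not a Hudi class).
final class KV {
    final String key;
    final String value;
    final boolean isDelete; // tombstone flag standing in for a delete record
    KV(String key, String value, boolean isDelete) {
        this.key = key;
        this.value = value;
        this.isDelete = isDelete;
    }
}

final class SketchMerger {
    // Shape mirrors Option<Pair<HoodieRecord, Schema>> merge(older, oldSchema,
    // newer, newSchema, props): an empty `older` is a pure insert (covering the
    // customized getInsertValue case), and an empty result signals a delete.
    static Optional<KV> merge(Optional<KV> older, Optional<KV> newer, Properties props) {
        if (newer.isEmpty()) {
            return older; // nothing incoming: keep whatever was there
        }
        if (newer.get().isDelete) {
            return Optional.empty(); // incoming delete drops the record
        }
        // Insert (older empty) and update both resolve to the newer record
        // under the last-writer-wins policy this sketch assumes.
        return newer;
    }
}
```

The point of the single signature is that all three operations are distinguished purely by the presence or absence of the two inputs and the result, so no separate getInsertValue entry point is needed.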
These are nice-to-haves that are not on the critical path:
* [P1] Make the merge logic engine-agnostic
** Different engines currently implement the merging logic against the
engine-specific data structure (Spark's InternalRow, Flink's RowData, etc.) in
different HoodieRecordMerger implementation classes. Providing a getField API
on HoodieRecord could allow engine-agnostic merge logic.
* [P1] HUDI-5249, HUDI-5282 Implement the MDT payload using the new merge API
** Only necessary if we use Parquet as both the base and log file format in the MDT
* [P1] HUDI-3354 Migrate existing engine-specific readers to use HoodieRecord
** As we will implement new file-group readers and writers, we do not need to
fix the existing readers now
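The getField idea above can be sketched as follows. `EngineRecord` and `MapBackedRecord` are illustrative placeholders, not Hudi classes: the assumption is only that if every engine's record exposed a field accessor, a single merger could serve Spark, Flink, etc. without touching InternalRow or RowData directly.

```java
import java.util.Map;

// Hypothetical engine-agnostic accessor: each engine would adapt its native
// record type (InternalRow, RowData, ...) behind this interface.
interface EngineRecord {
    Object getField(String name);
}

// One engine might back records with a Map; another with an array or a
// columnar row. The merger below never needs to know which.
final class MapBackedRecord implements EngineRecord {
    private final Map<String, Object> fields;
    MapBackedRecord(Map<String, Object> fields) { this.fields = fields; }
    public Object getField(String name) { return fields.get(name); }
}

final class AgnosticMerger {
    // Engine-agnostic pre-combine: keep the record with the larger ordering
    // field, using only the getField accessor.
    static EngineRecord preCombine(EngineRecord a, EngineRecord b, String orderingField) {
        long av = ((Number) a.getField(orderingField)).longValue();
        long bv = ((Number) b.getField(orderingField)).longValue();
        return av >= bv ? a : b;
    }
}
```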
— OLD PLAN —
Currently, Hudi is biased toward the assumption of a particular payload
representation (Avro). Long-term, we would like to steer away from this and
keep the record payload completely opaque, so that:
# We can keep the record payload representation engine-specific
# We avoid unnecessary serde loops (engine-specific > Avro > engine-specific >
binary)
h2. *Proposal*
*Phase 2: Revisiting Record Handling*
{_}T-shirt{_}: 2-2.5 weeks
{_}Goal{_}: Avoid tight coupling with a particular record representation on the
read path (currently Avro)
* Revisit the RecordPayload APIs
** Deprecate the {{getInsertValue}} and {{combineAndGetUpdateValue}} APIs,
replacing them with new “opaque” APIs (not returning Avro payloads)
** Rebase the RecordPayload hierarchy to be engine-specific:
*** A common engine-specific base abstracting shared functionality (Spark,
Flink, Java)
*** Each feature-specific semantic will have to be implemented for all engines
** Introduce new APIs
*** To access keys (record, partition)
*** To convert record to Avro (for BWC)
* Revisit RecordPayload handling
** In WriteHandles
*** The API will accept an opaque RecordPayload (no Avro conversion)
*** Can do (opaque) record merging if necessary
*** Passes the RecordPayload as-is to the FileWriter
** In FileWriters
*** Will accept the RecordPayload interface
*** Should be engine-specific (to handle the internal record representation)
** In RecordReaders
*** The API will provide an opaque RecordPayload (no Avro conversion)
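The opaque-payload direction in the plan above can be sketched like this. `OpaquePayload<T>` and `LwwPayload` are hypothetical names for illustration only: the key idea is that the engine-native type `T` (e.g. InternalRow for Spark, RowData for Flink) never round-trips through Avro, keys are accessible without deserialization, and merging replaces `combineAndGetUpdateValue`.

```java
import java.util.Optional;

// Hypothetical replacement for the deprecated Avro-coupled payload API.
interface OpaquePayload<T> {
    // Replaces combineAndGetUpdateValue: the merge stays in the engine-native
    // representation; an empty result signals a delete.
    Optional<OpaquePayload<T>> combine(OpaquePayload<T> older);
    String recordKey(); // proposed new API: access keys without deserializing
    T unwrap();         // engine-native value, no Avro conversion
}

// Toy payload over String as the "engine-native" type, with
// last-writer-wins semantics keyed on an event time (an assumption here,
// not a statement about Hudi's default payload).
final class LwwPayload implements OpaquePayload<String> {
    private final String key;
    private final String value;
    private final long eventTime;
    LwwPayload(String key, String value, long eventTime) {
        this.key = key;
        this.value = value;
        this.eventTime = eventTime;
    }
    public Optional<OpaquePayload<String>> combine(OpaquePayload<String> older) {
        LwwPayload o = (LwwPayload) older;
        return Optional.of(eventTime >= o.eventTime ? this : o);
    }
    public String recordKey() { return key; }
    public String unwrap() { return value; }
}
```

Under this shape, WriteHandles and FileWriters pass the payload through untouched, and only an explicit BWC hook would ever materialize Avro.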
> RFC-46: Optimize Record Payload handling
> ----------------------------------------
>
> Key: HUDI-3217
> URL: https://issues.apache.org/jira/browse/HUDI-3217
> Project: Apache Hudi
> Issue Type: Epic
> Components: storage-management, writer-core
> Reporter: Alexey Kudinkin
> Assignee: Ethan Guo
> Priority: Critical
> Labels: hudi-umbrellas, pull-request-available
> Fix For: 1.0.0
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)