xushiyan commented on a change in pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#discussion_r796143762



##########
File path: rfc/rfc-46/rfc-46.md
##########
@@ -0,0 +1,159 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-46: Optimize Record Payload handling
+
+## Proposers
+
+- @alexeykudinkin
+
+## Approvers
+ - @vinothchandar
+ - @nsivabalan
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3217
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Avro has historically been a centerpiece of the Hudi architecture: it is the default representation that many components expect when dealing with records (during merges, column value extraction, writing into storage, etc).
+
+While having a single record representation format certainly simplifies the implementation of some components, it bears the unavoidable performance penalty of a de-/serialization loop: every record handled by Hudi has to be converted from its (low-level) engine-specific representation (`Row` for Spark, `RowData` for Flink, `ArrayWritable` for Hive) into the intermediate one (Avro), with some operations (like clustering and compaction) potentially incurring this penalty multiple times (on both the read- and write-paths).
+
+As such, the goal of this effort is to remove the need to convert from engine-specific internal representations to Avro while handling records.
+
+## Background
+
+Avro has settled in as the de-facto intermediate representation of a record's payload since the early days of Hudi.
+As the project matured and the scale of installations grew, the necessity to convert into an intermediate representation quickly became a noticeable bottleneck in the performance of critical Hudi flows.
+
+At the center of it is the hierarchy of `HoodieRecordPayload`s, which holds an individual record's payload and provides APIs like `preCombine` and `combineAndGetUpdateValue` to combine it with other records using user-defined semantics.
+
+## Implementation
+
+### Revisiting Record Classes Hierarchy
+
+To achieve the stated goal of avoiding unnecessary conversions into the intermediate representation (Avro), existing Hudi workflows operating on individual records will have to be refactored and laid out in a way that makes _no assumptions about the internal representation_ of the record, i.e. code should work with a record as an _opaque object_: exposing certain APIs to access crucial data (precombine, primary, and partition keys, etc), but not providing access to the raw payload.
+
+Restructuring existing workflows around the record being an opaque object would allow us to encapsulate the internal representation of the record within its class hierarchy, which in turn would allow us to hold engine-specific (Spark, Flink, etc) representations of records without exposing purely engine-agnostic components to them.
+
+The following (high-level) steps are proposed:
+
+1. Promote `HoodieRecord` to a standardized API for interacting with a single record, which will
+   1. Replace all accesses currently going through `HoodieRecordPayload`
+   2. Be split into an interface and engine-specific implementations (holding the internal engine-specific representation of the payload)
+   3. Implement new standardized record-level APIs (like `getPartitionKey`, `getRecordKey`, etc)
+   4. Stay an **internal** component that will **NOT** contain any user-defined semantics (like merging)
+2. Extract the Record Combining (Merge) API from `HoodieRecordPayload` into a standalone, stateless component (engine). Such a component will be
+   1. Abstracted as a stateless object providing an API to combine records (according to predefined semantics) for the engines (Spark, Flink) of interest
+   2. The plug-in point for user-defined combination semantics
+3. Gradually deprecate, phase out, and eventually remove the `HoodieRecordPayload` abstraction
+
+Phasing out `HoodieRecordPayload` will also bring the benefit of avoiding Java reflection on the hot path, which is known to perform poorly compared to non-reflection-based instantiation.
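As a hedged illustration of the two instantiation paths being contrasted (the class and method names below are hypothetical; only the JDK reflection calls are real APIs):

```java
import java.util.function.Supplier;

// Illustrative only: contrasts per-record reflective instantiation (how a
// payload class named in config would be created today) with direct,
// pre-resolved instantiation.
final class InstantiationSketch {

    // Reflection: class lookup, constructor resolution, and access checks,
    // potentially repeated on the hot path for every record.
    static Object viaReflection(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // Direct: a factory resolved once up front, then plain allocation per record.
    static <T> T viaFactory(Supplier<T> factory) {
        return factory.get();
    }
}
```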
+
+#### Combine API Engine
+
+A stateless component interface providing the Record Combining API will look like the following:
+
+```java
+interface HoodieRecordCombiningEngine {

Review comment:
       ok so here it should be 
   
   ```suggestion
   class HoodieRecordCombiningEngine {
   ```
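For illustration, a hedged sketch of how such a stateless combining engine could be shaped, shaped as a class per the suggestion above. All names below are assumptions standing in for the real API, and the stand-in record type replaces the engine-specific record classes the RFC proposes:

```java
// Illustrative sketch; names are assumptions, not the actual proposed API.
// A minimal stand-in record: real implementations would be engine-specific.
class SketchRecord {
    final String key;
    final long orderingValue;

    SketchRecord(String key, long orderingValue) {
        this.key = key;
        this.orderingValue = orderingValue;
    }
}

// Stateless combining engine: the plug-in point for user-defined merge semantics.
interface SketchCombiningEngine {
    SketchRecord combine(SketchRecord older, SketchRecord newer);
}

// One example semantic: "latest write wins" by ordering value.
class LatestWinsEngine implements SketchCombiningEngine {
    @Override
    public SketchRecord combine(SketchRecord older, SketchRecord newer) {
        return newer.orderingValue >= older.orderingValue ? newer : older;
    }
}
```

Because the engine holds no per-record state, a single instance can be shared across an entire write path.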




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

