[GitHub] [hudi] nsivabalan commented on a diff in pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

GitBox Mon, 02 Jan 2023 14:57:13 -0800


nsivabalan commented on code in PR #6612:
URL: https://github.com/apache/hudi/pull/6612#discussion_r1060205868



##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,100 @@
+<!--  Licensed to the Apache Software Foundation (ASF) under one or more  
contributor license agreements. See the NOTICE file distributed with  this work 
for additional information regarding copyright ownership.  The ASF licenses 
this file to You under the Apache License, Version 2.0  (the "License"); you 
may not use this file except in compliance with  the License. You may obtain a 
copy of the License at
+
+<pre class="md-fences md-end-block ty-contain-cm modeLoaded" 
spellcheck="false" lang="" cid="n2" mdtype="fences" style="box-sizing: 
border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; 
display: block; break-inside: avoid; text-align: left; white-space: normal; 
background-image: inherit; background-position: inherit; background-size: 
inherit; background-repeat: inherit; background-attachment: inherit; 
background-origin: inherit; background-clip: inherit; background-color: 
rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 
234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; 
margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; 
font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; 
letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; 
widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; 
text-decoration-thickness: initial; text-decorati
 on-style: initial; text-decoration-color: initial;"> 
http://www.apache.org/licenses/LICENSE-2.0</pre>
+
+Unless required by applicable law or agreed to in writing, software  
distributed under the License is distributed on an "AS IS" BASIS,  WITHOUT 
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the 
License for the specific language governing permissions and  limitations under 
the License.  -->
+
+# RFC-58: A more effective HoodieMergeHandler for COW table with parquet
+
+## Proposers
+
+*   @loukey-lj
+
+## Approvers
+
+*   @<approver1 github username>
+
+*   @<approver2 github username>
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4790
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+To provide a more effective HoodieMergeHandler for COW table with parquet. 
Hudi rewrite whole parquet file every COW, that costs a lot in De/serialization 
and De/compression.  To decrease this cost, a 'surgery' is introduced, which 
rebuilds a new parquet from an old one,  just copying unchanged rowGroups and 
overwriting changed rowGroups when updating parquet files.
+
+## Background
+
+*   Parquet is a columnar storage format. All data is horizontally divided 
into row groups. A row group contains the column chunks of all columns in the 
interval corresponding to this row group. A column chunk is composed of pages, 
which are compression and coding units. 
+
+*   In current version of Hudi, a complex De/serialization and De/compression 
happens every time upserting long tail data on COW, which causes giant CPU/IO 
cost.
+
+*   The purpose of current RFC aims to decrease costs of De/serialization and 
De/compression in upserting.  Try to think about the reality, if we know which 
row groups need to be updated and even more the columns need to be updated in 
these row groups, we can skip much data's de/serialization and de/compression. 
That brings giant improvement.
+
+## Implementation
+
+### How partial update works
+
+a. Add one more member variable(Integer rowGroupId) into the class 
HoodieRecordLocation.
+
+<pre class="md-fences md-end-block ty-contain-cm modeLoaded" 
spellcheck="false" lang="java" cid="n35" mdtype="fences" style="box-sizing: 
border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; 
display: block; break-inside: avoid; text-align: left; white-space: normal; 
background-image: inherit; background-position: inherit; background-size: 
inherit; background-repeat: inherit; background-attachment: inherit; 
background-origin: inherit; background-clip: inherit; background-color: 
rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 
234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; 
margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; 
font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; 
letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; 
widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; 
text-decoration-thickness: initial; text-dec
 oration-style: initial; text-decoration-color: initial;">public class 
HoodieRecordLocation implements Serializable {
+ protected String instantTime;
+ protected String fileId;
+ /**
+ * the index of key in parquet rowGroup num.
+ */
+ protected Integer rowGroupNum;
+ }</pre>
+
+b. Number of rowgroup of a Parquet starts from 0 which continously increases 
util BlockSize reaches `hoodie.parquet.block.size`.  Since every record in 
parquet belongs to a rowgroup, we can simply use parquet API to locate rowgroup 
num of new record which needs to be written into corresponding parquet file, 
and then record rowgroup num into hoodieRecordLocation of each hoodieRecord.  
HoodieRecordLocations will be collected into WriteStatus which will be updated 
to the index on batch.
+
+c. At phase of tagging index, rowgroup num will be queried out, so that they 
can be used to accelerate updating files.
+
+Concrete flow of upserting is as below:
+
+![4.jpg](4.jpg)
+
+### steps of writing a parquet file on cow
+
+1.  (upserting) data preparing
+
+    At phase of tag indexing, find out 
`HoodieRecord.currentLocation.rowGroupNum` of updating records, if rowgroup num 
is empty, record does implicitly not exists, which means current operation is a 
INSERT, otherwise DELETE or UPDATE. At next, rowgroup nums are used to make 
grouping by of the updating records so as to collect all rowgroups which should 
be updated.
+
+![1.jpg](1.jpg)
+
+2.  rowgroup updating
+
+    The process of updating rowgroup is divided into 5 steps.
+
+    1.  Deserializing and decompressing the columns which need to be combined 
and assembled into a List<Pari<rowKey,Pari<offset,record>> structure,   where 
offset represents record's row number in rowgroup(every rowgroup's row number 
starts with zero).
+
+    2.  Using HoodieRecordPayload#getInsertValue to deserialize the upserting 
data, then invoking HoodieRecordPayload#getInsertValue to combine the updating 
rows. 
+
+    3.  Converting combined data into column structure, just like 
`[{"name":"zs","age":10},{"name":"ls","age":20}] ==> 
{"name":["sz","ls"],"age":[10,20]}`

Review Comment:
   I would assume w/ impl, we will decide whether to take this path depending 
on the payload impl used. we don't want to incur additional overhead for the 
ones which may not be effective (for eg OverwriteWithLatestAvroPayload, 
DefaultHoodieRecordPayload)



##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,100 @@
+<!--  Licensed to the Apache Software Foundation (ASF) under one or more  
contributor license agreements. See the NOTICE file distributed with  this work 
for additional information regarding copyright ownership.  The ASF licenses 
this file to You under the Apache License, Version 2.0  (the "License"); you 
may not use this file except in compliance with  the License. You may obtain a 
copy of the License at
+
+<pre class="md-fences md-end-block ty-contain-cm modeLoaded" 
spellcheck="false" lang="" cid="n2" mdtype="fences" style="box-sizing: 
border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; 
display: block; break-inside: avoid; text-align: left; white-space: normal; 
background-image: inherit; background-position: inherit; background-size: 
inherit; background-repeat: inherit; background-attachment: inherit; 
background-origin: inherit; background-clip: inherit; background-color: 
rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 
234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; 
margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; 
font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; 
letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; 
widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; 
text-decoration-thickness: initial; text-decorati
 on-style: initial; text-decoration-color: initial;"> 
http://www.apache.org/licenses/LICENSE-2.0</pre>
+
+Unless required by applicable law or agreed to in writing, software  
distributed under the License is distributed on an "AS IS" BASIS,  WITHOUT 
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the 
License for the specific language governing permissions and  limitations under 
the License.  -->
+
+# RFC-58: A more effective HoodieMergeHandler for COW table with parquet
+
+## Proposers
+
+*   @loukey-lj
+
+## Approvers
+
+*   @<approver1 github username>
+
+*   @<approver2 github username>
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4790
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+To provide a more effective HoodieMergeHandler for COW table with parquet. 
Hudi rewrite whole parquet file every COW, that costs a lot in De/serialization 
and De/compression.  To decrease this cost, a 'surgery' is introduced, which 
rebuilds a new parquet from an old one,  just copying unchanged rowGroups and 
overwriting changed rowGroups when updating parquet files.
+
+## Background
+
+*   Parquet is a columnar storage format. All data is horizontally divided 
into row groups. A row group contains the column chunks of all columns in the 
interval corresponding to this row group. A column chunk is composed of pages, 
which are compression and coding units. 
+
+*   In current version of Hudi, a complex De/serialization and De/compression 
happens every time upserting long tail data on COW, which causes giant CPU/IO 
cost.
+
+*   The purpose of current RFC aims to decrease costs of De/serialization and 
De/compression in upserting.  Try to think about the reality, if we know which 
row groups need to be updated and even more the columns need to be updated in 
these row groups, we can skip much data's de/serialization and de/compression. 
That brings giant improvement.

Review Comment:
   So, this could be effective only incase of partial updates? In other words, 
for most commonly used payloads like OverwriteWithLatestAvroPayload, 
DefaultHoodieRecordPayload etc, this might cause unnecessary overhead right? 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] nsivabalan commented on a diff in pull request #6612: [RFC-58][HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

Reply via email to