[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842466#comment-17842466
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 9:26 PM:
---------------------------------------------------------------

h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

If we truly wish to achieve independent operation of C and W, while 
minimizing the amount of work redone on the writer side, we need to introduce a 
notion of "pointer data blocks" (name TBD) and a new "retain" command block 
in Hudi's log format. 

*Pointer data blocks* 
A pointer data block simply keeps pointers to other blocks in different file 
groups. 
{code:java}
pointer data block {
   [
    {fg1, logfile X, ..},
    {fg2, logfile Y, ..},
    {fg3, logfile Z, ..}
   ]
}
{code}
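
A minimal Java sketch of what the pointer entries above might carry; the 
PointerEntry/PointerDataBlock names and fields are illustrative assumptions, not 
Hudi's actual log block classes.
{code:java}
import java.util.List;

// Illustrative sketch only -- not part of Hudi's real log block format.
class PointerEntry {
    final String fileGroupId;  // e.g. "fg1"
    final String logFilePath;  // e.g. "logfile X" written by W

    PointerEntry(String fileGroupId, String logFilePath) {
        this.fileGroupId = fileGroupId;
        this.logFilePath = logFilePath;
    }
}

class PointerDataBlock {
    // One block can point to many log files across file groups (1:N).
    final List<PointerEntry> pointers;

    PointerDataBlock(List<PointerEntry> pointers) {
        this.pointers = pointers;
    }
}
{code}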
In this approach, instead of redistributing the records from file groups 
f1, f2, f3 to f4, f5 - we will:
 * log pointer data blocks to f4, f5, pointing back to the new log files W 
wrote to file groups f1, f2, f3 (X, Y, Z in the example above).
 * log retain command blocks to f1 (with logfile X), f2 (with logfile Y), f3 
(with logfile Z) to indicate to the cleaner service that these log files are 
supposed to be retained for later, so they can be skipped.
 * When f4, f5 are compacted, these log files will be reconciled into new file 
slices in f4, f5.
 * When the file slices in f4 and f5 are cleaned, the pointers are followed and 
the retained log files in f1, f2, f3 can also be cleaned. (note: we need some 
reference counting in case f4 is cleaned while f5 is not)
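
The reference counting mentioned in the last bullet could look like the 
following minimal sketch; RetainedLogCleaner and its methods are hypothetical 
names, not existing Hudi code.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a retained log file may be deleted only after
// every pointer data block referencing it has itself been cleaned.
class RetainedLogCleaner {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // Called when a pointer data block referencing logFile is written
    // (e.g. once for f4's block and once for f5's block).
    void addReference(String logFile) {
        refCounts.merge(logFile, 1, Integer::sum);
    }

    // Called when a file slice holding a pointer block is cleaned.
    // Returns true only when no pointer block references logFile anymore,
    // i.e. it is now safe to delete the retained log file.
    boolean release(String logFile) {
        return refCounts.merge(logFile, -1, Integer::sum) == 0;
    }
}
{code}
For example, if both f4 and f5 point to logfile X, cleaning f4 alone must not 
delete X; only after f5 is also cleaned does the count drop to zero.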

Note that there is a 1:N relationship between a pointer data block and log 
files. E.g. both f4's and f5's pointer data blocks will point to all three 
files X, Y, Z. 

A snapshot read of the file groups f4, f5 needs to carefully filter records to 
avoid exposing duplicate records to the reader. Specifically, the reader:
 * merges the log files pointed to by pointer data blocks based on keys, i.e. 
only deletes/updates records in the base file for keys that match from the 
pointed log files.
 * handles inserts with care, since they need to be distinguishable from 
updates; e.g. a record in a pointed log file for f4 can be either an insert or 
an update to a record that is now clustered into f5.
 * relies on the storage format also storing a bitmap indicating which records 
are inserts vs updates in the pointed-to log files. Once such a list is 
available, the readers of f4/f5 can split the inserts amongst them, e.g. by 
using a hash mod of the insert key; i.e. f4 gets inserts 0, 2, 4, 6, ... in log 
files X, Y, Z, while f5 gets inserts 1, 3, 5, 7.
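
The hash-mod split of inserts described above can be sketched as follows; 
InsertSplitter and its method names are illustrative assumptions, not Hudi 
APIs. The key property is determinism: every reader computes the same 
assignment independently, so no insert is exposed twice.
{code:java}
// Illustrative sketch only -- not existing Hudi code.
class InsertSplitter {
    // Assign an insert record key to one of numOutputGroups clustered
    // file groups (e.g. numOutputGroups = 2 for f4, f5).
    static int assignGroup(String recordKey, int numOutputGroups) {
        return Math.floorMod(recordKey.hashCode(), numOutputGroups);
    }

    // Positional variant using the insert/update bitmap: the reader of
    // group g keeps the i-th insert iff i mod numOutputGroups == g,
    // so f4 (group 0) gets inserts 0, 2, 4, ... and f5 (group 1) gets 1, 3, 5, ...
    static boolean keepsInsert(int insertOrdinal, int groupIndex, int numOutputGroups) {
        return insertOrdinal % numOutputGroups == groupIndex;
    }
}
{code}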



h3. Pros: 
 * Truly non-blocking: C and W can go at their own pace and complete without 
much additional overhead, since the pointer blocks are pretty lightweight to 
add. 
 * Works even for high-throughput streaming scenarios where there is not much 
time for the writer to be reconciling (e.g. Flink).

h3. Cons:
 * More merge cost on the read/query side (although it can be very tolerable 
for the same cases that approach 1 works well for).
 * More complexity and new storage format changes. 


was (Author: vc):
h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

if we truly wish to achieve, independent operations of C and W, while 
minimizing the amount of work redone on the writer side, we need to introduce a 
notion of "pointer data blocks" (name TBD) in Hudi's log format. 

*Pointer data blocks* 
A pointer data block just keeps pointers to other blocks in a different file 
groups. 


{code:java}
pointer data block {
   [
    {fg1, logfileX, ..},
    {fg2, logfileY, ..},
    {fg3, logfileX, ..}
   ]
}
{code}

In this approach, instead of redistributing the records from 

 

 

> Support updates during clustering
> ---------------------------------
>
>                 Key: HUDI-1045
>                 URL: https://issues.apache.org/jira/browse/HUDI-1045
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: clustering, table-service
>            Reporter: leesf
>            Assignee: Vinoth Chandar
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> h4. We need to allow a writer w writing to file groups f1, f2, f3, 
> concurrently while a clustering service C  reclusters them into  f4, f5. 
> Goals
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.
> h4. Non-goals 
>  * Strictly maintaining the sort order achieved by clustering, in the face 
> of updates (e.g. updates change clustering field values, causing output 
> clustered file groups to not be fully sorted by those fields)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
