[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

Vinoth Chandar (Jira) Tue, 30 Apr 2024 14:02:05 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842466#comment-17842466
 ]


Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 9:01 PM:
---------------------------------------------------------------

h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

If we truly wish to achieve independent operation of C and W while 
minimizing the amount of work redone on the writer side, we need to introduce a 
notion of "pointer data blocks" (name TBD) in Hudi's log format. 

*Pointer data blocks* 
A pointer data block just keeps pointers to other blocks in different file 
groups. 


{code:java}
pointer data block {
   [
    {fg1, logfileX, ..},
    {fg2, logfileY, ..},
    {fg3, logfileX, ..}
   ]
}
{code}
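
As a rough illustration of the layout above, a pointer data block could be modeled in memory along the following lines. This is only a sketch: the class names {{BlockPointer}} and {{PointerDataBlock}} and the helper {{logFilesFor}} are hypothetical and are not part of Hudi's log format or APIs, and the fields elided as ".." above are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical: a single pointer entry, i.e. {fileGroup, logFile, ..}.
// The fields elided as ".." in the sketch above are intentionally not modeled.
final class BlockPointer {
    final String fileGroupId;
    final String logFileName;

    BlockPointer(String fileGroupId, String logFileName) {
        this.fileGroupId = fileGroupId;
        this.logFileName = logFileName;
    }
}

// Hypothetical: a block that holds no records itself, only pointers to
// blocks that live in other file groups.
final class PointerDataBlock {
    private final List<BlockPointer> pointers;

    PointerDataBlock(List<BlockPointer> pointers) {
        // Defensive copy so the block is immutable once constructed.
        this.pointers = Collections.unmodifiableList(new ArrayList<>(pointers));
    }

    List<BlockPointer> pointers() {
        return pointers;
    }

    // Names of the log files this block points at within the given file group,
    // in the order the pointers were recorded.
    List<String> logFilesFor(String fileGroupId) {
        List<String> result = new ArrayList<>();
        for (BlockPointer p : pointers) {
            if (p.fileGroupId.equals(fileGroupId)) {
                result.add(p.logFileName);
            }
        }
        return result;
    }
}
```

With the three entries from the sketch above, {{logFilesFor("fg1")}} would return just {{logfileX}}, and a file group with no pointers would get an empty list.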

In this approach, instead of redistributing the records from 

 

 


was (Author: vc):
h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

if we truly wish to achieve, independent operations of C and W, without 

> Support updates during clustering
> ---------------------------------
>
>                 Key: HUDI-1045
>                 URL: https://issues.apache.org/jira/browse/HUDI-1045
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: clustering, table-service
>            Reporter: leesf
>            Assignee: Vinoth Chandar
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> We need to allow a writer W to write to file groups f1, f2, f3 concurrently 
> while a clustering service C reclusters them into f4, f5. 
> h4. Goals
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Needs to work regardless of, and be oblivious to, whether the writers are 
> operating in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.
> h4. Non-goals 
>  * Strictly preserving the sort order achieved by clustering in the face of 
> updates (e.g. updates that change clustering field values, causing output 
> clustering file groups to not be fully sorted by those fields)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
