[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

Vinoth Chandar (Jira) Tue, 30 Apr 2024 13:28:05 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842465#comment-17842465
 ]


Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 8:27 PM:
---------------------------------------------------------------

h2. [WIP] Approach 1: Redistribute records from the conflicting file groups

Within the finalize section (executed while holding a table-level distributed 
lock), either W or C could perform the following.
{code:java}
W {
  - identify the file groups that C has clustered concurrently
  - read out all records that W wrote into these conflicting file groups
  - redistribute those records according to the new record distribution produced by C
  - finalize W
}
{code}
 
{code:java}
C {
  - identify the file groups that W has written to concurrently
  - read out all records that such a W wrote into the conflicting file groups
  - redistribute those records according to the new record distribution produced by C
  - finalize C
}
{code}
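The redistribute step shared by both variants can be sketched roughly as 
follows. This is only an illustration under simplifying assumptions, not Hudi 
code: file groups are modelled as plain string ids, records as their keys, and 
the names RedistributeSketch, redistribute, writtenByW, replacedByC and 
newLayout are all hypothetical. In particular, newLayout stands in for 
whatever record-to-file-group mapping C's clustering plan actually produces.
{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

class RedistributeSketch {

    /**
     * Re-routes the records W wrote into file groups that C has replaced.
     *
     * @param writtenByW  file group id -> keys of the records W wrote into it
     * @param replacedByC ids of the file groups C clustered (its inputs)
     * @param newLayout   hypothetical mapping: record key -> file group under C's output
     * @return file group id -> record keys after redistribution
     */
    static Map<String, List<String>> redistribute(
            Map<String, List<String>> writtenByW,
            Set<String> replacedByC,
            Function<String, String> newLayout) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : writtenByW.entrySet()) {
            String oldFileGroup = entry.getKey();
            boolean conflicting = replacedByC.contains(oldFileGroup);
            for (String recordKey : entry.getValue()) {
                // Only records in conflicting file groups move; the rest stay put.
                String target = conflicting ? newLayout.apply(recordKey) : oldFileGroup;
                result.computeIfAbsent(target, k -> new ArrayList<>()).add(recordKey);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> writtenByW = new LinkedHashMap<>();
        writtenByW.put("f1", List.of("a", "m")); // f1 was clustered by C
        writtenByW.put("f9", List.of("z"));      // f9 was not touched by C
        Set<String> replacedByC = Set.of("f1", "f2", "f3");
        // Assumed layout: keys before "h" go to f4, all others to f5.
        Function<String, String> newLayout = key -> key.compareTo("h") < 0 ? "f4" : "f5";
        // prints {f4=[a], f5=[m], f9=[z]}
        System.out.println(redistribute(writtenByW, replacedByC, newLayout));
    }
}
{code}
Whether W or C runs this step only changes who pays for the re-read and 
re-write; the routing itself is the same in both cases.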
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where the overlap between C and W is rather small.
 # No extra read amplification for queries; W/C absorbs the cost. 

h3. Cons: 
 # Can be quite wasteful with continuous writers or with high overlap between C 
and W, since effectively the entire write is redone (same as the writer 
failing and retrying, as happens today).
 # Particularly expensive for CoW, where W has already paid the cost of merging 
the columnar base files with the incoming records. 
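
To make "overlap between C and W" concrete, one possible measure is the 
fraction of W's records that landed in file groups replaced by C. The sketch 
below is purely illustrative: OverlapSketch and conflictingFraction are 
hypothetical names, and the per-file-group record counts are assumed to be 
available from W's write statistics.
{code:java}
import java.util.Map;
import java.util.Set;

class OverlapSketch {

    /**
     * Fraction of W's records that would have to be redistributed.
     *
     * @param recordsWrittenByW file group id -> number of records W wrote into it
     * @param replacedByC       ids of the file groups C clustered (its inputs)
     */
    static double conflictingFraction(Map<String, Long> recordsWrittenByW,
                                      Set<String> replacedByC) {
        long total = 0;
        long conflicting = 0;
        for (Map.Entry<String, Long> entry : recordsWrittenByW.entrySet()) {
            total += entry.getValue();
            if (replacedByC.contains(entry.getKey())) {
                conflicting += entry.getValue();
            }
        }
        // An empty write has nothing to redistribute.
        return total == 0 ? 0.0 : (double) conflicting / total;
    }

    public static void main(String[] args) {
        Map<String, Long> written = Map.of("f1", 30L, "f2", 20L, "f9", 50L);
        // f1 and f2 conflict: 50 of 100 records, so this prints 0.5
        System.out.println(conflictingFraction(written, Set.of("f1", "f2", "f3")));
    }
}
{code}
A value near 0 corresponds to the small-overlap case listed under Pros; a 
value near 1 means nearly the whole write is redone.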



> Support updates during clustering
> ---------------------------------
>
>                 Key: HUDI-1045
>                 URL: https://issues.apache.org/jira/browse/HUDI-1045
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: clustering, table-service
>            Reporter: leesf
>            Assignee: Vinoth Chandar
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> We need to allow a writer W to write to file groups f1, f2, f3 concurrently 
> while a clustering service C reclusters them into f4, f5. 
>  * Writes can be updates, deletes or inserts. 
>  * Either the clustering C or the writer W can finish first.
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than the 
> number of input file groups. 
>  * Needs to work across, and be oblivious to, whether the writers are 
> operating in OCC or NBCC mode.
>  * Needs to interplay well with the cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
