[
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842465#comment-17842465
]
Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 8:27 PM:
---------------------------------------------------------------
h2. [WIP] Approach 1 : Redistribute records from the conflicting file groups
Within the finalize section (executed under a table-level distributed lock), we
could have either W or C perform the following.
{code:java}
W {
- identify the file groups that have been clustered concurrently by C
- read out all records that W wrote into these conflicting file groups
- redistribute those records according to the new record distribution produced by C
- finalize W
}
{code}
{code:java}
C {
- identify the file groups that have been written to concurrently by W
- read out all records that each such W wrote into the conflicting file groups
- redistribute those records according to the new record distribution produced by C
- finalize C
}
{code}
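The redistribution step shared by both variants can be sketched roughly as follows. This is only an illustration under simplifying assumptions: records are reduced to their keys, file groups to string ids, and C's new layout to a function from record key to target file group. The class and method names (RedistributeSketch, redistribute, newLayout) are hypothetical and are not Hudi APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch of "redistribute records from the conflicting file groups".
// None of these names exist in Hudi; they only illustrate the steps listed above.
class RedistributeSketch {

    /**
     * @param writtenByW  file group id -> record keys W wrote into that file group
     * @param replacedByC ids of the file groups that C clustered concurrently
     * @param newLayout   assumed stand-in for C's new record distribution:
     *                    maps a record key to the file group it now belongs to
     * @return file group id -> record keys after redistribution
     */
    static Map<String, List<String>> redistribute(
            Map<String, List<String>> writtenByW,
            Set<String> replacedByC,
            Function<String, String> newLayout) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : writtenByW.entrySet()) {
            String fileGroup = entry.getKey();
            if (!replacedByC.contains(fileGroup)) {
                // No conflict: W's output for this file group is kept as is.
                result.computeIfAbsent(fileGroup, k -> new ArrayList<>())
                      .addAll(entry.getValue());
                continue;
            }
            // Conflict: move each record to the file group chosen by C's layout.
            for (String recordKey : entry.getValue()) {
                String target = newLayout.apply(recordKey);
                result.computeIfAbsent(target, k -> new ArrayList<>()).add(recordKey);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> writtenByW = new HashMap<>();
        writtenByW.put("f1", List.of("a", "m")); // f1 was reclustered by C
        writtenByW.put("f9", List.of("z"));      // f9 was not touched by C
        // Assumed layout: keys sorting before "h" go to f4, the rest to f5.
        Function<String, String> newLayout = k -> k.compareTo("h") < 0 ? "f4" : "f5";
        System.out.println(redistribute(writtenByW, Set.of("f1"), newLayout));
    }
}
```

In this toy model, the work done for a conflicting file group is proportional to the number of records W wrote into it, which is why the approach looks cheapest when the overlap between C and W is small.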
h3. Pros:
# Simple to understand/debug, no storage format changes.
# Could work well for cases where the overlap between C and W is rather small.
# No extra read amplification for queries; W or C absorbs the cost.
h3. Cons:
# Can be quite wasteful with continuous writers or with high overlap between C
and W, effectively forcing the entire write to be redone (the same as the
writer failing and retrying, as happens today).
# Particularly expensive for CoW, where W has already paid the cost of merging
incoming records into columnar base files.
> Support updates during clustering
> ---------------------------------
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
> Issue Type: Task
> Components: clustering, table-service
> Reporter: leesf
> Assignee: Vinoth Chandar
> Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer W to write to file groups f1, f2, f3 concurrently
> while a clustering service C reclusters them into f4, f5.
> * Writes can be either updates, deletes or inserts.
> * Either clustering C or the writer W can finish first
> * Both W and C need to be able to complete their actions without much
> redoing of work.
> * The number of output file groups for C can be higher or lower than input
> file groups.
> * Needs to work regardless of whether the writers are operating in OCC or
> NBCC mode.
> * Needs to interplay well with cleaning and compaction services.