[
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842466#comment-17842466
]
Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 9:01 PM:
---------------------------------------------------------------
h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format
If we truly wish to achieve independent operation of C and W while
minimizing the amount of work redone on the writer side, we need to introduce a
notion of "pointer data blocks" (name TBD) in Hudi's log format.
*Pointer data blocks*
A pointer data block holds no records of its own; it only keeps pointers to
blocks in other file groups.
{code:java}
pointer data block {
[
{fg1, logfileX, ..},
{fg2, logfileY, ..},
{fg3, logfileX, ..}
]
}
{code}
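As a rough illustration of the structure above, here is a minimal in-memory sketch. It is not Hudi's actual log format or API: the names PointerDataBlockSketch, BlockPointer, PointerDataBlock and logFilesByFileGroup are hypothetical, and only the two fields visible in the example (file group and log file) are modeled. The elided fields ("..") are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these types exist in Hudi.
final class PointerDataBlockSketch {

    /** One pointer: a target identified by file group and log file. */
    static final class BlockPointer {
        final String fileGroupId;
        final String logFileName;

        BlockPointer(String fileGroupId, String logFileName) {
            this.fileGroupId = fileGroupId;
            this.logFileName = logFileName;
        }
    }

    /** A pointer data block carries no records itself, only pointers. */
    static final class PointerDataBlock {
        final List<BlockPointer> pointers;

        PointerDataBlock(List<BlockPointer> pointers) {
            // Defensive copy so the block is immutable once built.
            this.pointers = Collections.unmodifiableList(new ArrayList<>(pointers));
        }

        /** Groups the referenced log files by file group, in insertion order. */
        Map<String, List<String>> logFilesByFileGroup() {
            Map<String, List<String>> grouped = new LinkedHashMap<>();
            for (BlockPointer p : pointers) {
                grouped.computeIfAbsent(p.fileGroupId, k -> new ArrayList<>())
                       .add(p.logFileName);
            }
            return grouped;
        }
    }

    /** Builds the three-pointer example shown in the comment. */
    static PointerDataBlock example() {
        return new PointerDataBlock(Arrays.asList(
                new BlockPointer("fg1", "logfileX"),
                new BlockPointer("fg2", "logfileY"),
                new BlockPointer("fg3", "logfileX")));
    }
}
```

A reader resolving such a block would presumably follow each pointer to the referenced log file in its file group. How that lookup works is not specified here, so the sketch stops at grouping the pointers.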
In this approach, instead of redistributing the records from
was (Author: vc):
h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format
if we truly wish to achieve, independent operations of C and W, without
> Support updates during clustering
> ---------------------------------
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
> Issue Type: Task
> Components: clustering, table-service
> Reporter: leesf
> Assignee: Vinoth Chandar
> Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer W to write to file groups f1, f2, f3
> concurrently while a clustering service C reclusters them into f4, f5.
> h4. Goals
> * Writes can be either updates, deletes or inserts.
> * Either the clustering service C or the writer W can finish first.
> * Both W and C need to be able to complete their actions without much
> redoing of work.
> * The number of output file groups for C can be higher or lower than input
> file groups.
> * Needs to work regardless of whether the writers are operating
> in OCC or NBCC mode.
> * Needs to interplay well with cleaning and compaction services.
> h4. Non-goals
> * Strictly preserving the sort order achieved by clustering in the face of
> updates (e.g., updates that change clustering field values, leaving the output
> clustering file groups not fully sorted by those fields).