[
https://issues.apache.org/jira/browse/HUDI-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441192#comment-17441192
]
sivabalan narayanan commented on HUDI-2488:
-------------------------------------------
The high-level approach looks good to me. A few things to consider with the
proposed design, and some suggestions. Some of the points are implementation
nuances; just dumping all my thoughts.
1. When we start `CREATE INDEX`, instead of taking the current time t, we
could use t+30 secs. That way, even a concurrent writer that started
immediately after `CREATE INDEX` was triggered can skip adding delta commits
to the metadata table, by looking at `t+30.indexing.requested`, if the
writer's instant time is < t+30 secs. Writers whose instant time is >= t+30
will do delta commits to the metadata table (will use MDT as an abbreviation
hereafter for metadata table). With this, we can ensure there won't be any
gaps and no writer will miss making updates: either index building will take
care of syncing to the MDT partition, or the writer will do synchronous
updates to the MDT partition.
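To make the fencing in point 1 concrete, here is a minimal sketch of the check, with instant times simplified to plain integers (the function and constant names are hypothetical, not Hudi APIs):

```python
# Hypothetical sketch of the t+30 fencing rule; Hudi instant times are
# really timestamp strings, simplified to integers for illustration.
INDEX_REQUEST_BUFFER_SECS = 30  # the "+30 secs" buffer from the proposal

def writer_syncs_to_mdt(writer_instant: int, create_index_time: int) -> bool:
    """A concurrent writer whose instant is earlier than t+30 skips MDT delta
    commits (index building will cover its commit); a writer at or after t+30
    must apply synchronous updates, so no commit falls into a gap."""
    indexing_instant = create_index_time + INDEX_REQUEST_BUFFER_SECS
    return writer_instant >= indexing_instant
```

So a writer that starts at t+5 is covered by index building, while one starting at t+40 writes to the MDT partition itself.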
2. Need some clarification wrt the statement "without any "holes" i.e pending
async operations prior to it". Also, in general, I want to understand what
happens if there is a pending async table service < t+30 that never completes
even after index building is over. Will index building keep waiting for it to
complete? Can you help me understand what happens here?
3. When exactly will the different file groups be instantiated for an MDT
partition? Will it be part of the `CREATE INDEX` trigger, while
`timestamp.indexing.requested` is being planned out, or after the plan is
serialized? To be specific: immediately after `CREATE INDEX`, let's say we
have a new writer which is supposed to do delta commits to the MDT partition.
Will the file groups be instantiated already, or does this writer need to do
any special handling to instantiate the file groups?
4. For regular writers, how do they know which partitions in the MDT to
update synchronously? My understanding is that a writer will list the
partitions in the MDT, then list the data timeline, and find the difference
to determine all fully bootstrapped (or built-out) partitions in the MDT, and
then go on to prep records. So, does this involve reloading the data timeline
every time a writer is about to apply updates to the MDT? Or is there a way
to avoid reloading the data timeline?
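A hedged sketch of the diffing described in point 4 (the helper name and the shape of the pending plans are assumptions for illustration, not Hudi's actual representation):

```python
# Illustrative only: derive the fully built-out MDT partitions by removing
# any partition that still appears in a pending indexing plan on the
# data timeline.
def fully_built_partitions(mdt_partitions, pending_index_plans):
    """pending_index_plans maps a pending `<t>.indexing.requested` instant
    to the set of MDT partitions that plan is still building."""
    in_flight = set()
    for partitions in pending_index_plans.values():
        in_flight |= partitions
    return set(mdt_partitions) - in_flight
```

The open question stands either way: computing the pending plans is exactly the timeline reload being asked about.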
5. When index building is in progress, I guess we can't trigger any
compaction or cleaning in the MDT, at least for the partitions being built
out. We might have to fix the compaction trigger in the MDT for this purpose.
I am talking about a scenario where index building itself takes 10 mins or
so, and many delta commits pile up in the MDT partition during this time.
6. Applying rollbacks to the MDT has to be revisited. It has some dependency
on the last compacted time in the MDT, or the base file instant time. So,
with this new approach, we have to revisit that logic.
7. Also, archival in the dataset has to be revisited. As of now, it is fenced
by the last compacted time in the MDT. With multiple partitions in flux
(let's say the FILES partition started index building at t50, the bloom
partition started index building at t100, etc.), how does archival in the
dataset get impacted? With the new approach, I see there is a chance that
archival will not be fenced anymore, but I need to think more on this.
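One possible way to keep archival fenced under the new approach, purely as a sketch of the idea and not a claim about how Hudi does or should implement it:

```python
# Speculative sketch: fence data-table archival at the earliest instant any
# in-flight index build started, so a build begun at t50 can still replay
# commits from t50 onward even after the MDT compacts past it.
def archival_fence(inflight_index_start_instants, last_mdt_compaction_instant):
    """Archival may only remove instants strictly before the returned value."""
    return min([last_mdt_compaction_instant, *inflight_index_start_instants])
```

With no index builds in flight, this degenerates to today's behavior of fencing by the last MDT compaction.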
> Support bootstrapping a single or more partitions in metadata table while
> regular writers and table services are in progress
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-2488
> URL: https://issues.apache.org/jira/browse/HUDI-2488
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: sivabalan narayanan
> Assignee: Vinoth Chandar
> Priority: Blocker
> Fix For: 0.10.0
>
>
> For now, we have only the FILES partition in the metadata table, and our
> suggestion is to stop all processes and then restart them one by one with
> the metadata table enabled. The first process to start back up will invoke
> bootstrapping of the metadata table.
>
> But this may not work out well as we add more and more partitions to the
> metadata table.
> We need to support bootstrapping one or more partitions in the metadata
> table while regular writers and table services are in progress.
>
>
> Penning down my thoughts/idea.
> I tried to find a way to get this done w/o adding an additional lock, but
> could not crack that. So, here is one way to support async bootstrap.
>
> Introduce a file called "available_partitions" in some special location
> under the metadata table. This file will contain the list of partitions
> that are available to apply updates from the data table. i.e., when we do
> synchronous updates from the data table to the metadata table, and we have
> N partitions in the metadata table, we need to know which partitions are
> fully bootstrapped and ready to take updates; this file will assist in
> maintaining that info. We can debate how to maintain this info (tbl props,
> a separate file, etc., but for now let's say this file is the source of
> truth). The idea here is that any async bootstrap process will update this
> file with the new partition once its bootstrap is fully complete, so that
> all other writers will know which partitions to update.
> And we need to introduce a metadata_lock as well.
>
> here is how writers and async bootstrap will pan out.
>
> Regular writer or any async table service (compaction, etc.):
> When changes are required to be applied to the metadata table: // fyi, as
> of today this already happens within the data table lock.
> Take the metadata_lock.
> Read the contents of "available_partitions".
> Prep records and apply updates to the metadata table.
> Release the lock.
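The writer path above could be sketched like this, with Python's `threading.Lock` standing in for Hudi's lock provider; `read_available_partitions`, `prep_records`, and `apply_to_mdt` are hypothetical stand-ins, not real Hudi calls:

```python
import threading

# Illustrative stand-in for the proposed metadata_lock.
metadata_lock = threading.Lock()

def writer_sync_to_mdt(read_available_partitions, prep_records, apply_to_mdt):
    with metadata_lock:                           # take metadata_lock
        partitions = read_available_partitions()  # read "available_partitions"
        records = prep_records(partitions)        # prep records for built partitions only
        apply_to_mdt(records)                     # apply updates to the MDT
        return records                            # lock released when the block exits
```

The point of the lock is only to make "read the partition list" and "apply updates" one atomic step relative to a bootstrap publishing a new partition.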
>
> Async bootstrap process:
> Start bootstrapping of a given partition (e.g. files) in the metadata
> table.
> Do it in a loop: the first iteration of bootstrap could take 10 mins, for
> example; then catch up the new commits that happened during those 10 mins,
> which could take 1 min; and then go for another loop.
> Whenever the total bootstrap time for a round is ~1 min or less, in the
> next round we can go in for the final iteration.
> During the final iteration, take the metadata_lock. // this lock
> should not be held for more than a few secs.
> Apply any new commits that happened while the last iteration
> of bootstrap was happening.
> Update the "available_partitions" file with the partition
> that got fully bootstrapped.
> Release the lock.
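The catch-up loop above, sketched with hypothetical helpers (`run_catchup_round` is assumed to apply everything since the previous round and return how long that round took, in seconds):

```python
import threading

metadata_lock = threading.Lock()
FINAL_ROUND_THRESHOLD_SECS = 60  # the "~1 min or less" cutoff from the proposal

def bootstrap_partition(run_catchup_round, mark_partition_available):
    # Keep catching up until one full round fits in ~1 minute of work.
    while run_catchup_round() > FINAL_ROUND_THRESHOLD_SECS:
        pass
    # Final round under the lock: commits racing with the last round are
    # applied either here or synchronously by their writer, never lost.
    with metadata_lock:
        run_catchup_round()
        mark_partition_available()  # e.g. update "available_partitions"
```

Because each round only replays the backlog of the previous one, round durations shrink until the locked final round is short.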
>
> metadata_lock: will ensure that when async bootstrap is in the final stages
> of bootstrapping, we do not miss any commits that are nearing completion.
> So, we ought to take a lock to ensure we don't miss out on any commits:
> either async bootstrap will apply the update, or the actual writer itself
> will update directly if bootstrap is fully complete.
>
> Rgdn "available_partitions":
> I was looking for a way to know which partitions are fully ready to take
> direct updates from regular writers, and hence chose this way. We can also
> think about creating a temp partition (files_temp or something) while
> bootstrap is in progress and then renaming it to the original partition
> name once bootstrap is fully complete. If we can ensure reliable renaming
> of these partitions (i.e., once the files partition is available, it is
> fully ready to take direct updates), we can take this route as well.
> Here is how it might pan out w/ folder/partition renaming.
>
> Regular writer or any async table service (compaction, etc.):
> When changes are required to be applied to the metadata table: // fyi, as
> of today this already happens within the data table lock.
> Take the metadata_lock.
> List the partitions in the metadata table, ignoring temp partitions.
> Prep records and apply updates to the metadata table.
> Release the lock.
>
> Async bootstrap process:
> Start bootstrapping of a given partition (e.g. files) in the metadata
> table.
> Create a temp folder for the partition that's getting bootstrapped (e.g.
> files_temp).
> Do it in a loop: the first iteration of bootstrap could take 10 mins, for
> example; then catch up the new commits that happened during those 10 mins,
> which could take 1 min; and then go for another loop.
> Whenever the total bootstrap time for a round is ~1 min or less, in the
> next round we can go in for the final iteration.
> During the final iteration, take the metadata_lock. // this lock
> should not be held for more than a few secs.
> Apply any new commits that happened while the last iteration
> of bootstrap was happening.
> Rename files_temp to files.
> Release the lock.
> Note: we just need to ensure that folder renaming is consistent. On crash,
> either the new folder is fully intact or not available; the contents of the
> old folder do not matter.
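A sketch of the publish step in the rename-based variant, using a local filesystem rename for illustration (a directory rename is atomic on POSIX filesystems; whether the actual storage offers the same guarantee is exactly the "reliable renaming" question raised above):

```python
import os

def publish_partition(mdt_root: str, partition: str) -> str:
    """Rename <partition>_temp to <partition>, so readers see either the old
    state or the fully built partition, never a half-built one."""
    tmp = os.path.join(mdt_root, partition + "_temp")
    final = os.path.join(mdt_root, partition)
    # ... bootstrap has already written all file groups under `tmp` ...
    os.rename(tmp, final)  # single atomic publish step
    return final
```

Until this rename runs, writers listing partitions simply never see files_temp, which is why the temp partition can be ignored in the writer path.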
>
> Failures:
> a. If bootstrap failed midway, as long as "files" hasn't been created, we
> can delete files_temp and start all over again.
> b. If bootstrap failed just after the rename, again we should be good,
> except that the lock may not have been released. We need to ensure the
> metadata lock is released. So, to tackle this, if acquiring the
> metadata_lock from a regular writer fails, we will just proceed to listing
> partitions and applying updates.
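Failure (b) above suggests a try-with-timeout on the writer side; a sketch, again with Python's lock as a stand-in for Hudi's lock provider:

```python
import threading

def writer_apply_with_fallback(metadata_lock, list_partitions, apply_updates,
                               timeout_secs=5.0):
    """If a crashed bootstrap left the lock held, a writer must not block
    forever: after the timeout it proceeds to list partitions and apply
    updates anyway, as described above."""
    acquired = metadata_lock.acquire(timeout=timeout_secs)
    try:
        apply_updates(list_partitions())
    finally:
        if acquired:
            metadata_lock.release()
```

This is safe in case (b) precisely because the rename already published the partition, so listing partitions sees the finished state even without the lock.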
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)