[
https://issues.apache.org/jira/browse/HUDI-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445150#comment-17445150
]
Vinoth Chandar commented on HUDI-2488:
--------------------------------------
>can skip adding delta commits to metadata table by looking at
>`t+30.indexing.requested` if this writer's instant time is < t+30 secs.
I get what you are suggesting here: basically, build a buffer between the writer
and the indexer. But without a lock, that `if` check above can always be
technically wrong in a purely asynchronous, distributed world, e.g. the indexing
request could land, or the writer could stall, between the check and the writer's
actual metadata write.
>what happens if there is a pending async table service < t+30 that never
>completed even after index building is over
In general, pending async table services always complete by design, i.e. they can
fail, but we retry and are resilient enough to get them completed. But the case that
you point out, i.e. a pending async table service completing its writes to the
metadata table after index building is over, is an important one to think through.
Will vet the final impl against it.
> will it be part of `CREATE INDEX` trigger while
> `timestamp.indexing.requested` is being planned out? or after the plan is
> serialized? to be specific
Has to be during plan execution, i.e. when it goes to inflight. There should not be
any changes done during the planning phase in general. Writers and the indexer may
need to coordinate, if out-of-process or multi-writer, to ensure they do it safely.
>why would an inflight writer abort itself? can you please clarify.
An inflight writer has to ensure it can write index updates to the new MDT
partition, based on the indexing plan. Otherwise it has to fail itself, since we
would lose index updates.
> how do they know what partitions in MDT to do synchronous updates to?
You need to load the timelines. The minute you have a pending indexing plan, you
know you need to start writing to those partitions too.
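A rough sketch of what that writer-side check could look like, in plain Java. The
inputs are assumptions: `partitionsUnderIndexing` stands in for whatever the
timeline scan of pending indexing plans yields, and `partitionsThisWriterCanBuild`
for the writer's own capability check; none of these names are the actual Hudi API.

    import java.util.HashSet;
    import java.util.Set;

    final class WriterIndexingGuard {

      /**
       * Decide which MDT partitions this writer must update synchronously for the
       * current commit: the already-available partitions plus any partition a
       * pending indexing plan is currently building. If the writer cannot produce
       * records for one of the in-flux partitions, it aborts rather than losing
       * index updates.
       */
      static Set<String> partitionsToUpdate(Set<String> availablePartitions,
                                            Set<String> partitionsUnderIndexing,
                                            Set<String> partitionsThisWriterCanBuild,
                                            String instantTime) {
        for (String p : partitionsUnderIndexing) {
          if (!partitionsThisWriterCanBuild.contains(p)) {
            // Inflight writer fails itself instead of silently dropping index updates.
            throw new IllegalStateException("Commit " + instantTime
                + ": cannot emit metadata updates for partition under indexing: " + p);
          }
        }
        Set<String> result = new HashSet<>(availablePartitions);
        result.addAll(partitionsUnderIndexing);
        return result;
      }
    }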
> we might have to fix compaction trigger in MDT for this purpose.
I am not sure we can't avoid that. The new or rebuilt MDT partition just has to be
ignored, that's all; but that partition cannot be compacted, of course.
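Not the actual MDT compaction trigger, just a sketch of the kind of filter it would
need; `underBootstrap` is an assumed input (e.g. derived from pending indexing plans
or the available_partitions idea described below).

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    final class MdtCompactionFilter {
      /** Keep only MDT partitions that are safe to compact, i.e. not being (re)built right now. */
      static List<String> compactablePartitions(List<String> allMdtPartitions,
                                                Set<String> underBootstrap) {
        return allMdtPartitions.stream()
            .filter(p -> !underBootstrap.contains(p)) // new/rebuilt partition is ignored, never compacted
            .collect(Collectors.toList());
      }
    }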
> With multiple partitions in flux,
We have to cross this bridge when we get there. There may be more foundational
changes needed for MDT.
> Support bootstrapping a single or more partitions in metadata table while
> regular writers and table services are in progress
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-2488
> URL: https://issues.apache.org/jira/browse/HUDI-2488
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: sivabalan narayanan
> Assignee: Vinoth Chandar
> Priority: Blocker
> Fix For: 0.10.0
>
>
> For now, we have only the FILES partition in the metadata table, and our suggestion
> is to stop all processes and then restart them one by one with the metadata table
> enabled. The first process to start back up will invoke bootstrapping of the
> metadata table.
>
> But this may not work out well as we add more and more partitions to the metadata
> table.
> We need to support bootstrapping one or more partitions in the metadata table
> while regular writers and table services are in progress.
>
>
> Penning down my thoughts/idea.
> I tried to find a way to get this done without adding an additional lock, but
> could not crack it. So, here is one way to support async bootstrap.
>
> Introduce a file called "available_partitions" at some special location under the
> metadata table. This file will contain the list of partitions that are available to
> take updates from the data table. I.e., when we do synchronous updates from the
> data table to the metadata table, and we have N partitions in the metadata table,
> we need to know which partitions are fully bootstrapped and ready to take updates;
> this file will assist in maintaining that info. We can debate how to maintain this
> info (table props, a separate file, etc.), but for now let's say this file is the
> source of truth. The idea here is that any async bootstrap process will update this
> file with the newly bootstrapped partition once its bootstrap is fully complete, so
> that all other writers know which partitions to update.
> And we need to introduce a metadata_lock as well.
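>
> A minimal sketch of how a writer could read this file, in plain Java, assuming it
> simply holds one fully bootstrapped partition name per line under the metadata
> table's .hoodie folder (the location and format are assumptions, open for debate
> as said above):
>
>     import java.io.IOException;
>     import java.nio.file.Files;
>     import java.nio.file.Path;
>     import java.nio.file.Paths;
>     import java.util.Collections;
>     import java.util.HashSet;
>     import java.util.Set;
>
>     final class AvailablePartitionsFile {
>       /** Read the set of fully bootstrapped MDT partitions; empty if the file does not exist yet. */
>       static Set<String> read(String metadataBasePath) throws IOException {
>         Path file = Paths.get(metadataBasePath, ".hoodie", "available_partitions"); // assumed location
>         if (!Files.exists(file)) {
>           return Collections.emptySet();
>         }
>         return new HashSet<>(Files.readAllLines(file)); // one partition name per line (assumed format)
>       }
>     }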
>
> Here is how the writers and the async bootstrap will pan out.
>
> Regular writer or any async table service (compaction, etc.):
>     when changes are required to be applied to the metadata table  // fyi, as of today this already happens within the data table lock
>         take metadata_lock
>         read contents of available_partitions
>         prep records and apply updates to the metadata table
>         release lock
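>
> A rough sketch of that writer-side sequence, reusing the AvailablePartitionsFile
> sketch above, with a plain java.util.concurrent lock as a stand-in for the real
> (external, cross-process) metadata_lock provider:
>
>     import java.util.Set;
>     import java.util.concurrent.locks.Lock;
>     import java.util.concurrent.locks.ReentrantLock;
>
>     final class MetadataTableWriterFlow {
>       // Stand-in for the cross-process metadata_lock; a real deployment would use an external lock provider.
>       static final Lock METADATA_LOCK = new ReentrantLock();
>
>       interface MetadataUpdater {
>         void prepAndApplyUpdates(Set<String> targetPartitions); // assumed hook into the writer
>       }
>
>       static void applyToMetadataTable(String metadataBasePath, MetadataUpdater updater) throws Exception {
>         METADATA_LOCK.lock();                                                    // take metadata_lock
>         try {
>           Set<String> available = AvailablePartitionsFile.read(metadataBasePath); // read available_partitions
>           updater.prepAndApplyUpdates(available);  // prep records and apply updates only to ready partitions
>         } finally {
>           METADATA_LOCK.unlock();                                                // release lock
>         }
>       }
>     }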
>
> Async bootstrap process:
>     start bootstrapping a given partition (e.g. files) in the metadata table
>     do it in a loop: the first iteration of bootstrap could take 10 mins, say;
>     then catching up the new commits that happened in those 10 mins could take
>     1 min; and then we go for another round. Whenever the total bootstrap time
>     for a round is ~1 min or less, the next round can be the final iteration.
>     during the final iteration, take the metadata_lock  // this lock should not be held for more than a few secs
>         apply any new commits that happened while the last iteration of bootstrap was running
>         update the "available_partitions" file with the partition that just got fully bootstrapped
>         release lock
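>
> A sketch of that catch-up loop; the BootstrapRound hook and the markAvailable
> callback are assumptions about the final impl, the 1 min threshold is the one
> described above, and the lock is the same stand-in as in the writer sketch:
>
>     import java.time.Duration;
>     import java.time.Instant;
>
>     final class AsyncBootstrapLoop {
>       interface BootstrapRound {
>         /** Apply all commits not yet reflected in the partition; returns when caught up to "now". */
>         void catchUp(String mdtPartition);
>       }
>
>       static void bootstrap(String mdtPartition, BootstrapRound round, Runnable markAvailable) {
>         while (true) {
>           Instant start = Instant.now();
>           round.catchUp(mdtPartition);                          // one catch-up iteration
>           Duration took = Duration.between(start, Instant.now());
>           if (took.compareTo(Duration.ofMinutes(1)) <= 0) {     // round took ~1 min or less
>             break;                                              // next round is the final one
>           }
>         }
>         MetadataTableWriterFlow.METADATA_LOCK.lock();           // final iteration under metadata_lock
>         try {
>           round.catchUp(mdtPartition);     // apply commits that landed during the last iteration
>           markAvailable.run();             // add this partition to "available_partitions"
>         } finally {
>           MetadataTableWriterFlow.METADATA_LOCK.unlock();       // release lock
>         }
>       }
>     }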
>
> metadata_lock: will ensure that when async bootstrap is in the final stages of
> bootstrapping, we do not miss any commits that are nearing completion. So we ought
> to take a lock to ensure we don't miss out on any commits: either the async
> bootstrap will apply the update, or the actual writer itself will update the
> partition directly once bootstrap is fully complete.
>
> Regarding "available_partitions":
> I was looking for a way to know which partitions are fully ready to take direct
> updates from regular writers, and hence chose this approach. We could also think
> about creating a temp partition (files_temp or something) while the bootstrap is in
> progress, and then renaming it to the original partition name once the bootstrap is
> fully complete. If we can reliably rename these partitions (i.e., once the files
> partition is visible, it is fully ready to take direct updates), we can take this
> route as well.
> Here is how it might pan out with folder/partition renaming.
>
> Regular writer or any async table service (compaction, etc.):
>     when changes are required to be applied to the metadata table  // fyi, as of today this already happens within the data table lock
>         take metadata_lock
>         list the partitions in the metadata table, ignoring temp partitions
>         prep records and apply updates to the metadata table
>         release lock
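>
> The only change versus the earlier writer sketch is the partition discovery step;
> a minimal version, assuming temp partitions carry a _temp suffix:
>
>     import java.util.List;
>     import java.util.stream.Collectors;
>
>     final class MdtPartitionListing {
>       /** List MDT partitions that are ready for direct updates, ignoring in-progress *_temp folders. */
>       static List<String> readyPartitions(List<String> allFolders) {
>         return allFolders.stream()
>             .filter(name -> !name.endsWith("_temp"))
>             .collect(Collectors.toList());
>       }
>     }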
>
> Async bootstrap process:
>     start bootstrapping a given partition (e.g. files) in the metadata table
>     create a temp folder for the partition that is getting bootstrapped (e.g. files_temp)
>     do it in a loop: the first iteration of bootstrap could take 10 mins, say;
>     then catching up the new commits that happened in those 10 mins could take
>     1 min; and then we go for another round. Whenever the total bootstrap time
>     for a round is ~1 min or less, the next round can be the final iteration.
>     during the final iteration, take the metadata_lock  // this lock should not be held for more than a few secs
>         apply any new commits that happened while the last iteration of bootstrap was running
>         rename files_temp to files
>         release lock
> Note: we just need to ensure that the folder rename is consistent: on a crash,
> either the new folder is fully intact or it is not available at all. The contents
> of the old folder do not matter.
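>
> A sketch of that final promote step using the Hadoop FileSystem API; whether the
> rename is actually atomic depends on the underlying storage (HDFS yes, some object
> stores no), which is exactly the consistency caveat in the note above:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     final class PartitionRename {
>       /** Promote files_temp to files once the final catch-up under metadata_lock is done. */
>       static void promote(String metadataBasePath, String partition, Configuration conf) throws Exception {
>         FileSystem fs = FileSystem.get(conf);
>         Path temp = new Path(metadataBasePath, partition + "_temp");
>         Path finalPath = new Path(metadataBasePath, partition);
>         if (!fs.rename(temp, finalPath)) {   // relies on the storage providing a consistent directory rename
>           throw new java.io.IOException("Failed to promote " + temp + " to " + finalPath);
>         }
>       }
>     }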
>
> Failures:
> a. If bootstrap failed midway, as long as "files" hasn't been created yet, we can
> delete files_temp and start all over again.
> b. If bootstrap failed just after the rename, again we should be good; just that
> the lock may not have been released, and we need to ensure the metadata lock gets
> released. So, to tackle this, if acquiring the metadata_lock from a regular writer
> fails, we will just proceed to listing partitions and applying updates.
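>
> A sketch of that fallback on the writer side, again with a plain Lock standing in
> for the real metadata_lock provider; the timeout value is illustrative:
>
>     import java.util.concurrent.TimeUnit;
>     import java.util.concurrent.locks.Lock;
>
>     final class LockWithFallback {
>       /** Try the metadata_lock briefly; if it cannot be acquired (e.g. a crashed bootstrap still
>        *  holds it), proceed to list partitions and apply updates anyway, as described above. */
>       static void updateMetadata(Lock metadataLock, Runnable listPartitionsAndApplyUpdates) throws InterruptedException {
>         boolean locked = metadataLock.tryLock(10, TimeUnit.SECONDS); // illustrative timeout
>         try {
>           listPartitionsAndApplyUpdates.run();
>         } finally {
>           if (locked) {
>             metadataLock.unlock();
>           }
>         }
>       }
>     }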
>
>
>