nsivabalan commented on code in PR #11555: URL: https://github.com/apache/hudi/pull/11555#discussion_r1749333846
########## rfc/rfc-79/rfc-79.md: ########## @@ -0,0 +1,116 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-79: Improving reliability of concurrent table service executions and rollbacks + +## Proposers + +- @kbuci +- @suryaprasanna +- @nsivabalan + +## Approvers + +## Status + +JIRA: HUDI-7946 + + +## Abstract +In order to improve latency/throughput of writes into a HUDI dataset, HUDI does not require that table service operations (such as clustering and compaction) be serially and sequentially performed before/after an ingestion write. Instead, by enabling HUDI multiwriter and async table service execution, a user can orchesterate seperate writers to potentially execute table service plans concurrently to an ingestion writers. Even though HUDI has multi writer capability since 0.7, these multiwriter semantics focused on ensuring correct execution of concurrent ingestion writers; not interleaving operations of ingestion writes, table services mad rollbacks. To expand the capability of HUDI to assist users to manage several table services for thousands of tables, we might need add support to ensure concurrent workers can reliably execute multiple table services for a dataset. 
+Once we have this support, it should be feasible to build orchestration for all HUDI tables in a data-lake using a centralized framework. This RFC proposes to add such capability, where any table service worker can perform table services safely alongside other table service workers. + + +## Background +### Multiwriter and rollbacks +HUDI supports a transaction manager, which allows a job to take a table-level exclusive lock. Additionally, HUDI supports a heartbeating mechanism https://hudi.apache.org/docs/next/rollbacks/#heartbeats , where an ingestion writer will generate a heartbeat while writing an instant, and the heartbeat will be cleaned up or expired if the instant commits or fails. If the HUDI Clean operation detects an incomplete instant with a non-active heartbeat, it will perform a rollback of said instant. A rollback of https://hudi.apache.org/docs/next/rollbacks/#rolling-back-of-partially-failed-commits-w-multi-writers of a failed ingestion instant involves scheduling a rollback plan (if one doesn't exist already) for the instant and executing it, deleting any data files written, and then finally removing the instant's files in the active timeline. Creating a plan and following this ordering ensures that even if a writer failed while performing a rollback, the next attempt will correctly resume the rollback. This RFC will explore applying some of these semantics to table service operations. Review Comment: Here is a suggestion to rewrite this entire paragraph: ``` When multi-writer support was added to Hudi, we had to introduce certain constructs in code to support the feature. A transaction manager was added to assist with acquiring and releasing locks. A heartbeat mechanism was introduced to track the liveness of a transaction. If the heartbeat has expired and the commit of interest is not completed, we can deduce that the commit has failed. If not, the commit of interest could still be making progress when viewed through the lens of a different writer. 
Rollbacks in single-writer mode are straightforward, as any pending instant in the timeline is deduced to have failed and is eligible to be rolled back. In the case of multi-writers, the heartbeat emitted by the actual writer is used for health-check purposes. Hudi detects all incomplete instants with a non-active heartbeat and performs rollbacks of all such (heartbeat-expired) instants. A rollback (https://hudi.apache.org/docs/next/rollbacks/#rolling-back-of-partially-failed-commits-w-multi-writers) of a failed ingestion instant involves scheduling a rollback plan (if one doesn't exist already) for the instant and executing it, deleting any data files written, and then finally removing the instant's files in the active timeline. Creating a plan and following this ordering ensures that even if a writer crashed mid-way during a rollback, the next attempt will correctly resume the rollback and take it to completion. This RFC will explore applying some of these semantics to table service operations. ``` Again, feel free to add/edit as you see fit. ########## rfc/rfc-79/rfc-79-2.md: ########## @@ -0,0 +1,99 @@ +w<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> +# Add support for cancellable table service plans + +## Proposers + + +## Approvers + +## Status + +JIRA: HUDI-7946 + + +## Abstract +Table service plans can delay ingestion writes from updating a dataset with recent data if potential write conflicts are detected. Furthermore, a table service plan that isn't executed to completion for a large amount of time (due to repeated failures, application misconfiguration, or insufficient resources) will degrade the read/write performance of a dataset due to delaying clean, archival, and metadata table compaction. This is because currently HUDI table service plans, upon being scheduled, must be executed to completion. And additonally will prevent any ingestion write targeting the same files from succeeding (due to posing as a write conflict) as well as prevent any new table service plan from targeting the same files. Enabling a user to configure a table service plan as "cancellable" can prevent frequent or repeatedly failing table service plans from delaying ingestion. Support for cancellable plans will provide HUDI an avenue to fully cancel a table service plan and allow o ther table service to proceed. + + +## Background +### Execution of table services +The table service operations compact and cluster are by default "immutable" plans, meaning that once a plan is scheduled it will stay as as a pending instant until a caller invokes the table service execute API on the table service instant and sucessfully completes it. Specifically, if an inflight execution fails after transitioning the instant to inflight, the next execution attempt will implictly create and execute a rollback plan (which will delete all new instant/data files), but will keep the table service plan. This process will repeat until the instant is completed. 
The below visualization captures these transitions at a high level + + + +## Clean and rollback of failed writes +The clean table service, in addition to performing a clean action, is responsible for rolling back any failed ingestion writes (non-clustering/non-compaction inflight instants that are not being executed by a writer). This means that table services plans are not currently subject to clean. As detailed below, this proposal for supporting cancellable table service will require enabling clean be capable of targeting table service plans. + +## Goals +### (1) A cancellable plan should be pre-empted by other writers +The current requirement of HUDI needing to execute a table service plan to completion forces ingestion writers to abort a commit if a table service plan is conflicting. Becuase an ingestion writer typically determines the exact file groups it will be updating/replacing after building a workload profile and performing record tagging, the writer may have already spent a lot of time and resources before realizing that it needs to abort. In the face of frequent table service plans or an old inflight plan, this will cause delays in adding recent upstream records to the dataset as well as unecessairly take away resources (such as Spark executors in the case of Spark engine) from other applications in the data lake. A cancellable table service plan should avoid this situation by preventing itself from being comitted if a conflicting ingestion job has been comitted already. In conjunction, any ingestion writer or non-cancellable table service writer should be able to infer that a conflictin g inflight table service plan is cancellable, and therefore can be ignored when attempting to commit the instant. + +### (2) An inflight cancellable plan should be automatically cleaned up +Another consequence of this existing table service flow is that a table service plan cannot be subject to clean's rollback of failed writes. 
Clean typically performs a rollback of inflight instants that are no longer being progressed by a writer (and have an inactive heartbeat). Because table service plans needed to be executed to completion and don't have an active heartbeat these inflight plans cannot be subject to this cleanup. Because an inflight plan remaining on the timeline can degrade performance of reads/writes (as mentioned earlier), a cancellable table service plan should be elligible to be targeted for cleanup if HUDI clean deems that it has remaining inflight for too long (or some other critera). Note that a failed table service should still be able to be safely cleaned up immeditaley - the goal here is just to make sure an inflight plan won't stay on the timeline for an unbounded amount of time but also won't be likely to be prematurely cleaned up by clean before it ha s a chance to be executed. + +## Design +### Enabling a plan to be pre-emptable +To satisfy goal (1), a new config flag "cancellable" can be added to a table service plan. A writer that intends to schedule a cancellable table service plan can enable the flag in the serialized plan metadata. Any writer executing the plan can infer that the plan is cancellable, and when trying to commit the instant should abort if it detects that any ingestion write or table service plan (without cancellable config flag) is targeting the same file groups. On the other side, the commit finalization flow for ingestion writers can be updated to ignore any inflight table service plans if they are cancellable. +For the purpose of this design proposal, consider an ingestion job as having three steps: +1. Schedule itself on the timeline with a new instant time in a .requested file +2. Process/record tag incoming records, build a workload profile, and write the updating/replaced file groups to a "inflight" instant file on the timeline. Check for conflicts and abort if needed. +3. 
Perform write conflict checks and commit the instant on the timeline + +The aforementioned changes to ingestion and table service flow will ensure that in the event of a conflicting ingestion and cancellable table service writer, the ingestion job will take precedence unless the table service job was completed before (2). Since in this scenario the ingestion job will see that a completed instant (a cancellable table service action) conflicts with its ongoing inflight write, and therefore it would not be legal to proceed. Unfourtatnely this means that this design cannot compeletly guarantee that ingestion job will always take precedence. But future enhancements/hueristics can be explored to descrease the chance of this scenario, such as +* Have the ingestion writer write a "hint" of possible partitions it might affect in the .requested file, and the cancellable table service writer can check that before commiting the table service plan +* If the cancellable table service writer sees that there is a .requested file for an ingestion action, it can try to wait some time for the .inflight to appear before performing write reconcilation checks Review Comment: I don't think we can go w/ this solution. This is not bounded and is only probabilistic. ########## rfc/rfc-79/rfc-79.md: ########## @@ -0,0 +1,116 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. 
You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-79: Improving reliability of concurrent table service executions and rollbacks + +## Proposers + +- @kbuci +- @suryaprasanna +- @nsivabalan + +## Approvers + +## Status + +JIRA: HUDI-7946 + + +## Abstract Review Comment: Here is another way to write the abstract. Feel free to add/edit as you see fit. ``` Users are always looking for ways to reduce write latency and increase throughput to achieve near-real-time ingestion. Hudi introduced multi-writers in 0.7.0 to support concurrent writes and improve the throughput of the Hudi writer for non-overlapping writers. But the major focus was on multiple ingestion writers, and it is recommended to delegate table services to just one of the writers to avoid contention. As always, we are looking to push boundaries and see whether multiple table service writers can operate safely and reliably from different processes, to optimize the table for better file sizing and better query latencies. Once we have this support, it should be feasible to build orchestration for all HUDI tables in a data-lake, using a centralized framework. This RFC proposes to add such capability, where any table service worker can perform table services safely alongside other table service workers. ``` Do you think this is concise and conveys the intent of the RFC? ########## rfc/rfc-79/rfc-79.md: ########## @@ -0,0 +1,116 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. 
See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-79: Improving reliability of concurrent table service executions and rollbacks + +## Proposers + +- @kbuci +- @suryaprasanna +- @nsivabalan + +## Approvers + +## Status + +JIRA: HUDI-7946 + + +## Abstract +In order to improve latency/throughput of writes into a HUDI dataset, HUDI does not require that table service operations (such as clustering and compaction) be serially and sequentially performed before/after an ingestion write. Instead, by enabling HUDI multiwriter and async table service execution, a user can orchesterate seperate writers to potentially execute table service plans concurrently to an ingestion writers. Even though HUDI has multi writer capability since 0.7, these multiwriter semantics focused on ensuring correct execution of concurrent ingestion writers; not interleaving operations of ingestion writes, table services mad rollbacks. To expand the capability of HUDI to assist users to manage several table services for thousands of tables, we might need add support to ensure concurrent workers can reliably execute multiple table services for a dataset. +Once we have this support, it should be feasible to build orchestration for all HUDI tables in a data-lake using a centralized framework. 
This RFC proposes to add such capability, where any table service worker can perform table services safely alongside other table service workers. + + +## Background +### Multiwriter and rollbacks +HUDI supports a transaction manager, which allows a job to take a table-level exclusive lock. Additionally, HUDI supports a heartbeating mechanism https://hudi.apache.org/docs/next/rollbacks/#heartbeats , where an ingestion writer will generate a heartbeat while writing an instant, and the heartbeat will be cleaned up or expired if the instant commits or fails. If the HUDI Clean operation detects an incomplete instant with a non-active heartbeat, it will perform a rollback of said instant. A rollback of https://hudi.apache.org/docs/next/rollbacks/#rolling-back-of-partially-failed-commits-w-multi-writers of a failed ingestion instant involves scheduling a rollback plan (if one doesn't exist already) for the instant and executing it, deleting any data files written, and then finally removing the instant's files in the active timeline. Creating a plan and following this ordering ensures that even if a writer failed while performing a rollback, the next attempt will correctly resume the rollback. This RFC will explore applying some of these semantics to table service operations. + +### Execution of table services Review Comment: Here is my version of the content. ``` Recap of objective: Allow multiple writers to schedule and execute table services, either using native Hudi writers with inline deployment models or using a separate orchestrator that manages scheduling and execution of table services for N Hudi tables. Challenges: As of now, all table service plans in Apache Hudi are immutable in nature. In other words, once a plan is generated and serialized to disk, it has to be taken to completion and can't be aborted w/o executing an admin operation (like hudi cli). But we have another RFC in draft which aims to support mutable/cancellable table service plans. 
So, let's first take a look at the challenges in supporting multiple table service executions for immutable plans; later we can discuss mutable/cancellable ones. Immutable table service plans: The happy path involves generating a plan and serializing it to a requested instant on the timeline, then executing the plan by moving the state to inflight, and finally wrapping up the table service by moving it to the completed state in the timeline. If an execution fails after transitioning to inflight, the next execution attempt will roll back the first attempt (which will delete all data files created by the first attempt), but will keep the table service plan. The execution will then be re-attempted. This process will repeat until the table service is complete. The below visualization captures these transitions at a high level. For immutable plans, the challenge in supporting multiple concurrent executions of a table service is deducing whether some other writer is currently executing or rolling back a generated plan. Without heartbeats, we can never know if there is a concurrent writer working on a scheduled table service instant. So, all we need to do here is introduce heartbeats during the execution of table services. A concurrent writer, on detecting an active heartbeat, will bail out knowing that some other writer is working on the given table service instant. If there is no active heartbeat, the current writer acquires the lock, starts emitting heartbeats, and proceeds to execute the table service. This ensures that at most one writer can execute a given table service plan at any point in time. Obviously, the table-level lock has to be held while checking for heartbeat expiry and while starting to emit heartbeats, to guard the critical section. So, by adding heartbeats to table service executions, we can support multiple writers executing table services reliably. ``` If possible, we can draw some diagrams to illustrate the failure scenarios. 
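To make the heartbeat-guarded claim in the comment above concrete, here is a minimal in-memory sketch. It is only an illustration under stated assumptions: `HeartbeatRegistry`, `try_claim`, `beat`, `stop`, and `execute_table_service` are hypothetical names and not Hudi APIs, a `threading.Lock` stands in for the table-level lock, and a dictionary stands in for heartbeat files. The point it tries to show is that the expiry check and the start of heartbeating happen inside one critical section, so at most one writer can claim a given table service instant.

```python
import threading
import time
from typing import Callable, Dict, Optional


class HeartbeatRegistry:
    """Toy stand-in for heartbeat files plus the table-level lock (hypothetical)."""

    def __init__(self, expiry_secs: float) -> None:
        self.expiry_secs = expiry_secs
        self._table_lock = threading.Lock()     # assumption: models the table-level lock
        self._last_beat: Dict[str, float] = {}  # instant time -> last heartbeat timestamp

    def _is_active(self, instant: str, now: float) -> bool:
        last = self._last_beat.get(instant)
        return last is not None and (now - last) < self.expiry_secs

    def try_claim(self, instant: str, now: Optional[float] = None) -> bool:
        """Check for a live heartbeat and start heartbeating in one critical section."""
        now = time.monotonic() if now is None else now
        with self._table_lock:
            if self._is_active(instant, now):
                return False                 # some other writer is working on this instant
            self._last_beat[instant] = now   # first heartbeat marks the claim
            return True

    def beat(self, instant: str, now: Optional[float] = None) -> None:
        """Refresh the heartbeat while the claimed instant is being executed."""
        self._last_beat[instant] = time.monotonic() if now is None else now

    def stop(self, instant: str) -> None:
        """Remove the heartbeat once execution finishes or fails."""
        self._last_beat.pop(instant, None)


def execute_table_service(
    registry: HeartbeatRegistry,
    instant: str,
    run_plan: Callable[[str], None],
    now: Optional[float] = None,
) -> str:
    """Bail out if another writer holds a live heartbeat, otherwise run the plan."""
    if not registry.try_claim(instant, now):
        return "skipped"
    try:
        # Rolling back a previous failed attempt and re-executing would go here.
        run_plan(instant)
        return "completed"
    finally:
        registry.stop(instant)
```

A real implementation would also need a background thread calling something like `beat` periodically, and would store heartbeats durably so that writers in other processes can see them; both are left out of this sketch.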
########## rfc/rfc-79/rfc-79-2.md: ########## @@ -0,0 +1,99 @@ +w<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# Add support for cancellable table service plans + +## Proposers + + +## Approvers + +## Status + +JIRA: HUDI-7946 + + +## Abstract +Table service plans can delay ingestion writes from updating a dataset with recent data if potential write conflicts are detected. Furthermore, a table service plan that isn't executed to completion for a large amount of time (due to repeated failures, application misconfiguration, or insufficient resources) will degrade the read/write performance of a dataset due to delaying clean, archival, and metadata table compaction. This is because currently HUDI table service plans, upon being scheduled, must be executed to completion. And additonally will prevent any ingestion write targeting the same files from succeeding (due to posing as a write conflict) as well as prevent any new table service plan from targeting the same files. Enabling a user to configure a table service plan as "cancellable" can prevent frequent or repeatedly failing table service plans from delaying ingestion. 
Support for cancellable plans will provide HUDI an avenue to fully cancel a table service plan and allow o ther table service to proceed. + + +## Background +### Execution of table services +The table service operations compact and cluster are by default "immutable" plans, meaning that once a plan is scheduled it will stay as as a pending instant until a caller invokes the table service execute API on the table service instant and sucessfully completes it. Specifically, if an inflight execution fails after transitioning the instant to inflight, the next execution attempt will implictly create and execute a rollback plan (which will delete all new instant/data files), but will keep the table service plan. This process will repeat until the instant is completed. The below visualization captures these transitions at a high level + + + +## Clean and rollback of failed writes +The clean table service, in addition to performing a clean action, is responsible for rolling back any failed ingestion writes (non-clustering/non-compaction inflight instants that are not being executed by a writer). This means that table services plans are not currently subject to clean. As detailed below, this proposal for supporting cancellable table service will require enabling clean be capable of targeting table service plans. + +## Goals +### (1) A cancellable plan should be pre-empted by other writers +The current requirement of HUDI needing to execute a table service plan to completion forces ingestion writers to abort a commit if a table service plan is conflicting. Becuase an ingestion writer typically determines the exact file groups it will be updating/replacing after building a workload profile and performing record tagging, the writer may have already spent a lot of time and resources before realizing that it needs to abort. 
In the face of frequent table service plans or an old inflight plan, this will cause delays in adding recent upstream records to the dataset as well as unecessairly take away resources (such as Spark executors in the case of Spark engine) from other applications in the data lake. A cancellable table service plan should avoid this situation by preventing itself from being comitted if a conflicting ingestion job has been comitted already. In conjunction, any ingestion writer or non-cancellable table service writer should be able to infer that a conflictin g inflight table service plan is cancellable, and therefore can be ignored when attempting to commit the instant. + +### (2) An inflight cancellable plan should be automatically cleaned up +Another consequence of this existing table service flow is that a table service plan cannot be subject to clean's rollback of failed writes. Clean typically performs a rollback of inflight instants that are no longer being progressed by a writer (and have an inactive heartbeat). Because table service plans needed to be executed to completion and don't have an active heartbeat these inflight plans cannot be subject to this cleanup. Because an inflight plan remaining on the timeline can degrade performance of reads/writes (as mentioned earlier), a cancellable table service plan should be elligible to be targeted for cleanup if HUDI clean deems that it has remaining inflight for too long (or some other critera). Note that a failed table service should still be able to be safely cleaned up immeditaley - the goal here is just to make sure an inflight plan won't stay on the timeline for an unbounded amount of time but also won't be likely to be prematurely cleaned up by clean before it ha s a chance to be executed. + +## Design +### Enabling a plan to be pre-emptable +To satisfy goal (1), a new config flag "cancellable" can be added to a table service plan. 
A writer that intends to schedule a cancellable table service plan can enable the flag in the serialized plan metadata. Any writer executing the plan can infer that the plan is cancellable, and when trying to commit the instant should abort if it detects that any ingestion write or table service plan (without cancellable config flag) is targeting the same file groups. On the other side, the commit finalization flow for ingestion writers can be updated to ignore any inflight table service plans if they are cancellable. +For the purpose of this design proposal, consider an ingestion job as having three steps: +1. Schedule itself on the timeline with a new instant time in a .requested file +2. Process/record tag incoming records, build a workload profile, and write the updating/replaced file groups to a "inflight" instant file on the timeline. Check for conflicts and abort if needed. +3. Perform write conflict checks and commit the instant on the timeline + +The aforementioned changes to ingestion and table service flow will ensure that in the event of a conflicting ingestion and cancellable table service writer, the ingestion job will take precedence unless the table service job was completed before (2). Since in this scenario the ingestion job will see that a completed instant (a cancellable table service action) conflicts with its ongoing inflight write, and therefore it would not be legal to proceed. Unfourtatnely this means that this design cannot compeletly guarantee that ingestion job will always take precedence. 
But future enhancements/hueristics can be explored to descrease the chance of this scenario, such as +* Have the ingestion writer write a "hint" of possible partitions it might affect in the .requested file, and the cancellable table service writer can check that before commiting the table service plan +* If the cancellable table service writer sees that there is a .requested file for an ingestion action, it can try to wait some time for the .inflight to appear before performing write reconcilation checks + +### Handling cancellation of plans +An additional config "cancellation-policy" can be added to the table service plan to indicate when it is ellgible to be permenatnly rolled back by writers other than the one responsbible for executing the table service. This policy can be a threshold of hours or instants on timeline, where if that # of hours/instants have elapsed since the plan was scheduled, any writer/operation can target it for rollback via clean. This policy should be configured by the writer scheduling a cacnellable table service, based on the amount of time they expect the plan to remain on the timeline before being picked up for execution. For example, if a table service writer is expected to immeditately start executing the plan after scheduling it, the the cancellation-policy can just be a few minutes. On the other hand, if the plan is expected to have its execution deferred to a few hours later, then the cancellation-policy should be more lenient. Note that this cancellation policy is not a repalacement fo r determining wether a table service plan is currently being executed - as wtih ingestion writes, cleanup of a cancellable table service plan should only start once it is confirmed that a ongoing writer is no longer progressing it. 
+ +In order to ensure that other writers can indeed permenantely cancel a cancellable table service plan (such that it can no longer be executed), additional changes to clean and table service flow will be need to be added as well. Two proposals are detailed below. Also, note that the cancellation-policy is only required to be honored by clean: a user can choose setup an application to aggresively clean up a failed cancellable table service plan even if it has not meet the critera for its cancellation-policy yet. This can be useful if a user wants a utility to manually ensure that clean/archival for a dataset progresses immdeitately or knows that a cancellable table service plan will not be attempted again or cleaned up by another writer. Each proposal provides an example on how to achieve this. + +#### (A) Making cancellable plans "mutable" Review Comment: I guess we are not calling out another drawback of this approach. A concurrent table service writer could encounter a FileNotFound issue while accessing the timeline if another writer rolls back the table service of interest, thereby removing all of its meta files from the timeline. Don't we need to handle this? ########## rfc/rfc-79/rfc-79-2.md: ########## @@ -0,0 +1,99 @@ +w<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# Add support for cancellable table service plans
+
+## Proposers
+
+
+## Approvers
+
+## Status
+
+JIRA: HUDI-7946
+
+
+## Abstract
+Table service plans can delay ingestion writes from updating a dataset with recent data if potential write conflicts are detected. Furthermore, a table service plan that isn't executed to completion for a long time (due to repeated failures, application misconfiguration, or insufficient resources) will degrade the read/write performance of a dataset by delaying clean, archival, and metadata table compaction. This is because HUDI table service plans, upon being scheduled, currently must be executed to completion. In addition, a pending plan prevents any ingestion write targeting the same files from succeeding (because it poses a write conflict) and prevents any new table service plan from targeting the same files. Enabling a user to configure a table service plan as "cancellable" can prevent frequent or repeatedly failing table service plans from delaying ingestion. Support for cancellable plans will provide HUDI an avenue to fully cancel a table service plan and allow other table services to proceed.
+
+
+## Background
+### Execution of table services
+The table service operations compact and cluster are by default "immutable" plans, meaning that once a plan is scheduled it will stay as a pending instant until a caller invokes the table service execute API on the table service instant and successfully completes it. Specifically, if an inflight execution fails after transitioning the instant to inflight, the next execution attempt will implicitly create and execute a rollback plan (which will delete all new instant/data files), but will keep the table service plan. This process will repeat until the instant is completed.
The below visualization captures these transitions at a high level.
+
+
+
+## Clean and rollback of failed writes
+The clean table service, in addition to performing a clean action, is responsible for rolling back any failed ingestion writes (non-clustering/non-compaction inflight instants that are not being executed by a writer). This means that table service plans are not currently subject to clean. As detailed below, this proposal for supporting cancellable table services will require enabling clean to target table service plans.
+
+## Goals
+### (1) A cancellable plan should be pre-empted by other writers
+The current requirement that HUDI execute a table service plan to completion forces ingestion writers to abort a commit if a table service plan is conflicting. Because an ingestion writer typically determines the exact file groups it will be updating/replacing after building a workload profile and performing record tagging, the writer may have already spent a lot of time and resources before realizing that it needs to abort. In the face of frequent table service plans or an old inflight plan, this will cause delays in adding recent upstream records to the dataset as well as unnecessarily take away resources (such as Spark executors in the case of the Spark engine) from other applications in the data lake. A cancellable table service plan should avoid this situation by preventing itself from being committed if a conflicting ingestion job has been committed already. In conjunction, any ingestion writer or non-cancellable table service writer should be able to infer that a conflicting inflight table service plan is cancellable, and therefore can be ignored when attempting to commit the instant.
+
+### (2) An inflight cancellable plan should be automatically cleaned up
+Another consequence of this existing table service flow is that a table service plan cannot be subject to clean's rollback of failed writes.
Clean typically performs a rollback of inflight instants that are no longer being progressed by a writer (and have an inactive heartbeat). Because table service plans need to be executed to completion and don't have an active heartbeat, these inflight plans cannot be subject to this cleanup. Because an inflight plan remaining on the timeline can degrade performance of reads/writes (as mentioned earlier), a cancellable table service plan should be eligible to be targeted for cleanup if HUDI clean deems that it has remained inflight for too long (or meets some other criteria). Note that a failed table service should still be able to be safely cleaned up immediately - the goal here is just to make sure an inflight plan won't stay on the timeline for an unbounded amount of time, but also won't be likely to be prematurely cleaned up by clean before it has a chance to be executed.
+
+## Design
+### Enabling a plan to be pre-emptable
+To satisfy goal (1), a new config flag "cancellable" can be added to a table service plan. A writer that intends to schedule a cancellable table service plan can enable the flag in the serialized plan metadata. Any writer executing the plan can infer that the plan is cancellable, and when trying to commit the instant should abort if it detects that any ingestion write or table service plan (without the cancellable config flag) is targeting the same file groups. On the other side, the commit finalization flow for ingestion writers can be updated to ignore any inflight table service plans if they are cancellable.
+For the purpose of this design proposal, consider an ingestion job as having three steps:
+1. Schedule itself on the timeline with a new instant time in a .requested file
+2. Process/record tag incoming records, build a workload profile, and write the updating/replaced file groups to an "inflight" instant file on the timeline. Check for conflicts and abort if needed.
+3. Perform write conflict checks and commit the instant on the timeline
+
+The aforementioned changes to the ingestion and table service flows will ensure that in the event of a conflicting ingestion writer and cancellable table service writer, the ingestion job will take precedence unless the table service job was completed before (2). In that scenario the ingestion job will see that a completed instant (a cancellable table service action) conflicts with its ongoing inflight write, and therefore it would not be legal to proceed. Unfortunately this means that this design cannot completely guarantee that the ingestion job will always take precedence. But future enhancements/heuristics can be explored to decrease the chance of this scenario, such as
+* Have the ingestion writer write a "hint" of possible partitions it might affect in the .requested file, and the cancellable table service writer can check that before committing the table service plan

Review Comment:
We already have early conflict detection support. We can re-use some of the logic from there and not re-invent the wheel.

##########
rfc/rfc-79/rfc-79-2.md:
##########
@@ -0,0 +1,99 @@
+#### (A) Making cancellable plans "mutable"
+Cancellable table service plans can be updated to have a "mutable" plan, in the sense that once a plan is transitioned to inflight, if its execution fails the plan must be rolled back and deleted, similar to rollback of failed ingestion writes. The flow for table service execution will be similar to the existing one for an immutable plan, except that if the plan is targeted by a rollback plan its execution will abort.
+
+
+
+Once cancellable table service plans are made mutable in this manner, clean can roll back failed cancellable table service plans that have met the cancellation-policy criteria, similar to how clean currently rolls back failed ingestion writes.
Specifically, clean can check for any failed cancellable table service plans that are already part of a pending rollback plan or meet the cancellation-policy. From there a rollback can be scheduled/executed for each instant.
+With these changes, a failed cancellable table service plan that has met its cancellation policy will be guaranteed to be attempted for rollback by the next clean. If a user wants to immediately clean up a failed cancellable plan, they can bypass the cancellation policy by scheduling and executing a rollback plan, the same way that clean will clean up these plans.
+
+This meets the criteria for goal (2), but comes with the following drawback:
+* The instant metadata file for the cancellable table service plan will be deleted on rollback, analogous to how rollback of an ingestion instant works. This can make it more difficult to debug failed/stuck cancellable table service plans
+
+#### (B) Adding a cancel operation/state for cancellable plans
+An alternate approach can involve updating the possible timeline actions / states, by making the following changes:
+* Add an ".aborted" state type for cancellable table service plans.
+* Add a new action type "cancel" with two states ".cancel.requested" and ".cancel". The ".cancel.requested" metadata file will be a plan that targets a (cancellable table service) instant. Once said instant is transitioned to the aborted state, the action can be completed and transitioned to ".cancel"
+
+A new cancel API will be added that a writer can use to target a cancellable table service plan to be aborted. It will create a cancel.requested plan for the target instant, and execute it. If a cancel.requested plan for the target already exists, it will try to execute that directly (similar to how the rollback API handles pending rollbacks). Execution of a cancel action involves the following steps
+1. Rollback the instant without deleting the table service plan.
+2. Transition the table service instant to .aborted, if it hasn't been already
+3. Transition the cancel plan to .cancel
+Once the cancel action has been transitioned to ".cancel", it can be considered complete. The reason this cancel action needs a ".requested" state is to allow clean/archival to infer when a cancel action is completed.

Review Comment:
Let's sync up f2f. I feel we can further simplify this.

##########
rfc/rfc-79/rfc-79-2.md:
##########
@@ -0,0 +1,99 @@
+#### (B) Adding a cancel operation/state for cancellable plans
+An alternate approach can involve updating the possible timeline actions / states, by making the following changes:

Review Comment:
In this approach, when exactly will the table service plan's .requested meta file in the timeline be deleted?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
