yihua commented on code in PR #9709:
URL: https://github.com/apache/hudi/pull/9709#discussion_r1326376299


##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 

Review Comment:
   ```suggestion
   detect such partially failed commits, ensure dirty data is not exposed to the queries, and clean them up.
   ```



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 

Review Comment:
   ```suggestion
   third-party system (like a lock provider), or user could kill the job mid-way to change some properties. A well-designed system should
   ```



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 
+We have already took a peek into Hudi’s timeline which forms the core for 
reader and writer isolation. If a commit has 
+not transitioned to complete as per the hudi timeline, the readers will ignore 
the data from the respective write. 
+And so partially failed writes are never read by any readers (for all query 
types). But the curious question is, how 
+does the partially written data is eventually deleted? Does it require manual 
command to be executed from time to time 
+or should it be automatically handled by the system?
+
+### Handling partially failed commits
+Hudi has a lot of platformization built in so as to ease the 
operationalization of lakehouse tables. Once such feature 
+is the automatic clean up of partially failed commits. Users don’t need to run 
any additional commands to clean up dirty 
+data or the data produced by failed commits. If you continue to write to hudi 
tables, one of your future commits will 
+take care of cleaning up older data that failed mid-way during a write/commit. 
This keeps the storage in bounds w/o 
+requiring any manual intervention from the users. 
+
+Let’s zoom in a bit and understand how such clean ups happen and is there any 
challenges involved in such cleaning 
+when multi-writers are involved.
+
+#### Rolling back partially failed commits for a single writer 
+Incase of single writer model, the rollback logic is fairly straightforward. 
Every action in Hudi's timeline, goes 

Review Comment:
   ```suggestion
   In case of single writer model, the rollback logic is fairly straightforward. Every action in Hudi's timeline goes
   ```
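
   The single-writer flow this hunk describes (every instant moves through requested, inflight, and completed; a new commit rolls back any instant that never completed) can be sketched roughly as below. This is a hypothetical illustration only — the `Timeline` class and its method names are invented for the sketch and are not Hudi's actual Java API.

   ```python
   # Hypothetical sketch of single-writer rollback detection (not Hudi's real API).
   # Each instant on the timeline moves through requested -> inflight -> completed.

   REQUESTED, INFLIGHT, COMPLETED = "requested", "inflight", "completed"

   class Timeline:
       def __init__(self):
           self.instants = {}   # commit time -> state
           self.data = {}       # commit time -> data files written so far

       def start_commit(self, ts):
           # A new commit first rolls back any earlier commit that never completed.
           for old_ts, state in list(self.instants.items()):
               if state != COMPLETED:
                   self.rollback(old_ts)
           self.instants[ts] = REQUESTED
           self.data[ts] = []

       def begin_write(self, ts):
           self.instants[ts] = INFLIGHT

       def finish_commit(self, ts):
           self.instants[ts] = COMPLETED

       def rollback(self, ts):
           # Delete the dirty data files, then clean the instant off the timeline.
           self.data.pop(ts, None)
           self.instants.pop(ts, None)

       def readable_commits(self):
           # Readers only ever see completed instants, so dirty data stays hidden.
           return [ts for ts, s in self.instants.items() if s == COMPLETED]

   tl = Timeline()
   tl.start_commit("001"); tl.begin_write("001"); tl.data["001"].append("file_a"); tl.finish_commit("001")
   tl.start_commit("002"); tl.begin_write("002"); tl.data["002"].append("file_b")  # writer crashes here
   tl.start_commit("003")  # detects that 002 never completed and rolls it back
   ```

   Note that the rollback is lazy: nothing happens until the next commit starts, which is why no separate manual clean-up command is needed.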



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 
+We have already took a peek into Hudi’s timeline which forms the core for 
reader and writer isolation. If a commit has 
+not transitioned to complete as per the hudi timeline, the readers will ignore 
the data from the respective write. 
+And so partially failed writes are never read by any readers (for all query 
types). But the curious question is, how 
+does the partially written data is eventually deleted? Does it require manual 
command to be executed from time to time 
+or should it be automatically handled by the system?
+
+### Handling partially failed commits
+Hudi has a lot of platformization built in so as to ease the 
operationalization of lakehouse tables. Once such feature 
+is the automatic clean up of partially failed commits. Users don’t need to run 
any additional commands to clean up dirty 
+data or the data produced by failed commits. If you continue to write to hudi 
tables, one of your future commits will 
+take care of cleaning up older data that failed mid-way during a write/commit. 
This keeps the storage in bounds w/o 
+requiring any manual intervention from the users. 
+
+Let’s zoom in a bit and understand how such clean ups happen and is there any 
challenges involved in such cleaning 

Review Comment:
   ```suggestion
   Let’s zoom in a bit and understand how such clean-ups happen and the challenges involved in such cleaning
   ```



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 
+We have already took a peek into Hudi’s timeline which forms the core for 
reader and writer isolation. If a commit has 
+not transitioned to complete as per the hudi timeline, the readers will ignore 
the data from the respective write. 
+And so partially failed writes are never read by any readers (for all query 
types). But the curious question is, how 
+does the partially written data is eventually deleted? Does it require manual 
command to be executed from time to time 
+or should it be automatically handled by the system?
+
+### Handling partially failed commits
+Hudi has a lot of platformization built in so as to ease the 
operationalization of lakehouse tables. Once such feature 
+is the automatic clean up of partially failed commits. Users don’t need to run 
any additional commands to clean up dirty 

Review Comment:
   ```suggestion
   is the automatic clean-up of partially failed commits. Users don’t need to run any additional commands to clean up dirty
   ```



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 
+We have already took a peek into Hudi’s timeline which forms the core for 
reader and writer isolation. If a commit has 

Review Comment:
   ```suggestion
   We have already taken a peek into Hudi’s timeline which forms the core for reader and writer isolation. If a commit has
   ```



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 
+We have already took a peek into Hudi’s timeline which forms the core for 
reader and writer isolation. If a commit has 
+not transitioned to complete as per the hudi timeline, the readers will ignore 
the data from the respective write. 
+And so partially failed writes are never read by any readers (for all query 
types). But the curious question is, how 
+does the partially written data is eventually deleted? Does it require manual 
command to be executed from time to time 

Review Comment:
   ```suggestion
   is the partially written data eventually deleted? Does it require manual command to be executed from time to time
   ```



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 
+We have already took a peek into Hudi’s timeline which forms the core for 
reader and writer isolation. If a commit has 
+not transitioned to complete as per the hudi timeline, the readers will ignore 
the data from the respective write. 
+And so partially failed writes are never read by any readers (for all query 
types). But the curious question is, how 
+does the partially written data is eventually deleted? Does it require manual 
command to be executed from time to time 
+or should it be automatically handled by the system?
+
+### Handling partially failed commits
+Hudi has a lot of platformization built in so as to ease the 
operationalization of lakehouse tables. Once such feature 
+is the automatic clean up of partially failed commits. Users don’t need to run 
any additional commands to clean up dirty 
+data or the data produced by failed commits. If you continue to write to hudi 
tables, one of your future commits will 
+take care of cleaning up older data that failed mid-way during a write/commit. 
This keeps the storage in bounds w/o 
+requiring any manual intervention from the users. 
+
+Let’s zoom in a bit and understand how such clean ups happen and is there any 
challenges involved in such cleaning 
+when multi-writers are involved.
+
+#### Rolling back partially failed commits for a single writer 
+Incase of single writer model, the rollback logic is fairly straightforward. 
Every action in Hudi's timeline, goes 
+through 3 states, namely requested, inflight and completed. Whenever a new 
commit starts, hudi checks the timeline 
+for any actions/commits that is not yet committed and that refers to partially 
failed commit. So, immediately rollback 
+is triggered and all dirty data is cleaned up followed by cleaning up the 
commit instants from the timeline.
+
+
+![An example illustration of single writer 
rollbacks](/assets/images/blog/rollbacks/single_write_rollback.png)
+
+
+#### Rolling back of partially failed commits w/ multi-writers
+The challenging part is when multi-writers are invoked. Just because, some 
commit is still non-completed as per the 
+timeline, it does not mean current writer (new) can assume its a partially 
failed commit. Because, there could be a 

Review Comment:
   ```suggestion
   timeline, it does not mean current writer (new) can assume that it's a partially failed commit. Because, there could be a
   ```



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 
+We have already took a peek into Hudi’s timeline which forms the core for 
reader and writer isolation. If a commit has 
+not transitioned to complete as per the hudi timeline, the readers will ignore 
the data from the respective write. 
+And so partially failed writes are never read by any readers (for all query 
types). But the curious question is, how 
+does the partially written data is eventually deleted? Does it require manual 
command to be executed from time to time 
+or should it be automatically handled by the system?
+
+### Handling partially failed commits
+Hudi has a lot of platformization built in so as to ease the 
operationalization of lakehouse tables. Once such feature 
+is the automatic clean up of partially failed commits. Users don’t need to run 
any additional commands to clean up dirty 
+data or the data produced by failed commits. If you continue to write to hudi 
tables, one of your future commits will 
+take care of cleaning up older data that failed mid-way during a write/commit. 
This keeps the storage in bounds w/o 
+requiring any manual intervention from the users. 
+
+Let’s zoom in a bit and understand how such clean ups happen and is there any 
challenges involved in such cleaning 
+when multi-writers are involved.
+
+#### Rolling back partially failed commits for a single writer 
+Incase of single writer model, the rollback logic is fairly straightforward. 
Every action in Hudi's timeline, goes 
+through 3 states, namely requested, inflight and completed. Whenever a new 
commit starts, hudi checks the timeline 
+for any actions/commits that is not yet committed and that refers to partially 
failed commit. So, immediately rollback 
+is triggered and all dirty data is cleaned up followed by cleaning up the 
commit instants from the timeline.
+
+
+![An example illustration of single writer 
rollbacks](/assets/images/blog/rollbacks/single_write_rollback.png)
+
+
+#### Rolling back of partially failed commits w/ multi-writers
+The challenging part is when multi-writers are invoked. Just because, some 
commit is still non-completed as per the 
+timeline, it does not mean current writer (new) can assume its a partially 
failed commit. Because, there could be a 
+concurrent writer that’s currently making progress. And Hudi has been designed 
to not have any centralized server 
+running always and so hudi has a  ingenious way to deduce such partially 
failed writes.

Review Comment:
   ```suggestion
   running always and in such a case Hudi has an ingenious way to deduce such partially failed writes.
   ```



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 
+We have already took a peek into Hudi’s timeline which forms the core for 
reader and writer isolation. If a commit has 
+not transitioned to complete as per the hudi timeline, the readers will ignore 
the data from the respective write. 
+And so partially failed writes are never read by any readers (for all query 
types). But the curious question is, how 
+does the partially written data is eventually deleted? Does it require manual 
command to be executed from time to time 
+or should it be automatically handled by the system?
+
+### Handling partially failed commits
+Hudi has a lot of platformization built in so as to ease the 
operationalization of lakehouse tables. Once such feature 
+is the automatic clean up of partially failed commits. Users don’t need to run 
any additional commands to clean up dirty 
+data or the data produced by failed commits. If you continue to write to hudi 
tables, one of your future commits will 
+take care of cleaning up older data that failed mid-way during a write/commit. 
This keeps the storage in bounds w/o 
+requiring any manual intervention from the users. 
+
+Let’s zoom in a bit and understand how such clean ups happen and is there any 
challenges involved in such cleaning 
+when multi-writers are involved.
+
+#### Rolling back partially failed commits for a single writer 
+Incase of single writer model, the rollback logic is fairly straightforward. 
Every action in Hudi's timeline, goes 
+through 3 states, namely requested, inflight and completed. Whenever a new 
commit starts, hudi checks the timeline 
+for any actions/commits that is not yet committed and that refers to partially 
failed commit. So, immediately rollback 
+is triggered and all dirty data is cleaned up followed by cleaning up the 
commit instants from the timeline.
+
+
+![An example illustration of single writer 
rollbacks](/assets/images/blog/rollbacks/single_write_rollback.png)
+
+
+#### Rolling back of partially failed commits w/ multi-writers
+The challenging part is when multi-writers are invoked. Just because, some 
commit is still non-completed as per the 

Review Comment:
   ```suggestion
   The challenging part is when multi-writers are invoked. Just because a commit is still non-completed as per the
   ```



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 
+We have already took a peek into Hudi’s timeline which forms the core for 
reader and writer isolation. If a commit has 
+not transitioned to complete as per the hudi timeline, the readers will ignore 
the data from the respective write. 
+And so partially failed writes are never read by any readers (for all query 
types). But the curious question is, how 
+does the partially written data is eventually deleted? Does it require manual 
command to be executed from time to time 
+or should it be automatically handled by the system?
+
+### Handling partially failed commits
+Hudi has a lot of platformization built in so as to ease the 
operationalization of lakehouse tables. Once such feature 
+is the automatic clean up of partially failed commits. Users don’t need to run 
any additional commands to clean up dirty 
+data or the data produced by failed commits. If you continue to write to hudi 
tables, one of your future commits will 
+take care of cleaning up older data that failed mid-way during a write/commit. 
This keeps the storage in bounds w/o 
+requiring any manual intervention from the users. 
+
+Let’s zoom in a bit and understand how such clean ups happen and is there any 
challenges involved in such cleaning 
+when multi-writers are involved.
+
+#### Rolling back partially failed commits for a single writer 
+Incase of single writer model, the rollback logic is fairly straightforward. 
Every action in Hudi's timeline, goes 
+through 3 states, namely requested, inflight and completed. Whenever a new 
commit starts, hudi checks the timeline 
+for any actions/commits that is not yet committed and that refers to partially 
failed commit. So, immediately rollback 
+is triggered and all dirty data is cleaned up followed by cleaning up the 
commit instants from the timeline.
+
+
+![An example illustration of single writer 
rollbacks](/assets/images/blog/rollbacks/single_write_rollback.png)
+
+
+#### Rolling back of partially failed commits w/ multi-writers
+The challenging part is when multi-writers are invoked. Just because, some 
commit is still non-completed as per the 
+timeline, it does not mean current writer (new) can assume its a partially 
failed commit. Because, there could be a 
+concurrent writer that’s currently making progress. And Hudi has been designed 
to not have any centralized server 
+running always and so hudi has a  ingenious way to deduce such partially 
failed writes.
+
+##### Heart beats to the rescue

Review Comment:
   ```suggestion
   ##### Heartbeats to the rescue
   ```
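
   The heartbeat idea this hunk introduces — a new writer treats a non-completed commit as failed only once that commit's heartbeat has expired, since a live concurrent writer keeps refreshing it — can be sketched as below. This is an illustrative sketch only; the function names, the in-memory heartbeat store, and the expiry value are invented and do not reflect Hudi's actual implementation.

   ```python
   # Hypothetical sketch of heartbeat-based detection of failed writers
   # (illustrative only, not Hudi's actual implementation).
   import time

   HEARTBEAT_EXPIRY_SECS = 2.0

   heartbeats = {}  # commit time -> last heartbeat timestamp (epoch seconds)

   def emit_heartbeat(ts, now=None):
       # An active writer periodically refreshes its commit's heartbeat.
       heartbeats[ts] = time.time() if now is None else now

   def is_failed(ts, now=None):
       # A non-completed commit is presumed failed only once its heartbeat
       # has expired; otherwise a concurrent writer may still be working on it.
       now = time.time() if now is None else now
       last = heartbeats.get(ts)
       return last is None or (now - last) > HEARTBEAT_EXPIRY_SECS

   # Writer for commit "002" heartbeats at t=100, then crashes.
   emit_heartbeat("002", now=100.0)
   alive_soon_after = not is_failed("002", now=101.0)  # within expiry: leave it alone
   failed_later = is_failed("002", now=110.0)          # expired: safe to roll back
   ```

   This is how liveness can be deduced without a central server: the heartbeat lives in shared storage next to the table, so any writer can check it.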



##########
website/docs/rollbacks.md:
##########
@@ -0,0 +1,67 @@
+---
+title: Partially Failed Commits
+toc: true
+---
+
+## Partially failed commits
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in 
the code, unavailability of any external 
+third party system (like lock provider), or user could kill mid-way to change 
some properties. A well designed system should 
+detect such partially failed commits and ensure dirty data is not exposed to 
the read queries and also clean them up. 
+We have already took a peek into Hudi’s timeline which forms the core for 
reader and writer isolation. If a commit has 
+not transitioned to complete as per the hudi timeline, the readers will ignore 
the data from the respective write. 
+And so partially failed writes are never read by any readers (for all query 
types). But the curious question is, how 
+does the partially written data is eventually deleted? Does it require manual 
command to be executed from time to time 
+or should it be automatically handled by the system?
+
+### Handling partially failed commits
+Hudi has a lot of platformization built in so as to ease the 
operationalization of lakehouse tables. Once such feature 
+is the automatic clean up of partially failed commits. Users don’t need to run 
any additional commands to clean up dirty 
+data or the data produced by failed commits. If you continue to write to hudi 
tables, one of your future commits will 
+take care of cleaning up older data that failed mid-way during a write/commit. 
This keeps the storage in bounds w/o 
+requiring any manual intervention from the users. 
+
+Let’s zoom in a bit and understand how such clean ups happen and is there any 
challenges involved in such cleaning 
+when multi-writers are involved.
+
+#### Rolling back partially failed commits for a single writer 
+Incase of single writer model, the rollback logic is fairly straightforward. 
Every action in Hudi's timeline, goes 
+through 3 states, namely requested, inflight and completed. Whenever a new 
commit starts, hudi checks the timeline 
+for any actions/commits that is not yet committed and that refers to partially 
failed commit. So, immediately rollback 
+is triggered and all dirty data is cleaned up followed by cleaning up the 
commit instants from the timeline.
+
+
+![An example illustration of single writer 
rollbacks](/assets/images/blog/rollbacks/single_write_rollback.png)
+
+
+#### Rolling back of partially failed commits w/ multi-writers
+The challenging part is when multi-writers are invoked. Just because, some 
commit is still non-completed as per the 
+timeline, it does not mean current writer (new) can assume its a partially 
failed commit. Because, there could be a 
+concurrent writer that’s currently making progress. And Hudi has been designed 
to not have any centralized server 

Review Comment:
   ```suggestion
   concurrent writer that’s currently making progress. Hudi has been designed not to have any centralized server
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
