This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 401733138a6 [HUDI-6856] Adding a page for partially failed commits 
(#9709)
401733138a6 is described below

commit 401733138a62cc264eb3590f80ddeeeec0fe2a05
Author: Sivabalan Narayanan <n.siv...@gmail.com>
AuthorDate: Wed Sep 20 19:33:27 2023 -0400

    [HUDI-6856] Adding a page for partially failed commits (#9709)
    
    * Adding a page for partially failed commits
    
    * address feedback
    
    * Fix intro in rollback page
    
    ---------
    
    Co-authored-by: Jonathan Vexler <=>
    Co-authored-by: Jon Vexler <jbvex...@gmail.com>
    Co-authored-by: Bhavani Sudha Saktheeswaran 
<2179254+bhasu...@users.noreply.github.com>
---
 website/docs/rollbacks.md                          |  69 +++++++++++++++++++++
 website/sidebars.js                                |   1 +
 .../blog/rollbacks/multi_writer_rollback.png       | Bin 0 -> 262204 bytes
 .../blog/rollbacks/single_write_rollback.png       | Bin 0 -> 182529 bytes
 4 files changed, 70 insertions(+)

diff --git a/website/docs/rollbacks.md b/website/docs/rollbacks.md
new file mode 100644
index 00000000000..295ce70c0dc
--- /dev/null
+++ b/website/docs/rollbacks.md
@@ -0,0 +1,69 @@
+---
+title: Rollback Mechanism
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+---
+
+Your pipelines could fail for numerous reasons: crashes, bugs in the code, unavailability of an external 
+third-party system (like a lock provider), or a user killing the job midway to change some properties. A well-designed 
+system should detect such partially failed commits, ensure dirty data is not exposed to queries, and clean it up.
+Hudi's rollback mechanism takes care of cleaning up such failed writes. 
+
+Hudi’s timeline forms the core of reader and writer isolation. If a commit has not transitioned to completed as per the
+Hudi timeline, readers will ignore the data from the respective write. As a result, partially failed writes are never read
+by any reader (for all query types). But the natural question is: how is the partially written data eventually deleted? 
+Does it require a manual command to be executed from time to time, or is it handled automatically by the system? This
+page explains how "rollback" in Hudi automatically cleans up partially failed writes without 
+manual input from users.
+
+### Handling partially failed commits
+Hudi has many platform services built in to ease the operation of lakehouse tables. One such feature 
+is the automatic cleanup of partially failed commits. Users don’t need to run any additional commands to clean up dirty 
+data or the data produced by failed commits. If you continue to write to Hudi tables, one of your future commits will 
+take care of cleaning up older data that failed midway during a write/commit. We call this cleanup of a failed commit a 
+"rollback". A rollback reverts everything about a commit, including deleting its data and removing it from the timeline. 
+Additionally, the restore operation uses a series of rollbacks to undo completed commits.
+
+Let’s zoom in a bit and understand how such cleanups happen and the challenges that arise when 
+multiple writers are involved.
+
+### Rolling back partially failed commits for a single writer
+In the single-writer model, the rollback logic is fairly straightforward. Every action in Hudi's timeline goes 
+through three states: requested, inflight, and completed. Whenever a new commit starts, Hudi checks the timeline 
+for any actions/commits that are not yet completed; with a single writer, any such action must be a partially failed commit. 
+So a rollback is triggered immediately: all dirty data is cleaned up, followed by removal of the commit instants from the timeline.
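
For illustration, the eager single-writer check described above can be sketched in Python. This is a simplified in-memory model under stated assumptions, not Hudi's implementation: `Instant`, `find_pending`, and `eager_rollback` are hypothetical names, and the timeline is modeled as a plain list rather than files under the table's metadata directory.

```python
from dataclasses import dataclass

# The three states every timeline action passes through.
REQUESTED, INFLIGHT, COMPLETED = "requested", "inflight", "completed"


@dataclass
class Instant:
    """Hypothetical stand-in for one action on the timeline."""
    time: str   # instant time, e.g. "20230819183853177"
    state: str  # one of REQUESTED, INFLIGHT, COMPLETED


def find_pending(timeline):
    """Return every instant that has not reached the completed state."""
    return [i for i in timeline if i.state != COMPLETED]


def eager_rollback(timeline):
    """Single-writer assumption: every pending instant is a failed write.

    Returns the cleaned timeline and the instant times that were rolled back.
    A real rollback would also delete the data written by those instants;
    this sketch only models the removal from the timeline.
    """
    failed = find_pending(timeline)
    remaining = [i for i in timeline if i.state == COMPLETED]
    return remaining, [i.time for i in failed]
```

The sketch rolls back every pending instant without further checks, which is only safe because a single writer cannot have another commit in progress when a new one starts.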
+
+
+![An example illustration of single writer 
rollbacks](/assets/images/blog/rollbacks/single_write_rollback.png)
+_Figure 1: single writer with eager rollbacks_
+
+
+### Rolling back of partially failed commits w/ multi-writers
+The challenging part is when multiple writers are involved. Just because a commit is still incomplete as per the 
+timeline, the current (new) writer cannot assume that it is a partially failed commit, because there could be 
+a concurrent writer that is still making progress. Hudi is designed to not depend on any always-running centralized 
+server, so it needs another way to deduce such partially failed writes.
+
+#### Heartbeats
+Hudi uses heartbeats to solve this. Each commit keeps emitting heartbeats from the start of the 
+write until its completion. When deciding what to roll back, Hudi checks for heartbeat timeouts on all ongoing or incomplete 
+commits and treats a commit as partially failed when its heartbeat has timed out. For a commit that is still in progress, 
+the heartbeat will not have exceeded the timeout. For example, if a commit’s heartbeat has not been updated for 10+ minutes, 
+we can safely assume the original writer has failed/crashed and the incomplete commit is safe to clean up. So rollbacks with 
+multiple writers are lazy, not eager as we saw with the single-writer model. They are still automatic, though, and users don’t 
+need to execute any explicit command to trigger cleanup of failed writes. When such a lazy rollback kicks in, both 
+data files and timeline files for the failed writes are deleted.
+
+Hudi employs a simple yet effective heartbeat mechanism to signal that a commit is still making progress. A heartbeat 
+file is created for every commit under “.hoodie/.heartbeat/” (e.g., “.hoodie/.heartbeat/20230819183853177”). 
+The writer starts a background thread that keeps updating this heartbeat file at a regular cadence to refresh
+the file's last modification time. Checking the last modification time of the heartbeat file therefore tells us 
+whether the writer that started the commit of interest is still making progress. On completion of 
+the commit, the heartbeat file is deleted. If the write fails midway, the last modification time of the heartbeat 
+file is no longer updated, so other writers can deduce the failed write once the timeout elapses.
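
The modification-time check described above can be sketched in Python as follows. This is a minimal illustration, not Hudi's code: the helper names (`heartbeat_path`, `touch_heartbeat`, `is_heartbeat_expired`) and the `timeout_secs` parameter are hypothetical, and treating a missing heartbeat file as expired is an assumption of this sketch. Only the `.hoodie/.heartbeat/<instant time>` layout comes from the text.

```python
import os
import time


def heartbeat_path(table_path, instant_time):
    """Location of the heartbeat file for one commit, per the layout above."""
    return os.path.join(table_path, ".hoodie", ".heartbeat", instant_time)


def touch_heartbeat(table_path, instant_time):
    """Create the heartbeat file if needed and refresh its modification time.

    A writer would call this periodically from a background thread.
    """
    path = heartbeat_path(table_path, instant_time)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a"):
        pass
    os.utime(path, None)  # set access and modification times to "now"
    return path


def is_heartbeat_expired(table_path, instant_time, timeout_secs, now=None):
    """True if the heartbeat was not refreshed within timeout_secs.

    Assumption of this sketch: an incomplete commit with no heartbeat
    file at all is treated as expired.
    """
    path = heartbeat_path(table_path, instant_time)
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) > timeout_secs
```

With a 10-minute timeout as in the example above, another writer would call `is_heartbeat_expired(table_path, instant_time, 600)` for each incomplete commit and roll back only those for which it returns `True`.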
+
+![An example illustration of multi writer 
rollbacks](/assets/images/blog/rollbacks/multi_writer_rollback.png)
+_Figure 2: multi-writer with lazy cleaning of failed commits_
+
+
diff --git a/website/sidebars.js b/website/sidebars.js
index eee5f1b0f5d..e4e301095de 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -72,6 +72,7 @@ module.exports = {
                 'metadata_indexing',
                 'hoodie_cleaner',
                 'transforms',
+                'rollbacks',
                 'markers',
                 'file_sizing',
                 'disaster_recovery',
diff --git 
a/website/static/assets/images/blog/rollbacks/multi_writer_rollback.png 
b/website/static/assets/images/blog/rollbacks/multi_writer_rollback.png
new file mode 100644
index 00000000000..f56beb7e849
Binary files /dev/null and 
b/website/static/assets/images/blog/rollbacks/multi_writer_rollback.png differ
diff --git 
a/website/static/assets/images/blog/rollbacks/single_write_rollback.png 
b/website/static/assets/images/blog/rollbacks/single_write_rollback.png
new file mode 100644
index 00000000000..5df7c4160ad
Binary files /dev/null and 
b/website/static/assets/images/blog/rollbacks/single_write_rollback.png differ
