This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 401733138a6 [HUDI-6856] Adding a page for partially failed commits
(#9709)
401733138a6 is described below
commit 401733138a62cc264eb3590f80ddeeeec0fe2a05
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Wed Sep 20 19:33:27 2023 -0400
[HUDI-6856] Adding a page for partially failed commits (#9709)
* Adding a page for partially failed commits
* address feedback
* Fix intro in rollback page
---------
Co-authored-by: Jonathan Vexler <=>
Co-authored-by: Jon Vexler <[email protected]>
Co-authored-by: Bhavani Sudha Saktheeswaran
<[email protected]>
---
website/docs/rollbacks.md | 69 +++++++++++++++++++++
website/sidebars.js | 1 +
.../blog/rollbacks/multi_writer_rollback.png | Bin 0 -> 262204 bytes
.../blog/rollbacks/single_write_rollback.png | Bin 0 -> 182529 bytes
4 files changed, 70 insertions(+)
diff --git a/website/docs/rollbacks.md b/website/docs/rollbacks.md
new file mode 100644
index 00000000000..295ce70c0dc
--- /dev/null
+++ b/website/docs/rollbacks.md
@@ -0,0 +1,69 @@
+---
+title: Rollback Mechanism
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+---
+
+Your pipelines could fail due to numerous reasons like crashes, valid bugs in
the code, unavailability of any external
+third-party system (like a lock provider), or user could kill the job midway
to change some properties. A well-designed
+system should detect such partially failed commits, ensure dirty data is not
exposed to the queries, and clean them up.
+Hudi's rollback mechanism takes care of cleaning up such failed writes.
+
+Hudi’s timeline forms the core for reader and writer isolation. If a commit
has not transitioned to complete as per the
+hudi timeline, the readers will ignore the data from the respective write. And
so partially failed writes are never read
+by any readers (for all query types). But the curious question is, how is the
partially written data eventually deleted?
+Does it require manual command to be executed from time to time or should it
be automatically handled by the system? This
+page presents insights on how "rollback" in Hudi can automatically clean up
handling partially failed writes without
+manual input from users.
+
+### Handling partially failed commits
+Hudi has a lot of platformization built in so as to ease the
operationalization of lakehouse tables. One such feature
+is the automatic cleanup of partially failed commits. Users don’t need to run
any additional commands to clean up dirty
+data or the data produced by failed commits. If you continue to write to hudi
tables, one of your future commits will
+take care of cleaning up older data that failed midway during a write/commit.
We call this cleanup of a failed commit a
+"rollback". A rollback will revert everything about a commit, including
deleting data and removal from the timeline.
+Additionally, the restore operation utilizes a series rollbacks to undo
completed commits.
+
+Let’s zoom in a bit and understand how such cleanups happen and the challenges
involved in such cleaning when
+multi-writers are involved.
+
+### Rolling back partially failed commits for a single writer
+In case of single writer model, the rollback logic is fairly straightforward.
Every action in Hudi's timeline goes
+through 3 states, namely requested, inflight and completed. Whenever a new
commit starts, hudi checks the timeline
+for any actions/commits that is not yet committed and that refers to partially
failed commit. So, immediately rollback
+is triggered and all dirty data is cleaned up followed by cleaning up the
commit instants from the timeline.
+
+
+
+_Figure 1: single writer with eager rollbacks_
+
+
+### Rolling back of partially failed commits w/ multi-writers
+The challenging part is when multi-writers are invoked. Just because a commit
is still non-completed as per the
+timeline, it does not mean current writer (new) can assume that it's a
partially failed commit. Because, there could be
+a concurrent writer that’s currently making progress. Hudi has been designed
to not have any centralized server
+running always and in such a case Hudi has an ingenious way to deduce such
partially failed writes.
+
+#### Heartbeats
+We are leveraging heartbeats to our rescue here. Each commit will keep
emitting heartbeats from the start of the
+write until its completion. During rollback deduction, Hudi checks for
heartbeat timeouts for all ongoing or incomplete
+commits and detects partially failed commits on such timeouts. For any ongoing
commits, the heartbeat should not
+have elapsed the timeout. For example, if a commit’s heartbeat is not updated
for 10+ mins, we can safely assume the
+original writer has failed/crashed and is the incomplete commit is safe to
clean up. So, the rollbacks in case of
+multi-writers are lazy and is not eager as we saw with single writer model.
But it is still automatic and users don’t
+need to execute any explicit command to trigger such cleanup of failed writes.
When such lazy rollback kicks in, both
+data files and timeline files for the failed writes are deleted.
+
+Hudi employs a simple yet effective heartbeat mechanism to notify that a
commit is still making progress. A heartbeat
+file is created for every commit under “.hoodie/.heartbeat/” (for eg,
“.hoodie/.heartbeat/20230819183853177”).
+The writer will start a background thread which will keep updating this
heartbeat file at a regular cadence to refresh
+the last modification time of the file. So, checking for last modification
time of the heartbeat file gives us
+information whether the writer that started the commit of interest is still
making progress or not. On completion of
+the commit, the heartbeat file is deleted. Or if the write failed midway, the
last modification time of the heartbeat
+file is no longer updated, so other writers can deduce the failed write after
a period of time elapses.
+
+
+_Figure 2: multi-writer with lazy cleaning of failed commits_
+
+
diff --git a/website/sidebars.js b/website/sidebars.js
index eee5f1b0f5d..e4e301095de 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -72,6 +72,7 @@ module.exports = {
'metadata_indexing',
'hoodie_cleaner',
'transforms',
+ 'rollbacks',
'markers',
'file_sizing',
'disaster_recovery',
diff --git
a/website/static/assets/images/blog/rollbacks/multi_writer_rollback.png
b/website/static/assets/images/blog/rollbacks/multi_writer_rollback.png
new file mode 100644
index 00000000000..f56beb7e849
Binary files /dev/null and
b/website/static/assets/images/blog/rollbacks/multi_writer_rollback.png differ
diff --git
a/website/static/assets/images/blog/rollbacks/single_write_rollback.png
b/website/static/assets/images/blog/rollbacks/single_write_rollback.png
new file mode 100644
index 00000000000..5df7c4160ad
Binary files /dev/null and
b/website/static/assets/images/blog/rollbacks/single_write_rollback.png differ