This is an automated email from the ASF dual-hosted git repository.
sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new bae3635 [HUDI-1273] Adding savepoint and restore docs to website
(#4715)
bae3635 is described below
commit bae363567dcac41e1e39bf858fd2bb91d887c725
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Sun Feb 20 15:49:48 2022 -0500
[HUDI-1273] Adding savepoint and restore docs to website (#4715)
---
website/docs/disaster_recovery.md | 296 ++++++++++++++++++++++++++++++++++++++
website/sidebars.js | 1 +
2 files changed, 297 insertions(+)
diff --git a/website/docs/disaster_recovery.md
b/website/docs/disaster_recovery.md
new file mode 100644
index 0000000..6afe357
--- /dev/null
+++ b/website/docs/disaster_recovery.md
@@ -0,0 +1,296 @@
+---
+title: Disaster Recovery with Apache Hudi
+toc: true
+---
+
+Disaster recovery is mission critical for any software system, and especially so for data systems, where the impact
+can be serious, delaying business decisions or even leading to wrong ones. Apache Hudi has two operations to assist you
+in recovering data from a previous state: "savepoint" and "restore".
+
+## Savepoint
+
+As the name suggests, "savepoint" saves the table as of the commit time, so that you can restore the table to this
+savepoint at a later point in time if need be. Care is taken to ensure the cleaner will not clean up any files that are
+savepointed. On similar lines, a savepoint cannot be triggered on a commit that has already been cleaned up. In simpler
+terms, this is synonymous with taking a backup, except that we don't make a new copy of the table; we just save the
+state of the table so that we can restore it later when in need.
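
To make the retention semantics concrete, here is a minimal, hypothetical sketch (plain Python, not Hudi code) of what savepoint-aware cleaning means: any file still referenced by a savepoint is excluded from the cleaner's candidates.

```python
# Hypothetical sketch of savepoint-aware cleaning; not actual Hudi code.
def files_to_clean(candidate_files, savepointed_files):
    """The cleaner skips any file that a savepoint still references."""
    return [f for f in candidate_files if f not in savepointed_files]

savepointed = {"base_file_v1.parquet"}
candidates = ["base_file_v1.parquet", "base_file_v0.parquet"]
print(files_to_clean(candidates, savepointed))  # only the un-savepointed file is eligible
```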
+
+## Restore
+
+This operation lets you restore your table to one of the savepointed commits. It cannot be undone (or reversed), so
+exercise care before doing a restore. Hudi will delete all data files and commit files (timeline files) greater than the
+savepoint commit to which the table is being restored. You should pause all writes to the table when performing
+a restore, since they are likely to fail while the restore is in progress. Reads could also fail, since snapshot queries
+will be hitting the latest files, which have a high possibility of being deleted during the restore.
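
As an illustration of the timeline semantics (again a plain-Python sketch, not Hudi code): since Hudi instant times are sortable timestamp strings, restoring to a savepointed instant keeps only the instants at or before it.

```python
# Hypothetical sketch of restore semantics; instants are Hudi's sortable timestamps.
def restore_timeline(timeline, savepoint_instant):
    """Instants after the savepoint are deleted, along with their data files."""
    return [t for t in timeline if t <= savepoint_instant]

timeline = ["20220128160040171", "20220128160245447", "20220128160620557"]
print(restore_timeline(timeline, "20220128160245447"))  # the last instant is dropped
```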
+
+## Runbook
+
+Savepoint and restore can only be triggered from hudi-cli. Let's walk through an example of how one can take a savepoint
+and later restore the state of the table.
+
+Let's create a Hudi table via spark-shell and trigger a few batches of inserts.
+
+```scala
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
+val dataGen = new DataGenerator
+
+// spark-shell
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME, tableName).
+ mode(Overwrite).
+ save(basePath)
+```
+
+Each batch inserts 10 records. Repeat the same for 4 more batches.
+```scala
+
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME, tableName).
+ mode(Append).
+ save(basePath)
+```
+
+Total record count should be 50.
+```scala
+val tripsSnapshotDF = spark.
+ read.
+ format("hudi").
+ load(basePath)
+tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
+
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
+
++--------------------------+
+|count(partitionpath, uuid)|
++--------------------------+
+|                        50|
++--------------------------+
+```
+Let's take a look at the timeline after 5 batches of inserts.
+```shell
+ls -ltr /tmp/hudi_trips_cow/.hoodie
+total 128
+drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
+-rw-r--r-- 1 nsb wheel 546 Jan 28 16:00 hoodie.properties
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:00 20220128160040171.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:00 20220128160040171.inflight
+-rw-r--r-- 1 nsb wheel 4374 Jan 28 16:00 20220128160040171.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:01 20220128160124637.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:01 20220128160124637.inflight
+-rw-r--r-- 1 nsb wheel 4414 Jan 28 16:01 20220128160124637.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160226172.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160226172.inflight
+-rw-r--r-- 1 nsb wheel 4427 Jan 28 16:02 20220128160226172.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160229636.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160229636.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160229636.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160245447.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160245447.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit
+```
+
+Let's trigger a savepoint as of the latest commit. A savepoint can only be created via hudi-cli.
+
+```sh
+./hudi-cli.sh
+
+connect --path /tmp/hudi_trips_cow/
+commits show
+set --conf SPARK_HOME=<SPARK_HOME>
+savepoint create --commit 20220128160245447 --sparkMaster local[2]
+```
+
+Let's check the timeline after the savepoint.
+```shell
+ls -ltr /tmp/hudi_trips_cow/.hoodie
+total 136
+drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
+-rw-r--r-- 1 nsb wheel 546 Jan 28 16:00 hoodie.properties
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:00 20220128160040171.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:00 20220128160040171.inflight
+-rw-r--r-- 1 nsb wheel 4374 Jan 28 16:00 20220128160040171.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:01 20220128160124637.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:01 20220128160124637.inflight
+-rw-r--r-- 1 nsb wheel 4414 Jan 28 16:01 20220128160124637.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160226172.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160226172.inflight
+-rw-r--r-- 1 nsb wheel 4427 Jan 28 16:02 20220128160226172.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160229636.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160229636.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160229636.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160245447.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160245447.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:05 20220128160245447.savepoint.inflight
+-rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint
+```
+
+You will notice that savepoint meta files have been added; these keep track of the files that are part of the latest
+table snapshot.
+
+Now, let's continue adding a few more batches of inserts.
+Repeat the below commands 3 times.
+```scala
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME, tableName).
+ mode(Append).
+ save(basePath)
+```
+
+Total record count will be 80, since we have done 8 batches in total (5 until the savepoint and 3 after it).
+```scala
+val tripsSnapshotDF = spark.
+ read.
+ format("hudi").
+ load(basePath)
+tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
+
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
+
++--------------------------+
+|count(partitionpath, uuid)|
++--------------------------+
+|                        80|
++--------------------------+
+```
+
+Let's say something bad happened and you want to restore your table to an older snapshot. As we called out earlier, a
+restore can only be triggered from hudi-cli. And do remember to bring down all of your writer processes while doing a
+restore.
+
+Let's check out the timeline once before we trigger the restore.
+```shell
+ls -ltr /tmp/hudi_trips_cow/.hoodie
+total 208
+drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
+-rw-r--r-- 1 nsb wheel 546 Jan 28 16:00 hoodie.properties
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:00 20220128160040171.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:00 20220128160040171.inflight
+-rw-r--r-- 1 nsb wheel 4374 Jan 28 16:00 20220128160040171.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:01 20220128160124637.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:01 20220128160124637.inflight
+-rw-r--r-- 1 nsb wheel 4414 Jan 28 16:01 20220128160124637.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160226172.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160226172.inflight
+-rw-r--r-- 1 nsb wheel 4427 Jan 28 16:02 20220128160226172.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160229636.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160229636.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160229636.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160245447.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160245447.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:05 20220128160245447.savepoint.inflight
+-rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:06 20220128160620557.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:06 20220128160620557.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160620557.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:06 20220128160627501.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:06 20220128160627501.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160627501.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:06 20220128160630785.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:06 20220128160630785.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160630785.commit
+```
+
+If you are continuing in the same hudi-cli session, you can just execute "refresh" so that the table state is refreshed
+to its latest state. If not, connect to the table again.
+
+```shell
+./hudi-cli.sh
+
+connect --path /tmp/hudi_trips_cow/
+commits show
+set --conf SPARK_HOME=<SPARK_HOME>
+savepoints show
+╔═══════════════════╗
+║ SavepointTime ║
+╠═══════════════════╣
+║ 20220128160245447 ║
+╚═══════════════════╝
+savepoint rollback --savepoint 20220128160245447 --sparkMaster local[2]
+```
+
+The Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline
+files newer than the savepoint should have been deleted.
+```shell
+ls -ltr /tmp/hudi_trips_cow/.hoodie
+total 152
+drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
+-rw-r--r-- 1 nsb wheel 546 Jan 28 16:00 hoodie.properties
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:00 20220128160040171.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:00 20220128160040171.inflight
+-rw-r--r-- 1 nsb wheel 4374 Jan 28 16:00 20220128160040171.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:01 20220128160124637.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:01 20220128160124637.inflight
+-rw-r--r-- 1 nsb wheel 4414 Jan 28 16:01 20220128160124637.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160226172.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160226172.inflight
+-rw-r--r-- 1 nsb wheel 4427 Jan 28 16:02 20220128160226172.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160229636.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160229636.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160229636.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:02 20220128160245447.commit.requested
+-rw-r--r-- 1 nsb wheel 2594 Jan 28 16:02 20220128160245447.inflight
+-rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:05 20220128160245447.savepoint.inflight
+-rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint
+-rw-r--r-- 1 nsb wheel 0 Jan 28 16:07 20220128160732437.restore.inflight
+-rw-r--r-- 1 nsb wheel 4152 Jan 28 16:07 20220128160732437.restore
+```
+
+Let's check the total record count in the table. It should match the count we had just before we triggered the
+savepoint.
+```scala
+val tripsSnapshotDF = spark.
+ read.
+ format("hudi").
+ load(basePath)
+tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
+
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
+
++--------------------------+
+|count(partitionpath, uuid)|
++--------------------------+
+|                        50|
++--------------------------+
+```
+
+As you can see, the entire table state has been restored back to the commit which was savepointed. Users can choose to
+trigger savepoints at a regular cadence and delete older savepoints as new ones are created. hudi-cli has a command,
+"savepoint delete", to assist in deleting a savepoint. Do remember that the cleaner will not clean up files that are
+savepointed, so users should delete savepoints from time to time; otherwise, storage reclamation may not happen.
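
For instance, deleting the savepoint created earlier could look like the following hudi-cli session. This is a sketch: the exact flags of "savepoint delete" are assumed here to mirror "savepoint create", so verify with hudi-cli's built-in help for your Hudi version.

```shell
./hudi-cli.sh

connect --path /tmp/hudi_trips_cow/
savepoints show
savepoint delete --commit 20220128160245447 --sparkMaster local[2]
```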
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/website/sidebars.js b/website/sidebars.js
index d7736c3..fe4b0b4 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -56,6 +56,7 @@ module.exports = {
'transforms',
'markers',
'file_sizing',
+ 'disaster_recovery',
'snapshot_exporter',
'precommit_validator'
],