This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 7dde9247a7b [MINOR] update disaster recovery docs (#10164)
7dde9247a7b is described below

commit 7dde9247a7ba6a114a1308e8801ac122fdb5d148
Author: Sagar Lakshmipathy <18vidhyasa...@gmail.com>
AuthorDate: Tue Nov 28 04:40:20 2023 -0800

    [MINOR] update disaster recovery docs (#10164)

    * updated schema evolution docs

    * added a for loop to loop through the inserts
    fixed indentation
    added note to make sure users replace the commit and savepoint ts

    * Revert "added a for loop to loop through the inserts"

    This reverts commit 73f274546ebd281fe0c10fb77451ebf3fc975098.

    * added a for loop to loop through the inserts
    added note to make sure users replace the commit and savepoint ts
    made markdown edits
    fixed indentation

    * Revert "updated schema evolution docs"

    This reverts commit c174805e428b246faff48d3e3f292bc88f493e6a.

    * tested and propagated the changes all the way upto 0.12.0
---
 website/docs/disaster_recovery.md                  | 149 ++++++++++----------
 .../version-0.12.0/disaster_recovery.md            | 156 +++++++++++----------
 .../version-0.12.1/disaster_recovery.md            | 156 +++++++++++----------
 .../version-0.12.2/disaster_recovery.md            | 156 +++++++++++----------
 .../version-0.12.3/disaster_recovery.md            | 156 +++++++++++----------
 .../version-0.13.0/disaster_recovery.md            | 156 +++++++++++----------
 .../version-0.13.1/disaster_recovery.md            | 149 ++++++++++----------
 .../version-0.14.0/disaster_recovery.md            | 149 ++++++++++----------
 8 files changed, 627 insertions(+), 600 deletions(-)

diff --git a/website/docs/disaster_recovery.md b/website/docs/disaster_recovery.md
index c2f53bc8cd7..b95085d358b 100644
--- a/website/docs/disaster_recovery.md
+++ b/website/docs/disaster_recovery.md
@@ -3,32 +3,32 @@ title: Disaster Recovery
 toc: true
 ---
 
-Disaster Recovery is very much mission critical for any software. 
Especially when it comes to data systems, the impact could be very serious +Disaster Recovery is very much mission-critical for any software. Especially when it comes to data systems, the impact could be very serious leading to delay in business decisions or even wrong business decisions at times. Apache Hudi has two operations to assist you in recovering -data from a previous state: "savepoint" and "restore". +data from a previous state: `savepoint` and `restore`. ## Savepoint -As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this -savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed. -On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous -to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that -we can restore it later when in need. +As the name suggest, `savepoint` saves the table as of the commit time, so that it lets you restore the table to this +savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed. +On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous +to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that +we can restore it later when in need. ## Restore -This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care +This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care should be taken before doing a restore. 
Hudi will delete all data files and commit files (timeline files) greater than the savepoint commit to which the table is being restored. You should pause all writes to the table when performing -a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries -will be hitting latest files which has high possibility of getting deleted with restore. +a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries +will be hitting latest files which has high possibility of getting deleted with restore. ## Runbook -Savepoint and restore can only be triggered from hudi-cli. Lets walk through an example of how one can take savepoint -and later restore the state of the table. +Savepoint and restore can only be triggered from `hudi-cli`. Let's walk through an example of how one can take savepoint +and later restore the state of the table. -Lets create a hudi table via spark-shell. I am going to trigger few batches of inserts. +Let's create a hudi table via `spark-shell` and trigger a batch of inserts. ```scala import org.apache.hudi.QuickstartUtils._ @@ -42,7 +42,6 @@ val tableName = "hudi_trips_cow" val basePath = "file:///tmp/hudi_trips_cow" val dataGen = new DataGenerator -// spark-shell val inserts = convertToStringList(dataGen.generateInserts(10)) val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) df.write.format("hudi"). @@ -55,22 +54,23 @@ df.write.format("hudi"). save(basePath) ``` -Each batch inserst 10 records. Repeating for 4 more batches. +Let's add four more batches of inserts. ```scala - -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). 
- option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 4) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` -Total record count should be 50. +Total record count should be 50. ```scala val tripsSnapshotDF = spark. read. @@ -78,15 +78,15 @@ val tripsSnapshotDF = spark. load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 50| - +--------------------------+ ++--------------------------+ ``` -Let's take a look at the timeline after 5 batch of inserts. +Let's take a look at the timeline after 5 batches of inserts. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 128 @@ -109,7 +109,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit ``` -Let's trigger a savepoint as of the latest commit. Savepoint can only be done via hudi-cli. +Let's trigger a savepoint as of the latest commit. Savepoint can only be done via `hudi-cli`. ```sh ./hudi-cli.sh @@ -120,7 +120,11 @@ set --conf SPARK_HOME=<SPARK_HOME> savepoint create --commit 20220128160245447 --sparkMaster local[2] ``` -Let's check the timeline after savepoint. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest commit in your table. +::: + +Let's check the timeline after savepoint. 
```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 136 @@ -145,21 +149,23 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint ``` -You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot. +You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot. + +Now, let's continue adding three more batches of inserts. -Now, lets continue adding few more batches of inserts. -Repeat below commands for 3 times. ```scala -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). - option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 3) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` Total record count will be 80 since we have done 8 batches in total. (5 until savepoint and 3 after savepoint) @@ -170,18 +176,18 @@ val tripsSnapshotDF = spark. 
load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 80| - +--------------------------+ ++--------------------------+ ``` -Let's say something bad happened and you want to restore your table to a older snapshot. As we called out earlier, we can -trigger restore only from hudi-cli. And do remember to bring down all of your writer processes while doing a restore. +Let's say something bad happened, and you want to restore your table to an older snapshot. As we called out earlier, we can +trigger restore only from `hudi-cli`. And do remember to bring down all of your writer processes while doing a restore. -Lets checkout timeline once, before we trigger the restore. +Let's checkout timeline once, before we trigger the restore. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 208 @@ -215,8 +221,8 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160630785.commit ``` -If you are continuing in the same hudi-cli session, you can just execute "refresh" so that table state gets refreshed to -its latest state. If not, connect to the table again. +If you are continuing in the same `hudi-cli` session, you can just execute `refresh` so that table state gets refreshed to +its latest state. If not, connect to the table again. ```shell ./hudi-cli.sh @@ -233,8 +239,12 @@ savepoints show savepoint rollback --savepoint 20220128160245447 --sparkMaster local[2] ``` -Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have -been deleted. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest savepoint in your table. 
+::: + +Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have +been deleted. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 152 @@ -261,7 +271,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4152 Jan 28 16:07 20220128160732437.restore ``` -Lets check the total record count in the table. Should match the records we had, just before we triggered the savepoint. +Let's check the total record count in the table. Should match the records we had, just before we triggered the savepoint. ```scala val tripsSnapshotDF = spark. read. @@ -269,28 +279,17 @@ val tripsSnapshotDF = spark. load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 50| - +--------------------------+ ++--------------------------+ ``` -As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint -at regular cadence and keep deleting older savepoints when new ones are created. Hudi-cli has a command "savepoint delete" -to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users -should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen. +As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint +at regular cadence and keep deleting older savepoints when new ones are created. `hudi-cli` has a command `savepoint delete` +to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. 
And so users +should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen. Note: Savepoint and restore for MOR table is available only from 0.11. - - - - - - - - - - - diff --git a/website/versioned_docs/version-0.12.0/disaster_recovery.md b/website/versioned_docs/version-0.12.0/disaster_recovery.md index c2f53bc8cd7..0f7c198d99b 100644 --- a/website/versioned_docs/version-0.12.0/disaster_recovery.md +++ b/website/versioned_docs/version-0.12.0/disaster_recovery.md @@ -3,32 +3,32 @@ title: Disaster Recovery toc: true --- -Disaster Recovery is very much mission critical for any software. Especially when it comes to data systems, the impact could be very serious +Disaster Recovery is very much mission-critical for any software. Especially when it comes to data systems, the impact could be very serious leading to delay in business decisions or even wrong business decisions at times. Apache Hudi has two operations to assist you in recovering -data from a previous state: "savepoint" and "restore". +data from a previous state: `savepoint` and `restore`. ## Savepoint -As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this -savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed. -On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous -to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that -we can restore it later when in need. +As the name suggest, `savepoint` saves the table as of the commit time, so that it lets you restore the table to this +savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed. +On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. 
In simpler terms, this is synonymous +to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that +we can restore it later when in need. ## Restore -This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care +This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care should be taken before doing a restore. Hudi will delete all data files and commit files (timeline files) greater than the savepoint commit to which the table is being restored. You should pause all writes to the table when performing -a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries -will be hitting latest files which has high possibility of getting deleted with restore. +a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries +will be hitting latest files which has high possibility of getting deleted with restore. ## Runbook -Savepoint and restore can only be triggered from hudi-cli. Lets walk through an example of how one can take savepoint -and later restore the state of the table. +Savepoint and restore can only be triggered from `hudi-cli`. Let's walk through an example of how one can take savepoint +and later restore the state of the table. -Lets create a hudi table via spark-shell. I am going to trigger few batches of inserts. +Let's create a hudi table via `spark-shell` and trigger a batch of inserts. 
```scala import org.apache.hudi.QuickstartUtils._ @@ -42,7 +42,6 @@ val tableName = "hudi_trips_cow" val basePath = "file:///tmp/hudi_trips_cow" val dataGen = new DataGenerator -// spark-shell val inserts = convertToStringList(dataGen.generateInserts(10)) val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) df.write.format("hudi"). @@ -55,22 +54,23 @@ df.write.format("hudi"). save(basePath) ``` -Each batch inserst 10 records. Repeating for 4 more batches. +Let's add four more batches of inserts. ```scala - -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). - option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 4) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` -Total record count should be 50. +Total record count should be 50. ```scala val tripsSnapshotDF = spark. read. @@ -78,15 +78,22 @@ val tripsSnapshotDF = spark. 
load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 50| - +--------------------------+ ++--------------------------+ ``` -Let's take a look at the timeline after 5 batch of inserts. + +:::danger Important: +If you're facing `java.lang.IllegalArgumentException: For input string: "null"` exception, it means that you may need to +manually set the `LEGACY_PARQUET_NANOS_AS_LONG` to `false` i.e. add `--conf 'spark.hadoop.spark.sql.legacy.parquet.nanosAsLong=false'` +to your spark configuration while starting the spark session. For more information, read [here](https://github.com/apache/hudi/issues/8061). +::: + +Let's take a look at the timeline after 5 batches of inserts. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 128 @@ -109,7 +116,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit ``` -Let's trigger a savepoint as of the latest commit. Savepoint can only be done via hudi-cli. +Let's trigger a savepoint as of the latest commit. Savepoint can only be done via `hudi-cli`. ```sh ./hudi-cli.sh @@ -120,7 +127,11 @@ set --conf SPARK_HOME=<SPARK_HOME> savepoint create --commit 20220128160245447 --sparkMaster local[2] ``` -Let's check the timeline after savepoint. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest commit in your table. +::: + +Let's check the timeline after savepoint. 
```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 136 @@ -145,21 +156,23 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint ``` -You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot. +You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot. + +Now, let's continue adding three more batches of inserts. -Now, lets continue adding few more batches of inserts. -Repeat below commands for 3 times. ```scala -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). - option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 3) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` Total record count will be 80 since we have done 8 batches in total. (5 until savepoint and 3 after savepoint) @@ -170,18 +183,18 @@ val tripsSnapshotDF = spark. 
load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 80| - +--------------------------+ ++--------------------------+ ``` -Let's say something bad happened and you want to restore your table to a older snapshot. As we called out earlier, we can -trigger restore only from hudi-cli. And do remember to bring down all of your writer processes while doing a restore. +Let's say something bad happened, and you want to restore your table to an older snapshot. As we called out earlier, we can +trigger restore only from `hudi-cli`. And do remember to bring down all of your writer processes while doing a restore. -Lets checkout timeline once, before we trigger the restore. +Let's checkout timeline once, before we trigger the restore. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 208 @@ -215,8 +228,8 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160630785.commit ``` -If you are continuing in the same hudi-cli session, you can just execute "refresh" so that table state gets refreshed to -its latest state. If not, connect to the table again. +If you are continuing in the same `hudi-cli` session, you can just execute `refresh` so that table state gets refreshed to +its latest state. If not, connect to the table again. ```shell ./hudi-cli.sh @@ -233,8 +246,12 @@ savepoints show savepoint rollback --savepoint 20220128160245447 --sparkMaster local[2] ``` -Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have -been deleted. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest savepoint in your table. 
+::: + +Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have +been deleted. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 152 @@ -261,7 +278,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4152 Jan 28 16:07 20220128160732437.restore ``` -Lets check the total record count in the table. Should match the records we had, just before we triggered the savepoint. +Let's check the total record count in the table. Should match the records we had, just before we triggered the savepoint. ```scala val tripsSnapshotDF = spark. read. @@ -269,28 +286,17 @@ val tripsSnapshotDF = spark. load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 50| - +--------------------------+ ++--------------------------+ ``` -As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint -at regular cadence and keep deleting older savepoints when new ones are created. Hudi-cli has a command "savepoint delete" -to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users -should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen. +As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint +at regular cadence and keep deleting older savepoints when new ones are created. `hudi-cli` has a command `savepoint delete` +to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. 
And so users +should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen. Note: Savepoint and restore for MOR table is available only from 0.11. - - - - - - - - - - - diff --git a/website/versioned_docs/version-0.12.1/disaster_recovery.md b/website/versioned_docs/version-0.12.1/disaster_recovery.md index c2f53bc8cd7..0f7c198d99b 100644 --- a/website/versioned_docs/version-0.12.1/disaster_recovery.md +++ b/website/versioned_docs/version-0.12.1/disaster_recovery.md @@ -3,32 +3,32 @@ title: Disaster Recovery toc: true --- -Disaster Recovery is very much mission critical for any software. Especially when it comes to data systems, the impact could be very serious +Disaster Recovery is very much mission-critical for any software. Especially when it comes to data systems, the impact could be very serious leading to delay in business decisions or even wrong business decisions at times. Apache Hudi has two operations to assist you in recovering -data from a previous state: "savepoint" and "restore". +data from a previous state: `savepoint` and `restore`. ## Savepoint -As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this -savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed. -On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous -to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that -we can restore it later when in need. +As the name suggest, `savepoint` saves the table as of the commit time, so that it lets you restore the table to this +savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed. +On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. 
In simpler terms, this is synonymous +to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that +we can restore it later when in need. ## Restore -This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care +This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care should be taken before doing a restore. Hudi will delete all data files and commit files (timeline files) greater than the savepoint commit to which the table is being restored. You should pause all writes to the table when performing -a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries -will be hitting latest files which has high possibility of getting deleted with restore. +a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries +will be hitting latest files which has high possibility of getting deleted with restore. ## Runbook -Savepoint and restore can only be triggered from hudi-cli. Lets walk through an example of how one can take savepoint -and later restore the state of the table. +Savepoint and restore can only be triggered from `hudi-cli`. Let's walk through an example of how one can take savepoint +and later restore the state of the table. -Lets create a hudi table via spark-shell. I am going to trigger few batches of inserts. +Let's create a hudi table via `spark-shell` and trigger a batch of inserts. 
```scala import org.apache.hudi.QuickstartUtils._ @@ -42,7 +42,6 @@ val tableName = "hudi_trips_cow" val basePath = "file:///tmp/hudi_trips_cow" val dataGen = new DataGenerator -// spark-shell val inserts = convertToStringList(dataGen.generateInserts(10)) val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) df.write.format("hudi"). @@ -55,22 +54,23 @@ df.write.format("hudi"). save(basePath) ``` -Each batch inserst 10 records. Repeating for 4 more batches. +Let's add four more batches of inserts. ```scala - -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). - option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 4) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` -Total record count should be 50. +Total record count should be 50. ```scala val tripsSnapshotDF = spark. read. @@ -78,15 +78,22 @@ val tripsSnapshotDF = spark. 
load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 50| - +--------------------------+ ++--------------------------+ ``` -Let's take a look at the timeline after 5 batch of inserts. + +:::danger Important: +If you're facing `java.lang.IllegalArgumentException: For input string: "null"` exception, it means that you may need to +manually set the `LEGACY_PARQUET_NANOS_AS_LONG` to `false` i.e. add `--conf 'spark.hadoop.spark.sql.legacy.parquet.nanosAsLong=false'` +to your spark configuration while starting the spark session. For more information, read [here](https://github.com/apache/hudi/issues/8061). +::: + +Let's take a look at the timeline after 5 batches of inserts. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 128 @@ -109,7 +116,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit ``` -Let's trigger a savepoint as of the latest commit. Savepoint can only be done via hudi-cli. +Let's trigger a savepoint as of the latest commit. Savepoint can only be done via `hudi-cli`. ```sh ./hudi-cli.sh @@ -120,7 +127,11 @@ set --conf SPARK_HOME=<SPARK_HOME> savepoint create --commit 20220128160245447 --sparkMaster local[2] ``` -Let's check the timeline after savepoint. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest commit in your table. +::: + +Let's check the timeline after savepoint. 
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 136
@@ -145,21 +156,23 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint
 ```

-You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot.
+You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot.
+
+Now, let's continue adding three more batches of inserts.

-Now, lets continue adding few more batches of inserts.
-Repeat below commands for 3 times.
 ```scala
-val inserts = convertToStringList(dataGen.generateInserts(10))
-val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
-df.write.format("hudi").
-  options(getQuickstartWriteConfigs).
-  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
-  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
-  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
-  option(TABLE_NAME, tableName).
-  mode(Append).
-  save(basePath)
+for (_ <- 1 to 3) {
+  val inserts = convertToStringList(dataGen.generateInserts(10))
+  val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+  df.write.format("hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Append).
+    save(basePath)
+}
 ```

 Total record count will be 80 since we have done 8 batches in total. (5 until savepoint and 3 after savepoint)
@@ -170,18 +183,18 @@ val tripsSnapshotDF = spark.
   load(basePath)
 tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

-spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show()
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
 +--------------------------+
 |count(partitionpath, uuid)|
- +--------------------------+
++--------------------------+
 |                        80|
- +--------------------------+
++--------------------------+
 ```

-Let's say something bad happened and you want to restore your table to a older snapshot. As we called out earlier, we can
-trigger restore only from hudi-cli. And do remember to bring down all of your writer processes while doing a restore.
+Let's say something bad happened, and you want to restore your table to an older snapshot. As we called out earlier, we can
+trigger restore only from `hudi-cli`. And do remember to bring down all of your writer processes while doing a restore.

-Lets checkout timeline once, before we trigger the restore.
+Let's checkout timeline once, before we trigger the restore.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 208
@@ -215,8 +228,8 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160630785.commit
 ```

-If you are continuing in the same hudi-cli session, you can just execute "refresh" so that table state gets refreshed to
-its latest state. If not, connect to the table again.
+If you are continuing in the same `hudi-cli` session, you can just execute `refresh` so that table state gets refreshed to
+its latest state. If not, connect to the table again.
 ```shell
 ./hudi-cli.sh

@@ -233,8 +246,12 @@ savepoints show
 savepoint rollback --savepoint 20220128160245447 --sparkMaster local[2]
 ```

-Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have
-been deleted.
+:::note NOTE:
+Make sure you replace 20220128160245447 with the latest savepoint in your table.
+:::
+
+Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have
+been deleted.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 152
@@ -261,7 +278,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 4152 Jan 28 16:07 20220128160732437.restore
 ```

-Lets check the total record count in the table. Should match the records we had, just before we triggered the savepoint.
+Let's check the total record count in the table. Should match the records we had, just before we triggered the savepoint.
 ```scala
 val tripsSnapshotDF = spark.
   read.
@@ -269,28 +286,17 @@ val tripsSnapshotDF = spark.
   load(basePath)
 tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

-spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show()
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
 +--------------------------+
 |count(partitionpath, uuid)|
- +--------------------------+
++--------------------------+
 |                        50|
- +--------------------------+
++--------------------------+
 ```

-As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint
-at regular cadence and keep deleting older savepoints when new ones are created. Hudi-cli has a command "savepoint delete"
-to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users
-should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen.
+As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint
+at regular cadence and keep deleting older savepoints when new ones are created. `hudi-cli` has a command `savepoint delete`
+to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users
+should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen.

 Note: Savepoint and restore for MOR table is available only from 0.11.
-
-
-
-
-
-
-
-
-
-
-
diff --git a/website/versioned_docs/version-0.12.2/disaster_recovery.md b/website/versioned_docs/version-0.12.2/disaster_recovery.md
index c2f53bc8cd7..0f7c198d99b 100644
--- a/website/versioned_docs/version-0.12.2/disaster_recovery.md
+++ b/website/versioned_docs/version-0.12.2/disaster_recovery.md
@@ -3,32 +3,32 @@ title: Disaster Recovery
 toc: true
 ---

-Disaster Recovery is very much mission critical for any software. Especially when it comes to data systems, the impact could be very serious
+Disaster Recovery is very much mission-critical for any software. Especially when it comes to data systems, the impact could be very serious
 leading to delay in business decisions or even wrong business decisions at times. Apache Hudi has two operations to assist you in recovering
-data from a previous state: "savepoint" and "restore".
+data from a previous state: `savepoint` and `restore`.

 ## Savepoint

-As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this
-savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed.
-On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous
-to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that
-we can restore it later when in need.
+As the name suggest, `savepoint` saves the table as of the commit time, so that it lets you restore the table to this
+savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed.
+On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous
+to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that
+we can restore it later when in need.

 ## Restore

-This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care
+This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care
 should be taken before doing a restore. Hudi will delete all data files and commit files (timeline files) greater than the
 savepoint commit to which the table is being restored. You should pause all writes to the table when performing
-a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries
-will be hitting latest files which has high possibility of getting deleted with restore.
+a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries
+will be hitting latest files which has high possibility of getting deleted with restore.

 ## Runbook

-Savepoint and restore can only be triggered from hudi-cli. Lets walk through an example of how one can take savepoint
-and later restore the state of the table.
+Savepoint and restore can only be triggered from `hudi-cli`. Let's walk through an example of how one can take savepoint
+and later restore the state of the table.

-Lets create a hudi table via spark-shell. I am going to trigger few batches of inserts.
+Let's create a hudi table via `spark-shell` and trigger a batch of inserts.
 ```scala
 import org.apache.hudi.QuickstartUtils._
@@ -42,7 +42,6 @@ val tableName = "hudi_trips_cow"
 val basePath = "file:///tmp/hudi_trips_cow"
 val dataGen = new DataGenerator

-// spark-shell
 val inserts = convertToStringList(dataGen.generateInserts(10))
 val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
 df.write.format("hudi").
@@ -55,22 +54,23 @@ df.write.format("hudi").
   save(basePath)
 ```

-Each batch inserst 10 records. Repeating for 4 more batches.
+Let's add four more batches of inserts.
 ```scala
-
-val inserts = convertToStringList(dataGen.generateInserts(10))
-val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
-df.write.format("hudi").
-  options(getQuickstartWriteConfigs).
-  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
-  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
-  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
-  option(TABLE_NAME, tableName).
-  mode(Append).
-  save(basePath)
+for (_ <- 1 to 4) {
+  val inserts = convertToStringList(dataGen.generateInserts(10))
+  val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+  df.write.format("hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Append).
+    save(basePath)
+}
 ```

-Total record count should be 50.
+Total record count should be 50.
 ```scala
 val tripsSnapshotDF = spark.
   read.
@@ -78,15 +78,22 @@ val tripsSnapshotDF = spark.
   load(basePath)
 tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

-spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show()
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
 +--------------------------+
 |count(partitionpath, uuid)|
- +--------------------------+
++--------------------------+
 |                        50|
- +--------------------------+
++--------------------------+
 ```

-Let's take a look at the timeline after 5 batch of inserts.
+
+:::danger Important:
+If you're facing `java.lang.IllegalArgumentException: For input string: "null"` exception, it means that you may need to
+manually set the `LEGACY_PARQUET_NANOS_AS_LONG` to `false` i.e. add `--conf 'spark.hadoop.spark.sql.legacy.parquet.nanosAsLong=false'`
+to your spark configuration while starting the spark session. For more information, read [here](https://github.com/apache/hudi/issues/8061).
+:::
+
+Let's take a look at the timeline after 5 batches of inserts.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 128
@@ -109,7 +116,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit
 ```

-Let's trigger a savepoint as of the latest commit. Savepoint can only be done via hudi-cli.
+Let's trigger a savepoint as of the latest commit. Savepoint can only be done via `hudi-cli`.
 ```sh
 ./hudi-cli.sh

@@ -120,7 +127,11 @@ set --conf SPARK_HOME=<SPARK_HOME>
 savepoint create --commit 20220128160245447 --sparkMaster local[2]
 ```

-Let's check the timeline after savepoint.
+:::note NOTE:
+Make sure you replace 20220128160245447 with the latest commit in your table.
+:::
+
+Let's check the timeline after savepoint.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 136
@@ -145,21 +156,23 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint
 ```

-You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot.
+You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot.
+
+Now, let's continue adding three more batches of inserts.

-Now, lets continue adding few more batches of inserts.
-Repeat below commands for 3 times.
 ```scala
-val inserts = convertToStringList(dataGen.generateInserts(10))
-val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
-df.write.format("hudi").
-  options(getQuickstartWriteConfigs).
-  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
-  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
-  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
-  option(TABLE_NAME, tableName).
-  mode(Append).
-  save(basePath)
+for (_ <- 1 to 3) {
+  val inserts = convertToStringList(dataGen.generateInserts(10))
+  val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+  df.write.format("hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Append).
+    save(basePath)
+}
 ```

 Total record count will be 80 since we have done 8 batches in total. (5 until savepoint and 3 after savepoint)
@@ -170,18 +183,18 @@ val tripsSnapshotDF = spark.
   load(basePath)
 tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

-spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show()
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
 +--------------------------+
 |count(partitionpath, uuid)|
- +--------------------------+
++--------------------------+
 |                        80|
- +--------------------------+
++--------------------------+
 ```

-Let's say something bad happened and you want to restore your table to a older snapshot. As we called out earlier, we can
-trigger restore only from hudi-cli. And do remember to bring down all of your writer processes while doing a restore.
+Let's say something bad happened, and you want to restore your table to an older snapshot. As we called out earlier, we can
+trigger restore only from `hudi-cli`. And do remember to bring down all of your writer processes while doing a restore.

-Lets checkout timeline once, before we trigger the restore.
+Let's checkout timeline once, before we trigger the restore.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 208
@@ -215,8 +228,8 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160630785.commit
 ```

-If you are continuing in the same hudi-cli session, you can just execute "refresh" so that table state gets refreshed to
-its latest state. If not, connect to the table again.
+If you are continuing in the same `hudi-cli` session, you can just execute `refresh` so that table state gets refreshed to
+its latest state. If not, connect to the table again.
 ```shell
 ./hudi-cli.sh

@@ -233,8 +246,12 @@ savepoints show
 savepoint rollback --savepoint 20220128160245447 --sparkMaster local[2]
 ```

-Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have
-been deleted.
+:::note NOTE:
+Make sure you replace 20220128160245447 with the latest savepoint in your table.
+:::
+
+Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have
+been deleted.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 152
@@ -261,7 +278,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 4152 Jan 28 16:07 20220128160732437.restore
 ```

-Lets check the total record count in the table. Should match the records we had, just before we triggered the savepoint.
+Let's check the total record count in the table. Should match the records we had, just before we triggered the savepoint.
 ```scala
 val tripsSnapshotDF = spark.
   read.
@@ -269,28 +286,17 @@ val tripsSnapshotDF = spark.
   load(basePath)
 tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

-spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show()
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
 +--------------------------+
 |count(partitionpath, uuid)|
- +--------------------------+
++--------------------------+
 |                        50|
- +--------------------------+
++--------------------------+
 ```

-As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint
-at regular cadence and keep deleting older savepoints when new ones are created. Hudi-cli has a command "savepoint delete"
-to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users
-should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen.
+As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint
+at regular cadence and keep deleting older savepoints when new ones are created. `hudi-cli` has a command `savepoint delete`
+to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users
+should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen.

 Note: Savepoint and restore for MOR table is available only from 0.11.
-
-
-
-
-
-
-
-
-
-
-
diff --git a/website/versioned_docs/version-0.12.3/disaster_recovery.md b/website/versioned_docs/version-0.12.3/disaster_recovery.md
index c2f53bc8cd7..0f7c198d99b 100644
--- a/website/versioned_docs/version-0.12.3/disaster_recovery.md
+++ b/website/versioned_docs/version-0.12.3/disaster_recovery.md
@@ -3,32 +3,32 @@ title: Disaster Recovery
 toc: true
 ---

-Disaster Recovery is very much mission critical for any software. Especially when it comes to data systems, the impact could be very serious
+Disaster Recovery is very much mission-critical for any software. Especially when it comes to data systems, the impact could be very serious
 leading to delay in business decisions or even wrong business decisions at times. Apache Hudi has two operations to assist you in recovering
-data from a previous state: "savepoint" and "restore".
+data from a previous state: `savepoint` and `restore`.

 ## Savepoint

-As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this
-savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed.
-On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous
-to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that
-we can restore it later when in need.
+As the name suggest, `savepoint` saves the table as of the commit time, so that it lets you restore the table to this
+savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed.
+On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous
+to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that
+we can restore it later when in need.

 ## Restore

-This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care
+This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care
 should be taken before doing a restore. Hudi will delete all data files and commit files (timeline files) greater than the
 savepoint commit to which the table is being restored. You should pause all writes to the table when performing
-a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries
-will be hitting latest files which has high possibility of getting deleted with restore.
+a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries
+will be hitting latest files which has high possibility of getting deleted with restore.

 ## Runbook

-Savepoint and restore can only be triggered from hudi-cli. Lets walk through an example of how one can take savepoint
-and later restore the state of the table.
+Savepoint and restore can only be triggered from `hudi-cli`. Let's walk through an example of how one can take savepoint
+and later restore the state of the table.

-Lets create a hudi table via spark-shell. I am going to trigger few batches of inserts.
+Let's create a hudi table via `spark-shell` and trigger a batch of inserts.
 ```scala
 import org.apache.hudi.QuickstartUtils._
@@ -42,7 +42,6 @@ val tableName = "hudi_trips_cow"
 val basePath = "file:///tmp/hudi_trips_cow"
 val dataGen = new DataGenerator

-// spark-shell
 val inserts = convertToStringList(dataGen.generateInserts(10))
 val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
 df.write.format("hudi").
@@ -55,22 +54,23 @@ df.write.format("hudi").
   save(basePath)
 ```

-Each batch inserst 10 records. Repeating for 4 more batches.
+Let's add four more batches of inserts.
 ```scala
-
-val inserts = convertToStringList(dataGen.generateInserts(10))
-val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
-df.write.format("hudi").
-  options(getQuickstartWriteConfigs).
-  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
-  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
-  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
-  option(TABLE_NAME, tableName).
-  mode(Append).
-  save(basePath)
+for (_ <- 1 to 4) {
+  val inserts = convertToStringList(dataGen.generateInserts(10))
+  val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+  df.write.format("hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Append).
+    save(basePath)
+}
 ```

-Total record count should be 50.
+Total record count should be 50.
 ```scala
 val tripsSnapshotDF = spark.
   read.
@@ -78,15 +78,22 @@ val tripsSnapshotDF = spark.
   load(basePath)
 tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

-spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show()
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
 +--------------------------+
 |count(partitionpath, uuid)|
- +--------------------------+
++--------------------------+
 |                        50|
- +--------------------------+
++--------------------------+
 ```

-Let's take a look at the timeline after 5 batch of inserts.
+
+:::danger Important:
+If you're facing `java.lang.IllegalArgumentException: For input string: "null"` exception, it means that you may need to
+manually set the `LEGACY_PARQUET_NANOS_AS_LONG` to `false` i.e. add `--conf 'spark.hadoop.spark.sql.legacy.parquet.nanosAsLong=false'`
+to your spark configuration while starting the spark session. For more information, read [here](https://github.com/apache/hudi/issues/8061).
+:::
+
+Let's take a look at the timeline after 5 batches of inserts.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 128
@@ -109,7 +116,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit
 ```

-Let's trigger a savepoint as of the latest commit. Savepoint can only be done via hudi-cli.
+Let's trigger a savepoint as of the latest commit. Savepoint can only be done via `hudi-cli`.
 ```sh
 ./hudi-cli.sh

@@ -120,7 +127,11 @@ set --conf SPARK_HOME=<SPARK_HOME>
 savepoint create --commit 20220128160245447 --sparkMaster local[2]
 ```

-Let's check the timeline after savepoint.
+:::note NOTE:
+Make sure you replace 20220128160245447 with the latest commit in your table.
+:::
+
+Let's check the timeline after savepoint.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 136
@@ -145,21 +156,23 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint
 ```

-You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot.
+You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot.
+
+Now, let's continue adding three more batches of inserts.

-Now, lets continue adding few more batches of inserts.
-Repeat below commands for 3 times.
 ```scala
-val inserts = convertToStringList(dataGen.generateInserts(10))
-val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
-df.write.format("hudi").
-  options(getQuickstartWriteConfigs).
-  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
-  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
-  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
-  option(TABLE_NAME, tableName).
-  mode(Append).
-  save(basePath)
+for (_ <- 1 to 3) {
+  val inserts = convertToStringList(dataGen.generateInserts(10))
+  val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+  df.write.format("hudi").
+    options(getQuickstartWriteConfigs).
+    option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+    option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+    option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+    option(TABLE_NAME, tableName).
+    mode(Append).
+    save(basePath)
+}
 ```

 Total record count will be 80 since we have done 8 batches in total. (5 until savepoint and 3 after savepoint)
@@ -170,18 +183,18 @@ val tripsSnapshotDF = spark.
   load(basePath)
 tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

-spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show()
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
 +--------------------------+
 |count(partitionpath, uuid)|
- +--------------------------+
++--------------------------+
 |                        80|
- +--------------------------+
++--------------------------+
 ```

-Let's say something bad happened and you want to restore your table to a older snapshot. As we called out earlier, we can
-trigger restore only from hudi-cli. And do remember to bring down all of your writer processes while doing a restore.
+Let's say something bad happened, and you want to restore your table to an older snapshot. As we called out earlier, we can
+trigger restore only from `hudi-cli`. And do remember to bring down all of your writer processes while doing a restore.

-Lets checkout timeline once, before we trigger the restore.
+Let's checkout timeline once, before we trigger the restore.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 208
@@ -215,8 +228,8 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160630785.commit
 ```

-If you are continuing in the same hudi-cli session, you can just execute "refresh" so that table state gets refreshed to
-its latest state. If not, connect to the table again.
+If you are continuing in the same `hudi-cli` session, you can just execute `refresh` so that table state gets refreshed to
+its latest state. If not, connect to the table again.
 ```shell
 ./hudi-cli.sh

@@ -233,8 +246,12 @@ savepoints show
 savepoint rollback --savepoint 20220128160245447 --sparkMaster local[2]
 ```

-Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have
-been deleted.
+:::note NOTE:
+Make sure you replace 20220128160245447 with the latest savepoint in your table.
+:::
+
+Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have
+been deleted.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 152
@@ -261,7 +278,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 4152 Jan 28 16:07 20220128160732437.restore
 ```

-Lets check the total record count in the table. Should match the records we had, just before we triggered the savepoint.
+Let's check the total record count in the table. Should match the records we had, just before we triggered the savepoint.
 ```scala
 val tripsSnapshotDF = spark.
   read.
@@ -269,28 +286,17 @@ val tripsSnapshotDF = spark.
   load(basePath)
 tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

-spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show()
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
 +--------------------------+
 |count(partitionpath, uuid)|
- +--------------------------+
++--------------------------+
 |                        50|
- +--------------------------+
++--------------------------+
 ```

-As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint
-at regular cadence and keep deleting older savepoints when new ones are created. Hudi-cli has a command "savepoint delete"
-to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users
-should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen.
+As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint
+at regular cadence and keep deleting older savepoints when new ones are created. `hudi-cli` has a command `savepoint delete`
+to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users
+should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen.

 Note: Savepoint and restore for MOR table is available only from 0.11.
-
-
-
-
-
-
-
-
-
-
-
diff --git a/website/versioned_docs/version-0.13.0/disaster_recovery.md b/website/versioned_docs/version-0.13.0/disaster_recovery.md
index c2f53bc8cd7..0f7c198d99b 100644
--- a/website/versioned_docs/version-0.13.0/disaster_recovery.md
+++ b/website/versioned_docs/version-0.13.0/disaster_recovery.md
@@ -3,32 +3,32 @@ title: Disaster Recovery
 toc: true
 ---

-Disaster Recovery is very much mission critical for any software. Especially when it comes to data systems, the impact could be very serious
+Disaster Recovery is very much mission-critical for any software. Especially when it comes to data systems, the impact could be very serious
 leading to delay in business decisions or even wrong business decisions at times. Apache Hudi has two operations to assist you in recovering
-data from a previous state: "savepoint" and "restore".
+data from a previous state: `savepoint` and `restore`.

 ## Savepoint

-As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this
-savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed.
-On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous
-to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that
-we can restore it later when in need.
+As the name suggest, `savepoint` saves the table as of the commit time, so that it lets you restore the table to this
+savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed.
+On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous
+to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that
+we can restore it later when in need.

 ## Restore

-This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care
+This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care
 should be taken before doing a restore. Hudi will delete all data files and commit files (timeline files) greater than the
 savepoint commit to which the table is being restored. You should pause all writes to the table when performing
-a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries
-will be hitting latest files which has high possibility of getting deleted with restore.
+a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries
+will be hitting latest files which has high possibility of getting deleted with restore.

 ## Runbook

-Savepoint and restore can only be triggered from hudi-cli. Lets walk through an example of how one can take savepoint
-and later restore the state of the table.
+Savepoint and restore can only be triggered from `hudi-cli`. Let's walk through an example of how one can take savepoint
+and later restore the state of the table.

-Lets create a hudi table via spark-shell. I am going to trigger few batches of inserts.
+Let's create a hudi table via `spark-shell` and trigger a batch of inserts.
```scala import org.apache.hudi.QuickstartUtils._ @@ -42,7 +42,6 @@ val tableName = "hudi_trips_cow" val basePath = "file:///tmp/hudi_trips_cow" val dataGen = new DataGenerator -// spark-shell val inserts = convertToStringList(dataGen.generateInserts(10)) val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) df.write.format("hudi"). @@ -55,22 +54,23 @@ df.write.format("hudi"). save(basePath) ``` -Each batch inserst 10 records. Repeating for 4 more batches. +Let's add four more batches of inserts. ```scala - -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). - option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 4) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` -Total record count should be 50. +Total record count should be 50. ```scala val tripsSnapshotDF = spark. read. @@ -78,15 +78,22 @@ val tripsSnapshotDF = spark. 
load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 50| - +--------------------------+ ++--------------------------+ ``` -Let's take a look at the timeline after 5 batch of inserts. + +:::danger Important: +If you're facing `java.lang.IllegalArgumentException: For input string: "null"` exception, it means that you may need to +manually set the `LEGACY_PARQUET_NANOS_AS_LONG` to `false` i.e. add `--conf 'spark.hadoop.spark.sql.legacy.parquet.nanosAsLong=false'` +to your spark configuration while starting the spark session. For more information, read [here](https://github.com/apache/hudi/issues/8061). +::: + +Let's take a look at the timeline after 5 batches of inserts. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 128 @@ -109,7 +116,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit ``` -Let's trigger a savepoint as of the latest commit. Savepoint can only be done via hudi-cli. +Let's trigger a savepoint as of the latest commit. Savepoint can only be done via `hudi-cli`. ```sh ./hudi-cli.sh @@ -120,7 +127,11 @@ set --conf SPARK_HOME=<SPARK_HOME> savepoint create --commit 20220128160245447 --sparkMaster local[2] ``` -Let's check the timeline after savepoint. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest commit in your table. +::: + +Let's check the timeline after savepoint. 
```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 136 @@ -145,21 +156,23 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint ``` -You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot. +You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot. + +Now, let's continue adding three more batches of inserts. -Now, lets continue adding few more batches of inserts. -Repeat below commands for 3 times. ```scala -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). - option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 3) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` Total record count will be 80 since we have done 8 batches in total. (5 until savepoint and 3 after savepoint) @@ -170,18 +183,18 @@ val tripsSnapshotDF = spark. 
load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 80| - +--------------------------+ ++--------------------------+ ``` -Let's say something bad happened and you want to restore your table to a older snapshot. As we called out earlier, we can -trigger restore only from hudi-cli. And do remember to bring down all of your writer processes while doing a restore. +Let's say something bad happened, and you want to restore your table to an older snapshot. As we called out earlier, we can +trigger restore only from `hudi-cli`. And do remember to bring down all of your writer processes while doing a restore. -Lets checkout timeline once, before we trigger the restore. +Let's checkout timeline once, before we trigger the restore. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 208 @@ -215,8 +228,8 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160630785.commit ``` -If you are continuing in the same hudi-cli session, you can just execute "refresh" so that table state gets refreshed to -its latest state. If not, connect to the table again. +If you are continuing in the same `hudi-cli` session, you can just execute `refresh` so that table state gets refreshed to +its latest state. If not, connect to the table again. ```shell ./hudi-cli.sh @@ -233,8 +246,12 @@ savepoints show savepoint rollback --savepoint 20220128160245447 --sparkMaster local[2] ``` -Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have -been deleted. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest savepoint in your table. 
+::: + +Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have +been deleted. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 152 @@ -261,7 +278,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4152 Jan 28 16:07 20220128160732437.restore ``` -Lets check the total record count in the table. Should match the records we had, just before we triggered the savepoint. +Let's check the total record count in the table. Should match the records we had, just before we triggered the savepoint. ```scala val tripsSnapshotDF = spark. read. @@ -269,28 +286,17 @@ val tripsSnapshotDF = spark. load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 50| - +--------------------------+ ++--------------------------+ ``` -As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint -at regular cadence and keep deleting older savepoints when new ones are created. Hudi-cli has a command "savepoint delete" -to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users -should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen. +As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint +at regular cadence and keep deleting older savepoints when new ones are created. `hudi-cli` has a command `savepoint delete` +to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. 
And so users +should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen. Note: Savepoint and restore for MOR table is available only from 0.11. - - - - - - - - - - - diff --git a/website/versioned_docs/version-0.13.1/disaster_recovery.md b/website/versioned_docs/version-0.13.1/disaster_recovery.md index c2f53bc8cd7..b95085d358b 100644 --- a/website/versioned_docs/version-0.13.1/disaster_recovery.md +++ b/website/versioned_docs/version-0.13.1/disaster_recovery.md @@ -3,32 +3,32 @@ title: Disaster Recovery toc: true --- -Disaster Recovery is very much mission critical for any software. Especially when it comes to data systems, the impact could be very serious +Disaster Recovery is very much mission-critical for any software. Especially when it comes to data systems, the impact could be very serious leading to delay in business decisions or even wrong business decisions at times. Apache Hudi has two operations to assist you in recovering -data from a previous state: "savepoint" and "restore". +data from a previous state: `savepoint` and `restore`. ## Savepoint -As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this -savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed. -On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous -to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that -we can restore it later when in need. +As the name suggest, `savepoint` saves the table as of the commit time, so that it lets you restore the table to this +savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed. +On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. 
In simpler terms, this is synonymous +to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that +we can restore it later when in need. ## Restore -This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care +This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care should be taken before doing a restore. Hudi will delete all data files and commit files (timeline files) greater than the savepoint commit to which the table is being restored. You should pause all writes to the table when performing -a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries -will be hitting latest files which has high possibility of getting deleted with restore. +a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries +will be hitting latest files which has high possibility of getting deleted with restore. ## Runbook -Savepoint and restore can only be triggered from hudi-cli. Lets walk through an example of how one can take savepoint -and later restore the state of the table. +Savepoint and restore can only be triggered from `hudi-cli`. Let's walk through an example of how one can take savepoint +and later restore the state of the table. -Lets create a hudi table via spark-shell. I am going to trigger few batches of inserts. +Let's create a hudi table via `spark-shell` and trigger a batch of inserts. 
```scala import org.apache.hudi.QuickstartUtils._ @@ -42,7 +42,6 @@ val tableName = "hudi_trips_cow" val basePath = "file:///tmp/hudi_trips_cow" val dataGen = new DataGenerator -// spark-shell val inserts = convertToStringList(dataGen.generateInserts(10)) val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) df.write.format("hudi"). @@ -55,22 +54,23 @@ df.write.format("hudi"). save(basePath) ``` -Each batch inserst 10 records. Repeating for 4 more batches. +Let's add four more batches of inserts. ```scala - -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). - option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 4) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` -Total record count should be 50. +Total record count should be 50. ```scala val tripsSnapshotDF = spark. read. @@ -78,15 +78,15 @@ val tripsSnapshotDF = spark. 
load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 50| - +--------------------------+ ++--------------------------+ ``` -Let's take a look at the timeline after 5 batch of inserts. +Let's take a look at the timeline after 5 batches of inserts. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 128 @@ -109,7 +109,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit ``` -Let's trigger a savepoint as of the latest commit. Savepoint can only be done via hudi-cli. +Let's trigger a savepoint as of the latest commit. Savepoint can only be done via `hudi-cli`. ```sh ./hudi-cli.sh @@ -120,7 +120,11 @@ set --conf SPARK_HOME=<SPARK_HOME> savepoint create --commit 20220128160245447 --sparkMaster local[2] ``` -Let's check the timeline after savepoint. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest commit in your table. +::: + +Let's check the timeline after savepoint. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 136 @@ -145,21 +149,23 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint ``` -You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot. +You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot. + +Now, let's continue adding three more batches of inserts. -Now, lets continue adding few more batches of inserts. -Repeat below commands for 3 times. 
```scala -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). - option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 3) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` Total record count will be 80 since we have done 8 batches in total. (5 until savepoint and 3 after savepoint) @@ -170,18 +176,18 @@ val tripsSnapshotDF = spark. load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 80| - +--------------------------+ ++--------------------------+ ``` -Let's say something bad happened and you want to restore your table to a older snapshot. As we called out earlier, we can -trigger restore only from hudi-cli. And do remember to bring down all of your writer processes while doing a restore. +Let's say something bad happened, and you want to restore your table to an older snapshot. As we called out earlier, we can +trigger restore only from `hudi-cli`. And do remember to bring down all of your writer processes while doing a restore. -Lets checkout timeline once, before we trigger the restore. 
+Let's checkout timeline once, before we trigger the restore. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 208 @@ -215,8 +221,8 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160630785.commit ``` -If you are continuing in the same hudi-cli session, you can just execute "refresh" so that table state gets refreshed to -its latest state. If not, connect to the table again. +If you are continuing in the same `hudi-cli` session, you can just execute `refresh` so that table state gets refreshed to +its latest state. If not, connect to the table again. ```shell ./hudi-cli.sh @@ -233,8 +239,12 @@ savepoints show savepoint rollback --savepoint 20220128160245447 --sparkMaster local[2] ``` -Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have -been deleted. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest savepoint in your table. +::: + +Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have +been deleted. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 152 @@ -261,7 +271,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4152 Jan 28 16:07 20220128160732437.restore ``` -Lets check the total record count in the table. Should match the records we had, just before we triggered the savepoint. +Let's check the total record count in the table. Should match the records we had, just before we triggered the savepoint. ```scala val tripsSnapshotDF = spark. read. @@ -269,28 +279,17 @@ val tripsSnapshotDF = spark. 
load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 50| - +--------------------------+ ++--------------------------+ ``` -As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint -at regular cadence and keep deleting older savepoints when new ones are created. Hudi-cli has a command "savepoint delete" -to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users -should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen. +As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint +at regular cadence and keep deleting older savepoints when new ones are created. `hudi-cli` has a command `savepoint delete` +to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users +should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen. Note: Savepoint and restore for MOR table is available only from 0.11. - - - - - - - - - - - diff --git a/website/versioned_docs/version-0.14.0/disaster_recovery.md b/website/versioned_docs/version-0.14.0/disaster_recovery.md index c2f53bc8cd7..b95085d358b 100644 --- a/website/versioned_docs/version-0.14.0/disaster_recovery.md +++ b/website/versioned_docs/version-0.14.0/disaster_recovery.md @@ -3,32 +3,32 @@ title: Disaster Recovery toc: true --- -Disaster Recovery is very much mission critical for any software. 
Especially when it comes to data systems, the impact could be very serious +Disaster Recovery is very much mission-critical for any software. Especially when it comes to data systems, the impact could be very serious leading to delay in business decisions or even wrong business decisions at times. Apache Hudi has two operations to assist you in recovering -data from a previous state: "savepoint" and "restore". +data from a previous state: `savepoint` and `restore`. ## Savepoint -As the name suggest, "savepoint" saves the table as of the commit time, so that it lets you restore the table to this -savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed. -On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous -to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that -we can restore it later when in need. +As the name suggest, `savepoint` saves the table as of the commit time, so that it lets you restore the table to this +savepoint at a later point in time if need be. Care is taken to ensure cleaner will not clean up any files that are savepointed. +On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. In simpler terms, this is synonymous +to taking a backup, just that we don't make a new copy of the table, but just save the state of the table elegantly so that +we can restore it later when in need. ## Restore -This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care +This operation lets you restore your table to one of the savepoint commit. This operation cannot be undone (or reversed) and so care should be taken before doing a restore. 
Hudi will delete all data files and commit files (timeline files) greater than the savepoint commit to which the table is being restored. You should pause all writes to the table when performing -a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries -will be hitting latest files which has high possibility of getting deleted with restore. +a restore since they are likely to fail while the restore is in progress. Also, reads could also fail since snapshot queries +will be hitting latest files which has high possibility of getting deleted with restore. ## Runbook -Savepoint and restore can only be triggered from hudi-cli. Lets walk through an example of how one can take savepoint -and later restore the state of the table. +Savepoint and restore can only be triggered from `hudi-cli`. Let's walk through an example of how one can take savepoint +and later restore the state of the table. -Lets create a hudi table via spark-shell. I am going to trigger few batches of inserts. +Let's create a hudi table via `spark-shell` and trigger a batch of inserts. ```scala import org.apache.hudi.QuickstartUtils._ @@ -42,7 +42,6 @@ val tableName = "hudi_trips_cow" val basePath = "file:///tmp/hudi_trips_cow" val dataGen = new DataGenerator -// spark-shell val inserts = convertToStringList(dataGen.generateInserts(10)) val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) df.write.format("hudi"). @@ -55,22 +54,23 @@ df.write.format("hudi"). save(basePath) ``` -Each batch inserst 10 records. Repeating for 4 more batches. +Let's add four more batches of inserts. ```scala - -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). 
- option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 4) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` -Total record count should be 50. +Total record count should be 50. ```scala val tripsSnapshotDF = spark. read. @@ -78,15 +78,15 @@ val tripsSnapshotDF = spark. load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 50| - +--------------------------+ ++--------------------------+ ``` -Let's take a look at the timeline after 5 batch of inserts. +Let's take a look at the timeline after 5 batches of inserts. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 128 @@ -109,7 +109,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:02 20220128160245447.commit ``` -Let's trigger a savepoint as of the latest commit. Savepoint can only be done via hudi-cli. +Let's trigger a savepoint as of the latest commit. Savepoint can only be done via `hudi-cli`. ```sh ./hudi-cli.sh @@ -120,7 +120,11 @@ set --conf SPARK_HOME=<SPARK_HOME> savepoint create --commit 20220128160245447 --sparkMaster local[2] ``` -Let's check the timeline after savepoint. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest commit in your table. +::: + +Let's check the timeline after savepoint. 
```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 136 @@ -145,21 +149,23 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 1168 Jan 28 16:05 20220128160245447.savepoint ``` -You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot. +You could notice that savepoint meta files are added which keeps track of the files that are part of the latest table snapshot. + +Now, let's continue adding three more batches of inserts. -Now, lets continue adding few more batches of inserts. -Repeat below commands for 3 times. ```scala -val inserts = convertToStringList(dataGen.generateInserts(10)) -val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) -df.write.format("hudi"). - options(getQuickstartWriteConfigs). - option(PRECOMBINE_FIELD_OPT_KEY, "ts"). - option(RECORDKEY_FIELD_OPT_KEY, "uuid"). - option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). - option(TABLE_NAME, tableName). - mode(Append). - save(basePath) +for (_ <- 1 to 3) { + val inserts = convertToStringList(dataGen.generateInserts(10)) + val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)) + df.write.format("hudi"). + options(getQuickstartWriteConfigs). + option(PRECOMBINE_FIELD_OPT_KEY, "ts"). + option(RECORDKEY_FIELD_OPT_KEY, "uuid"). + option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). + option(TABLE_NAME, tableName). + mode(Append). + save(basePath) +} ``` Total record count will be 80 since we have done 8 batches in total. (5 until savepoint and 3 after savepoint) @@ -170,18 +176,18 @@ val tripsSnapshotDF = spark. 
load(basePath) tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") -spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show() +spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show() +--------------------------+ |count(partitionpath, uuid)| - +--------------------------+ ++--------------------------+ | 80| - +--------------------------+ ++--------------------------+ ``` -Let's say something bad happened and you want to restore your table to a older snapshot. As we called out earlier, we can -trigger restore only from hudi-cli. And do remember to bring down all of your writer processes while doing a restore. +Let's say something bad happened, and you want to restore your table to an older snapshot. As we called out earlier, we can +trigger restore only from `hudi-cli`. And do remember to bring down all of your writer processes while doing a restore. -Lets checkout timeline once, before we trigger the restore. +Let's checkout timeline once, before we trigger the restore. ```shell ls -ltr /tmp/hudi_trips_cow/.hoodie total 208 @@ -215,8 +221,8 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived -rw-r--r-- 1 nsb wheel 4428 Jan 28 16:06 20220128160630785.commit ``` -If you are continuing in the same hudi-cli session, you can just execute "refresh" so that table state gets refreshed to -its latest state. If not, connect to the table again. +If you are continuing in the same `hudi-cli` session, you can just execute `refresh` so that table state gets refreshed to +its latest state. If not, connect to the table again. ```shell ./hudi-cli.sh @@ -233,8 +239,12 @@ savepoints show savepoint rollback --savepoint 20220128160245447 --sparkMaster local[2] ``` -Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have -been deleted. +:::note NOTE: +Make sure you replace 20220128160245447 with the latest savepoint in your table. 
+:::
+
+Hudi table should have been restored to the savepointed commit 20220128160245447. Both data files and timeline files should have
+been deleted.
 ```shell
 ls -ltr /tmp/hudi_trips_cow/.hoodie
 total 152
@@ -261,7 +271,7 @@ drwxr-xr-x 2 nsb wheel 64 Jan 28 16:00 archived
 -rw-r--r-- 1 nsb wheel 4152 Jan 28 16:07 20220128160732437.restore
 ```

-Lets check the total record count in the table. Should match the records we had, just before we triggered the savepoint.
+Let's check the total record count in the table. Should match the records we had, just before we triggered the savepoint.
 ```scala
 val tripsSnapshotDF = spark.
   read.
@@ -269,28 +279,17 @@ val tripsSnapshotDF = spark.
   load(basePath)
 tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

-spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot ").show()
+spark.sql("select count(partitionpath, uuid) from hudi_trips_snapshot").show()
 +--------------------------+
 |count(partitionpath, uuid)|
- +--------------------------+
++--------------------------+
 |                        50|
- +--------------------------+
++--------------------------+
 ```

-As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint
-at regular cadence and keep deleting older savepoints when new ones are created. Hudi-cli has a command "savepoint delete"
-to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users
-should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen.
+As you could see, entire table state is restored back to the commit which was savepointed. Users can choose to trigger savepoint
+at regular cadence and keep deleting older savepoints when new ones are created. `hudi-cli` has a command `savepoint delete`
+to assist in deleting a savepoint. Please do remember that cleaner may not clean the files that are savepointed. And so users
+should ensure they delete the savepoints from time to time. If not, the storage reclamation may not happen.

 Note: Savepoint and restore for MOR table is available only from 0.11.
-
-
-
-
-
-
-
-
-
-
-
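
The note added in this change asks readers to replace `20220128160245447` with the latest savepoint in their own table. As a rough sketch of one way to look that value up, the helper below lists the `*.savepoint` meta files shown in the `ls` output and prints the newest instant. The function name `latest_savepoint` is hypothetical and not part of Hudi or `hudi-cli`. It assumes the table lives on a local filesystem, as `/tmp/hudi_trips_cow` does in this walkthrough, and that instants are fixed-width timestamps so a plain lexical sort orders them chronologically.

```shell
# Hypothetical helper (not part of Hudi): print the newest savepoint instant
# found under <table path>/.hoodie, or nothing if there is no savepoint.
# Assumes a local filesystem and fixed-width instant timestamps.
latest_savepoint() {
  hoodie_dir="$1/.hoodie"
  for f in "$hoodie_dir"/*.savepoint; do
    [ -e "$f" ] || continue            # the glob matched no files
    name=${f##*/}                      # drop the directory part
    printf '%s\n' "${name%.savepoint}" # drop the .savepoint suffix
  done | sort | tail -n 1
}

# Usage with the table path from this guide.
latest_savepoint /tmp/hudi_trips_cow
```

The printed instant is the value to pass to `savepoint rollback --savepoint`. Running `savepoints show` inside `hudi-cli`, as the guide already does, remains the more direct check.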