This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push: new 3bd7255d40c [DOCS] Updating FAQ around concepts, based on 0.14 release (#9824) 3bd7255d40c is described below commit 3bd7255d40cfc5d4a7b2f21028fd0b60b6af2790 Author: vinoth chandar <vinothchan...@users.noreply.github.com> AuthorDate: Thu Oct 5 11:39:24 2023 -0700 [DOCS] Updating FAQ around concepts, based on 0.14 release (#9824) --- website/docs/faq.md | 32 ++++++++++++++++++++++++---- website/versioned_docs/version-0.14.0/faq.md | 32 ++++++++++++++++++++++++---- 2 files changed, 56 insertions(+), 8 deletions(-) diff --git a/website/docs/faq.md b/website/docs/faq.md index a58d32c6ad9..98b3853d70d 100644 --- a/website/docs/faq.md +++ b/website/docs/faq.md @@ -120,13 +120,37 @@ Hudi provides snapshot isolation between all three types of processes - writers, ### Hudi’s commits are based on transaction start time instead of completed time. Does this cause data loss or inconsistency in case of incremental and time travel queries? -Let’s take a closer look at the scenario here: two commits C1 and C2 (with C2 starting later than C1) start with a later commit (C2) finishing first leaving the inflight transaction of the earlier commit (C1) before the completed write of the later transaction (C2) in Hudi’s timeline. This is not an uncommon scenario, especially with various ingestions needs such as backfilling, deleting, bootstrapping, etc alongside regular writes. When/Whether the first job would commit will depend on [...] - -In these scenarios, it might be tempting to think of data inconsistencies/data loss when using Hudi’s incremental queries. However, Hudi takes special handling in incremental queries to ensure that no data is served beyond the point Hudi sees an inflight instant in its timeline, so no data loss or drop happens. 
In this case, on seeing C1’s inflight commit (publish to timeline is atomic), C2 data (which is > C1 in the timeline) is not served until C1 inflight transitions to a terminal sta [...] +Let’s take a closer look at the scenario here: two commits C1 and C2 (with C2 starting later than C1) start with a later commit (C2) finishing first leaving the inflight transaction of the earlier commit (C1) +before the completed write of the later transaction (C2) in Hudi’s timeline. This is not an uncommon scenario, especially with various ingestions needs such as backfilling, deleting, bootstrapping, etc +alongside regular writes. When/Whether the first job would commit will depend on factors such as conflicts between concurrent commits, inflight compactions, other actions on the table’s timeline etc. +If the first job fails for some reason, Hudi will abort the earlier commit inflight (c1) and the writer has to retry next time with a new instant time > c2 much similar to other OCC implementations. +Firstly, for snapshot queries the order of commits should not matter at all, since any incomplete writes on the active timeline is ignored by queries and cause no side-effects. + +In these scenarios, it might be tempting to think of data inconsistencies/data loss when using Hudi’s incremental queries. However, Hudi takes special handling +(examples [1](https://github.com/apache/hudi/blob/aea5bb6f0ab824247f5e3498762ad94f643a2cb6/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java#L76), +[2](https://github.com/apache/hudi/blame/7a6543958368540d221ddc18e0c12b8d526b6859/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieInputFormatUtils.java#L173)) in incremental queries to ensure that no data +is served beyond the point there is an inflight instant in its timeline, so no data loss or drop happens. 
This detection is made possible because Hudi writers first request a transaction on the timeline, before planning/executing
+the write, as explained in the [timeline](https://hudi.apache.org/docs/timeline#states) section.
+
+In this case, on seeing C1’s inflight commit (publishing to the timeline is atomic), C2’s data (which is > C1 in the timeline) is not served until the C1 inflight instant transitions to a terminal state, such as completed or marked as failed.
+This [test](https://github.com/apache/hudi/blob/master/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestHoodieIncrSource.java#L137) demonstrates how the Hudi incremental source stops proceeding until C1 completes.
+Hudi favors [safety and sacrifices liveness](https://en.wikipedia.org/wiki/Safety_and_liveness_properties) in such a case. For a single writer, the order of transaction start times matches the order of completion, and both incremental and time-travel queries work as expected.
+With multiple writers, incremental queries still work as expected, but time travel queries don't. Since most time travel queries are on historical snapshots with a stable, continuous timeline, this has not been implemented up to Hudi 0.13.
+However, a similar approach to the one above can be applied to time travel queries failing in this window. ### How does Hudi plan to address the liveness issue above for incremental queries? -Hudi has had a proposal to streamline/improve this experience by adding a transition-time to our timeline, which will remove the [liveness sacrifice](https://en.wikipedia.org/wiki/Safety_and_liveness_properties) currently being made and makes it easier to understand. This has been delayed for a few reasons (a) Large hosted query engines and users not upgrading fast enough. (b) the issues brought up - \[[1](https://hudi.apache.org/docs/next/faq#does-hudis-use-of-wall-clock-timestamp-for-i [...]
+Hudi 0.14 improves the liveness aspects by enabling change streams, incremental queries, and time travel based on the file/object's timestamp (similar to [Delta Lake](https://docs.delta.io/latest/delta-batch.html#query-an-older-snapshot-of-a-table-time-travel)).
+
+To expand on the long-term approach, Hudi has had a proposal to streamline/improve this experience by adding a transition-time to our timeline, which will remove the [liveness sacrifice](https://en.wikipedia.org/wiki/Safety_and_liveness_properties) and make it easier to understand.
+This has been delayed for a few reasons:
+
+- Large hosted query engines and users not upgrading fast enough.
+- The issues brought up \[[1](https://hudi.apache.org/docs/next/faq#does-hudis-use-of-wall-clock-timestamp-for-instants-pose-any-clock-skew-issues), [2](https://hudi.apache.org/docs/next/faq#hudis-commits-are-based-on-transaction-start-time-instead-of-completed-time-does-this-cause-data-loss-or-inconsistency-in-case-of-incremental-and-time-travel-queries)\] that are relevant here
+not being of much practical importance to users beyond pedantic discussion.
+- Wanting to do it alongside [non-blocking concurrency control](https://github.com/apache/hudi/pull/7907) in Hudi version 1.x.
+
+It's planned to be addressed in the first 1.x release. ### Does Hudi’s use of wall clock timestamp for instants pose any clock skew issues? diff --git a/website/versioned_docs/version-0.14.0/faq.md b/website/versioned_docs/version-0.14.0/faq.md index a58d32c6ad9..3d24b0c2d60 100644 --- a/website/versioned_docs/version-0.14.0/faq.md +++ b/website/versioned_docs/version-0.14.0/faq.md @@ -120,13 +120,37 @@ Hudi provides snapshot isolation between all three types of processes - writers, ### Hudi’s commits are based on transaction start time instead of completed time. Does this cause data loss or inconsistency in case of incremental and time travel queries?
-Let’s take a closer look at the scenario here: two commits C1 and C2 (with C2 starting later than C1) start with a later commit (C2) finishing first leaving the inflight transaction of the earlier commit (C1) before the completed write of the later transaction (C2) in Hudi’s timeline. This is not an uncommon scenario, especially with various ingestions needs such as backfilling, deleting, bootstrapping, etc alongside regular writes. When/Whether the first job would commit will depend on [...] - -In these scenarios, it might be tempting to think of data inconsistencies/data loss when using Hudi’s incremental queries. However, Hudi takes special handling in incremental queries to ensure that no data is served beyond the point Hudi sees an inflight instant in its timeline, so no data loss or drop happens. In this case, on seeing C1’s inflight commit (publish to timeline is atomic), C2 data (which is > C1 in the timeline) is not served until C1 inflight transitions to a terminal sta [...] +Let’s take a closer look at the scenario here: two commits C1 and C2 (with C2 starting later than C1) begin, and the later commit (C2) finishes first, leaving the inflight transaction of the earlier commit (C1)
+before the completed write of the later transaction (C2) in Hudi’s timeline. This is not an uncommon scenario, especially with various ingestion needs such as backfilling, deleting, bootstrapping, etc.,
+alongside regular writes. When/whether the first job commits will depend on factors such as conflicts between concurrent commits, inflight compactions, and other actions on the table’s timeline.
+If the first job fails for some reason, Hudi will abort the earlier inflight commit (C1), and the writer has to retry with a new instant time > C2, much like other OCC implementations.
+First, for snapshot queries the order of commits should not matter at all, since any incomplete writes on the active timeline are ignored by queries and cause no side effects.
+
+In these scenarios, it might be tempting to suspect data inconsistencies/data loss when using Hudi’s incremental queries. However, Hudi applies special handling
+(examples [1](https://github.com/apache/hudi/blob/aea5bb6f0ab824247f5e3498762ad94f643a2cb6/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java#L76),
+[2](https://github.com/apache/hudi/blame/7a6543958368540d221ddc18e0c12b8d526b6859/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieInputFormatUtils.java#L173)) in incremental queries to ensure that no data
+is served beyond the point where there is an inflight instant in the timeline, so no data loss or drop happens. This detection is made possible because Hudi writers first request a transaction on the timeline, before planning/executing
+the write, as explained in the [timeline](https://hudi.apache.org/docs/timeline#states) section.
+
+In this case, on seeing C1’s inflight commit (publishing to the timeline is atomic), C2’s data (which is > C1 in the timeline) is not served until the C1 inflight instant transitions to a terminal state, such as completed or marked as failed.
+This [test](https://github.com/apache/hudi/blob/master/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestHoodieIncrSource.java#L137) demonstrates how the Hudi incremental source stops proceeding until C1 completes.
+Hudi favors [safety and sacrifices liveness](https://en.wikipedia.org/wiki/Safety_and_liveness_properties) in such a case. For a single writer, the order of transaction start times matches the order of completion, and both incremental and time-travel queries work as expected.
+With multiple writers, incremental queries still work as expected, but time travel queries don't. Since most time travel queries are on historical snapshots with a stable, continuous timeline, this has not been implemented up to Hudi 0.13.
+However, a similar approach to the one above can be applied to time travel queries failing in this window. ### How does Hudi plan to address the liveness issue above for incremental queries? -Hudi has had a proposal to streamline/improve this experience by adding a transition-time to our timeline, which will remove the [liveness sacrifice](https://en.wikipedia.org/wiki/Safety_and_liveness_properties) currently being made and makes it easier to understand. This has been delayed for a few reasons (a) Large hosted query engines and users not upgrading fast enough. (b) the issues brought up - \[[1](https://hudi.apache.org/docs/next/faq#does-hudis-use-of-wall-clock-timestamp-for-i [...] +Hudi 0.14 improves the liveness aspects by enabling change streams, incremental queries, and time travel based on the file/object's timestamp (similar to [Delta Lake](https://docs.delta.io/latest/delta-batch.html#query-an-older-snapshot-of-a-table-time-travel)).
+
+To expand on the long-term approach, Hudi has had a proposal to streamline/improve this experience by adding a transition-time to our timeline, which will remove the [liveness sacrifice](https://en.wikipedia.org/wiki/Safety_and_liveness_properties) and make it easier to understand.
+This has been delayed for a few reasons:
+
+- Large hosted query engines and users not upgrading fast enough.
+- The issues brought up \[[1](https://hudi.apache.org/docs/next/faq#does-hudis-use-of-wall-clock-timestamp-for-instants-pose-any-clock-skew-issues), [2](https://hudi.apache.org/docs/next/faq#hudis-commits-are-based-on-transaction-start-time-instead-of-completed-time-does-this-cause-data-loss-or-inconsistency-in-case-of-incremental-and-time-travel-queries)\] that are relevant here
+  not being of much practical importance to users beyond pedantic discussion.
+- Wanting to do it alongside [non-blocking concurrency control](https://github.com/apache/hudi/pull/7907) in Hudi version 1.x.
+ +It's planned to be addressed in the first 1.x release. ### Does Hudi’s use of wall clock timestamp for instants pose any clock skew issues?
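The inflight-instant guard that the FAQ text in this diff describes can be sketched as follows. This is an illustrative model only, not Hudi's actual code or API: the `Instant` class and `servable_commits` helper are hypothetical, and the sketch assumes instants are ordered by their begin time, as on Hudi's timeline.

```python
# Illustrative sketch (hypothetical, not Hudi's real API): an incremental query
# serves completed commits only up to the earliest non-completed instant, so data
# from a later-finishing commit (C2) is held back while an earlier commit (C1)
# is still inflight. This trades liveness for safety, as the FAQ explains.
from dataclasses import dataclass


@dataclass
class Instant:
    time: str   # instant (begin) time on the timeline, e.g. "C1"
    state: str  # "requested" | "inflight" | "completed"


def servable_commits(timeline):
    """Return the completed instants an incremental query may safely serve."""
    servable = []
    for inst in sorted(timeline, key=lambda i: i.time):
        if inst.state != "completed":
            break  # stop at the earliest inflight/requested instant
        servable.append(inst.time)
    return servable


# C1 started first but is still inflight; C2 finished first.
print(servable_commits([Instant("C1", "inflight"), Instant("C2", "completed")]))
# C2's data is withheld until C1 reaches a terminal state; once both
# complete, both are served in start-time order.
print(servable_commits([Instant("C1", "completed"), Instant("C2", "completed")]))
```

Once C1 transitions to completed (or is marked failed and removed), the next incremental pull picks up both C1 and C2, which is why no data is lost, only delayed.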