zhangyue19921010 commented on code in PR #12407:
URL: https://github.com/apache/hudi/pull/12407#discussion_r1870573138
##########
rfc/rfc-60/rfc-60.md:
##########
@@ -178,6 +275,130 @@ The hashing function should be made user configurable for use cases like bucketi
sub-partitioning/re-hash to reduce the number of hash prefixes. Having too many unique hash prefixes
would make files too dispersed, and affect performance on other operations such as listing.
+### Case 2: Hudi Storage Cache Layer
+
+The cache layer of a lake table is a specific implementation scenario of the Federated Storage Layout in Hudi tables.
+It divides the physical storage of the lake table into a high-performance HDFS-based cache layer and a shared HDFS-based persistent layer.
+Hot data is first written to the cache layer and later moved to the persistent layer through table services such as Compaction and Clustering.
+This setup addresses the strong demands for performance and stability in scenarios involving massive data ingestion into the lake.
+It is important to note that once data is written to the cache layer and committed, it becomes visible to downstream processes, regardless of when
+the "relocation work" starts or finishes. Additionally, since the data relocation from the cache layer to the persistent layer leverages the lake table's
+own Compaction and Clustering capabilities, this process adheres to the lake table's commit mechanism and MVCC snapshot-isolation design. Therefore,
+with the cache layer enabled, the lake table retains atomicity, transactional guarantees, and exactly-once semantics.
+
+Below is a comparison between the normal read/write process of a lake table and the process after enabling the lake table cache layer.
+Green arrows indicate data reads, and red arrows indicate data writes. Before enabling the cache layer, the compute engine writes
+data directly to the shared HDFS and commits, including Parquet base files, log files, and metadata files. The compute engine queries the lake table
+by reading data directly from the shared HDFS. Likewise, the Clustering and Compaction table services read data directly from
+the shared HDFS, process it, and write it back to the shared HDFS. After enabling the lake table cache layer, the compute engine first writes hot
+data to the high-performance HDFS and commits, including Parquet files and log files. During queries, a unified logical view spanning both the cache layer
+and the persistent layer is constructed to serve query demands. The Clustering and Compaction table services, while performing regular lake table file
+organization, also relocate data from the cache layer to the persistent layer. Notably, regardless of when Clustering and Compaction jobs
+start or finish, the data visible to downstream processes is always complete and up to date.
+
+Original Read/Write workflow
+
+
+Read/Write workflow with the Hudi cache layer enabled
+
+
+#### HoodieCacheLayerStorageStrategy
+Based on the HoodieActiveTimeline and the current write instant, the strategy determines the specific write path. For common commit operations
+in COW (Copy on Write) tables and delta commit operations in MOR (Merge on Read) tables, we generate a cache-layer-related storage
+path for reads and writes. This type of I/O targets the cache layer.
+
+For commit actions in MOR tables and replace commit actions in COW tables, we generate a persistent-layer storage path, which lets
+Compaction/Clustering perform the data migration work from the cache layer to the persistent layer.
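The routing rule above can be sketched as follows. `CachePathRouter` and `basePathFor` are illustrative names invented for this sketch, not the PR's actual API; the action strings follow Hudi's timeline action names:

```java
// Illustrative sketch of the routing rule described above:
// COW commit / MOR delta commit  -> cache layer (hot writes),
// MOR commit (Compaction result) / COW replace commit (Clustering result)
//                                -> persistent layer.
public class CachePathRouter {

  public enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

  public static String basePathFor(TableType type, String action,
                                   String cacheBase, String persistentBase) {
    boolean hotWrite =
        (type == TableType.COPY_ON_WRITE && "commit".equals(action))
            || (type == TableType.MERGE_ON_READ && "deltacommit".equals(action));
    // Everything else (notably compaction "commit" on MOR and clustering
    // "replacecommit" on COW) is rewritten data bound for the persistent layer.
    return hotWrite ? cacheBase : persistentBase;
  }
}
```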
+
+Note: Clustering must be enabled for COW tables and Compaction must be enabled for MOR tables; otherwise, there is a
+risk of storage overflow in the cache layer.
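For instance, the required table services could be switched on with Hudi's standard inline-service options (the trigger values below are illustrative):

```properties
# MOR + upsert: inline compaction regularly rewrites cache-layer log files
# into the persistent layer
hoodie.compact.inline=true
hoodie.compact.inline.max.delta.commits=5

# COW + insert: inline clustering relocates cache-layer base files
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
```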
+
+```java
+/**
+ * When using the Storage Cache Layer, make sure the corresponding table service is enabled:
+ * 1. MOR + Upsert + Compaction
+ * 2. COW + Insert + Clustering
+ */
+public class HoodieCacheLayerStorageStrategy extends HoodieDefaultStorageStrategy {
+
+ private HoodieTableType tableType;
+ private String hoodieStorageStrategyModifyTime;
Review Comment:
No need actually. As we discussed, we only focus on the abstraction, so this has been removed.