zhangyue19921010 commented on code in PR #12407:
URL: https://github.com/apache/hudi/pull/12407#discussion_r1870573138
##########
rfc/rfc-60/rfc-60.md:
##########
@@ -178,6 +275,130 @@ The hashing function should be made user configurable for use cases like bucketi
sub-partitioning/re-hash to reduce the number of hash prefixes. Having too many unique hash prefixes
would make files too dispersed, and affect performance on other operations such as listing.
+### Case 2: Hudi Storage Cache Layer
+
+The cache layer of a lake table is a specific implementation scenario of the Federated Storage Layout in Hudi tables.
+It divides the physical storage of the lake table into a high-performance HDFS-based cache layer and a shared HDFS-based persistent layer.
+Hot data is first written to the cache layer and later moved to the persistent layer through table services such as Compaction and Clustering.
+This setup addresses the strong demands for performance and stability in scenarios involving massive data ingestion into the lake.
+It is important to note that once data is written to the cache layer and committed, it becomes visible to downstream processes, regardless of when
+the "relocation work" starts or finishes. Additionally, since the data relocation from the cache layer to the persistent layer leverages the lake table's
+own Compaction and Clustering capabilities, this process adheres to the lake table's commit mechanism and MVCC snapshot-isolation design. Therefore,
+with the cache layer enabled, the lake table retains atomicity, transactional guarantees, and exactly-once semantics.
+
+Below is a comparison between the normal read/write process of a lake table and the process after enabling the lake table cache layer.
+Green arrows indicate data reads, and red arrows indicate data writes. Before enabling the cache layer, the compute engine writes
+data directly to the shared HDFS and commits, including Parquet base files, log files, and metadata files. The compute engine queries the lake table
+by reading data directly from the shared HDFS. Likewise, the Clustering and Compaction table services read data directly from
+the shared HDFS, process it, and write it back to the shared HDFS. After enabling the lake table cache layer, the compute engine first writes hot
+data to the high-performance HDFS and commits, including Parquet files and log files. During queries, a unified logical view spanning both the cache layer
+and the persistent layer is constructed to serve query demands. The Clustering and Compaction table services, while performing regular lake table file
+organization, also relocate data from the cache layer to the persistent layer. Notably, regardless of when Clustering and Compaction jobs
+start or finish, the data visible to downstream processes is always complete and up to date.
+
+Original Read/Write workflow
+
+
+Read/Write workflow with the Hudi cache layer enabled
+
+
+#### HoodieCacheLayerStorageStrategy
+Based on the HoodieActiveTimeline and the current write instant, the strategy determines the specific write path. For common commit operations
+in COW (Copy on Write) tables and delta commit operations in MOR (Merge on Read) tables, we generate a cache-layer-related storage
+path for reads and writes. This type of I/O targets the cache layer.
+
+For commit actions in MOR tables and replace commit actions in COW tables, we generate a persistent-layer storage path, which lets
+Compaction/Clustering perform the data migration work from the cache layer to the persistent layer.
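The routing rule above can be sketched as follows. `CachePathRouter` and `basePathFor` are illustrative names invented for this sketch, not the PR's actual API; the action strings follow Hudi's timeline action names:

```java
// Illustrative sketch of the routing rule described above:
// COW commit / MOR delta commit  -> cache layer (hot writes),
// MOR commit (Compaction result) / COW replace commit (Clustering result)
//                                -> persistent layer.
public class CachePathRouter {

  public enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

  public static String basePathFor(TableType type, String action,
                                   String cacheBase, String persistentBase) {
    boolean hotWrite =
        (type == TableType.COPY_ON_WRITE && "commit".equals(action))
            || (type == TableType.MERGE_ON_READ && "deltacommit".equals(action));
    // Everything else (notably compaction "commit" on MOR and clustering
    // "replacecommit" on COW) is rewritten data bound for the persistent layer.
    return hotWrite ? cacheBase : persistentBase;
  }
}
```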
+
+Note: Clustering must be enabled for COW tables and Compaction must be enabled for MOR tables; otherwise, there is a
+risk of storage overflow in the cache layer.
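For instance, the required table services could be switched on with Hudi's standard inline-service options (the trigger values below are illustrative):

```properties
# MOR + upsert: inline compaction regularly rewrites cache-layer log files
# into the persistent layer
hoodie.compact.inline=true
hoodie.compact.inline.max.delta.commits=5

# COW + insert: inline clustering relocates cache-layer base files
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
```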
+
+```java
+/**
+ * When using the Storage Cache Layer, make sure the corresponding table service is enabled:
+ * 1. MOR + Upsert + Compaction
+ * 2. COW + Insert + Clustering
+ */
+public class HoodieCacheLayerStorageStrategy extends HoodieDefaultStorageStrategy {
+
+ private HoodieTableType tableType;
+ private String hoodieStorageStrategyModifyTime;
Review Comment:
No need actually. As we discussed, we only focus on the abstraction, so this has been removed.