zhangyue19921010 commented on code in PR #12407:
URL: https://github.com/apache/hudi/pull/12407#discussion_r1868799544
##########
rfc/rfc-60/rfc-60.md:
##########
@@ -18,33 +18,68 @@
 # RFC-60: Federated Storage Layout

 ## Proposers
+- @zhangyue19921010
+- @CTTY
 - @umehrot2

 ## Approvers
 - @vinoth
 - @shivnarayan
+- @yihua

 ## Status
 JIRA: [https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625)

 ## Abstract
+In this RFC, we will support a Federated Storage Layout for Hudi tables, enabling Hudi to support multiple pluggable physical
+storage systems. By combining these with Hudi's own metadata, we can construct logical table views and expose them externally,
+making the underlying storage transparent to the engine.
+
+After implementing the Hudi Federated Storage Layout, we can build many interesting new features for Hudi lake tables on top of it, such as:
+
+### Object Store Optimized Layout
 As you scale your Apache Hudi workloads over cloud object stores like Amazon S3, there is potential of hitting request throttling limits which in turn impacts performance. In this RFC, we are proposing to support an alternate storage layout that is optimized for Amazon S3 and other cloud object stores, which helps achieve maximum throughput and significantly reduce throttling.
+
+### Hudi Storage Cache Layer
+The Hudi lake table data cache layer divides the lake table's physical storage into a high-performance HDFS-based data cache
+layer and a shared HDFS-based data persistence layer. Hot data is initially written to the cache layer and later moved to the
+persistence layer through table services such as Compaction and Clustering. This approach meets the strong demands for
+performance and stability in scenarios involving massive data ingestion into the lake. It is important to note that once data
+is written to the cache layer and committed, it becomes visible to downstream consumers, regardless of when the subsequent
+"moving operations" start or finish, ensuring data visibility and timeliness are unaffected.
+Additionally, since the data movement from the cache layer to the persistence layer leverages the lake table's own Compaction
+and Clustering table service capabilities, this process adheres to the lake table's commit mechanism and MVCC snapshot
+isolation design. Therefore, with the data cache layer enabled, the lake table maintains atomicity, transactional guarantees,
+and exactly-once semantics.
+
+### Hudi Table Second-Level Latency

Review Comment:
   Hey @CTTY, thanks for your review.

   My thought is that Case 3 is another implementation of the Federated Storage Layout, rather than just a concept of a caching layer. In Case 3, we effectively use Kafka as log storage (since both follow an append-only model, like Avro log files) and implement a complete HoodieKafkaLogScanner and a corresponding Compactor.

   Regarding data visibility in Case 3, the data should be visible externally as soon as it enters Kafka, without a strong dependency on Compaction/Clustering operations. On the query side, it returns a union of the data from Kafka and the base files. Of course, this also requires the capability to read/stream the Kafka logs in a read-only manner.

   In summary, this approach aims to achieve second-level latency and visibility for lake tables based on the Hudi Federated Storage Layout. Also updated the RFC accordingly.
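   The "union of Kafka and base files" read can be sketched roughly as below. This is a toy illustration only: the class and method names are hypothetical, and the proposed `HoodieKafkaLogScanner`/Compactor are not implemented here; it just shows the merge semantics where later log records overlay base-file records, as in a merge-on-read snapshot query.

   ```java
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   // Hypothetical sketch: a snapshot view is the base-file records overlaid
   // with newer records still sitting in the Kafka "log". All names here are
   // illustrative, not the RFC's actual classes.
   public class KafkaUnionReadSketch {

     // baseFiles: recordKey -> value, as read from Parquet base files.
     // kafkaLog: (recordKey, value) updates, in Kafka offset order.
     static Map<String, String> unionRead(Map<String, String> baseFiles,
                                          List<Map.Entry<String, String>> kafkaLog) {
       Map<String, String> view = new HashMap<>(baseFiles);
       // Later offsets win, mirroring how log blocks are merged over base
       // files in a merge-on-read snapshot query.
       for (Map.Entry<String, String> rec : kafkaLog) {
         view.put(rec.getKey(), rec.getValue());
       }
       return view;
     }

     public static void main(String[] args) {
       Map<String, String> base = Map.of("k1", "v1", "k2", "v2");
       List<Map.Entry<String, String>> log = List.of(
           Map.entry("k2", "v2-updated"), // update visible without compaction
           Map.entry("k3", "v3"));        // insert visible without compaction
       Map<String, String> view = unionRead(base, log);
       System.out.println(view); // contains k1=v1, k2=v2-updated, k3=v3
     }
   }
   ```

   Note that Compaction here only folds Kafka records into new base files; the query result is the same before and after it runs, which is why visibility has no strong dependency on Compaction/Clustering.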

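   For the Object Store Optimized Layout section in the diff above, the core idea of avoiding throttling is spreading data files across many key prefixes, since S3 enforces request-rate limits per prefix. A minimal sketch of that idea, assuming a CRC32 hash and a 4-digit bucket prefix (both illustrative choices, not the RFC's final design):

   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.zip.CRC32;

   // Hypothetical sketch: bucket each file path by a hash so writes and reads
   // fan out over many object-store prefixes instead of one hot prefix.
   // Class name, hash choice, and path format are assumptions for illustration.
   public class PrefixedPathSketch {

     // Map a partition-relative file to a bucketed storage path under the table base path.
     static String storagePath(String tableBasePath, String partition,
                               String fileName, int numPrefixes) {
       CRC32 crc = new CRC32();
       crc.update((partition + "/" + fileName).getBytes(StandardCharsets.UTF_8));
       long bucket = crc.getValue() % numPrefixes;
       // e.g. s3://bucket/table/0417/2023-01-01/f1.parquet
       return String.format("%s/%04d/%s/%s", tableBasePath, bucket, partition, fileName);
     }

     public static void main(String[] args) {
       // Files from the same partition can land under different prefixes.
       System.out.println(storagePath("s3://bucket/table", "2023-01-01", "f1.parquet", 1024));
       System.out.println(storagePath("s3://bucket/table", "2023-01-01", "f2.parquet", 1024));
     }
   }
   ```

   Since the physical path is no longer derivable from the partition path alone, a layout like this relies on table metadata (e.g. the metadata table's file listings) to map logical file views back to physical locations, which is exactly what the federated layout's logical table view provides.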