zhangyue19921010 commented on code in PR #12407:
URL: https://github.com/apache/hudi/pull/12407#discussion_r1868799544
##########
rfc/rfc-60/rfc-60.md:
##########
@@ -18,33 +18,68 @@
 # RFC-60: Federated Storage Layout

 ## Proposers
+- @zhangyue19921010
+- @CTTY
 - @umehrot2

 ## Approvers
 - @vinoth
 - @shivnarayan
+- @yihua

 ## Status
 JIRA: [https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625)

 ## Abstract
+In this RFC, we will support a Federated Storage Layout for Hudi tables, enabling Hudi to support multiple pluggable physical
+storage systems. By combining these with Hudi's own metadata, we can construct logical table views and expose them externally,
+making the underlying storage transparent to the engine.
+
+After implementing the Hudi Federated Storage Layout, we can build many interesting new features for Hudi lake tables on top of it, such as:
+
+### Object Store Optimized Layout
 As you scale your Apache Hudi workloads over cloud object stores like Amazon S3, there is potential of hitting request throttling limits which in turn impacts performance. In this RFC, we are proposing to support an alternate storage layout that is optimized for Amazon S3 and other cloud object stores, which helps achieve maximum throughput and significantly reduce throttling.
+
+### Hudi Storage Cache Layer
+The Hudi lake table data cache layer divides the lake table's physical storage into a high-performance HDFS-based data cache
+layer and a shared HDFS-based data persistence layer. Hot data is initially written to the cache layer and later moved to the
+persistence layer through table services such as Compaction and Clustering. This approach meets the strong demands for
+performance and stability in scenarios involving massive data ingestion into the lake. It is important to note that once data
+is written to the cache layer and committed, it becomes visible to downstream consumers, regardless of when the subsequent
+"moving operations" start or finish, ensuring data visibility and timeliness are unaffected.
+Additionally, since the data movement from the cache layer to the persistence layer leverages the lake table's own Compaction
+and Clustering table service capabilities, this process adheres to the lake table's commit mechanism and MVCC snapshot
+isolation design. Therefore, with the data cache layer enabled, the lake table maintains atomicity, transactional guarantees,
+and exactly-once semantics.
+
+### Hudi Table Second-Level Latency

Review Comment:
   Hey @CTTY, thanks for your review.

   My thought is that Case 3 is another implementation of the Federated Storage Layout, rather than just a concept of a caching layer. In Case 3, we effectively use Kafka as log storage (since both follow an append-only model, like Avro log files) and implement a complete HoodieKafkaLogScanner and a corresponding Compactor.

   Regarding data visibility in Case 3, the data should be visible externally as soon as it enters Kafka, without a strong dependency on Compaction/Clustering operations. On the query side, it returns a union of the data from Kafka and the base files. Of course, this also requires the capability to read/stream the Kafka logs in a read-only manner.

   In summary, this approach aims to achieve second-level latency and visibility for lake tables based on the Hudi Federated Storage Layout. Also updated the RFC accordingly.
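   The "union of Kafka and base files" read can be sketched roughly as below. This is a toy illustration only: the class and method names are hypothetical, and the proposed `HoodieKafkaLogScanner`/Compactor are not implemented here; it just shows the merge semantics where later log records overlay base-file records, as in a merge-on-read snapshot query.

   ```java
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   // Hypothetical sketch: a snapshot view is the base-file records overlaid
   // with newer records still sitting in the Kafka "log". All names here are
   // illustrative, not the RFC's actual classes.
   public class KafkaUnionReadSketch {

     // baseFiles: recordKey -> value, as read from Parquet base files.
     // kafkaLog: (recordKey, value) updates, in Kafka offset order.
     static Map<String, String> unionRead(Map<String, String> baseFiles,
                                          List<Map.Entry<String, String>> kafkaLog) {
       Map<String, String> view = new HashMap<>(baseFiles);
       // Later offsets win, mirroring how log blocks are merged over base
       // files in a merge-on-read snapshot query.
       for (Map.Entry<String, String> rec : kafkaLog) {
         view.put(rec.getKey(), rec.getValue());
       }
       return view;
     }

     public static void main(String[] args) {
       Map<String, String> base = Map.of("k1", "v1", "k2", "v2");
       List<Map.Entry<String, String>> log = List.of(
           Map.entry("k2", "v2-updated"), // update visible without compaction
           Map.entry("k3", "v3"));        // insert visible without compaction
       Map<String, String> view = unionRead(base, log);
       System.out.println(view); // contains k1=v1, k2=v2-updated, k3=v3
     }
   }
   ```

   Note that Compaction here only folds Kafka records into new base files; the query result is the same before and after it runs, which is why visibility has no strong dependency on Compaction/Clustering.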

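   For the Object Store Optimized Layout section in the diff above, the core idea of avoiding throttling is spreading data files across many key prefixes, since S3 enforces request-rate limits per prefix. A minimal sketch of that idea, assuming a CRC32 hash and a 4-digit bucket prefix (both illustrative choices, not the RFC's final design):

   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.zip.CRC32;

   // Hypothetical sketch: bucket each file path by a hash so writes and reads
   // fan out over many object-store prefixes instead of one hot prefix.
   // Class name, hash choice, and path format are assumptions for illustration.
   public class PrefixedPathSketch {

     // Map a partition-relative file to a bucketed storage path under the table base path.
     static String storagePath(String tableBasePath, String partition,
                               String fileName, int numPrefixes) {
       CRC32 crc = new CRC32();
       crc.update((partition + "/" + fileName).getBytes(StandardCharsets.UTF_8));
       long bucket = crc.getValue() % numPrefixes;
       // e.g. s3://bucket/table/0417/2023-01-01/f1.parquet
       return String.format("%s/%04d/%s/%s", tableBasePath, bucket, partition, fileName);
     }

     public static void main(String[] args) {
       // Files from the same partition can land under different prefixes.
       System.out.println(storagePath("s3://bucket/table", "2023-01-01", "f1.parquet", 1024));
       System.out.println(storagePath("s3://bucket/table", "2023-01-01", "f2.parquet", 1024));
     }
   }
   ```

   Since the physical path is no longer derivable from the partition path alone, a layout like this relies on table metadata (e.g. the metadata table's file listings) to map logical file views back to physical locations, which is exactly what the federated layout's logical table view provides.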