bvaradar commented on code in PR #12998: URL: https://github.com/apache/hudi/pull/12998#discussion_r2008515148
##########
rfc/rfc-92/rfc-92.md:
##########
@@ -0,0 +1,102 @@
+
+# RFC-92: Pluggable Table Formats in Hudi
+
+## Proposers
+
+* Balaji Varadarajan
+
+## Approvers
+
+* Vinoth Chandar
+* Ethan Guo
+
+## Status
+
+JIRA: <TBD>
+
+## Abstract
+
+This RFC proposes support for different backing table format implementations inside Hudi. For the past 4 years at-least, we have been consistently defining Hudi as a broader platform and software [stack](https://hudi.apache.org/docs/hudi_stack) that delivers much of these benefits. Hudi's table format makes choices specific to data lake workloads, allowing efficient read/write (even the recent [blog](https://bytearray.substack.com/p/computer-science-behind-lakehouse) from Vinoth), has major differences and advantages compared to other approaches. The community plans to centrally focus on the native Hudi storage format.
+
+However, there may be benefits to allowing other storage layouts/table formats to fit under Hudi's higher level functionality. This also has non-technical benefits of insulating the project from vendor marketing wars. Most contributors (such as myself) are happily part of the global Hudi open-source community, for the sake of just building technology.
+
+## Background
+
+Expanding further, there are plenty of valid technical reasons on why Hudi should allow different storage layouts, under the upper layer reader/writer and table services implementations.
+
+1. We have use-cases, for cloud-native/high performance implementations of timeline (`HoodieTimeline`) and metadata (`HoodieMetadata` interface). In our use-case, we would like to explore backing them using NoSQL datastore like DynamoDB, for ultra-low latency queries.
+2. Hudi already supports [different](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/storage/HoodieStorageLayout.java) storage formats/layouts.
Tables can be bucketed, consistent hashed or organized by having data laid out in order of arrival (default).
+3. Hudi already allows plug-ability and customization at various layers like record merger, indexes and other core write/read paths.
+4. It's very standard practice in databases to allow multiple storage backends (MySQL supports myISAM, innodb/btree, myrocks/lsm). This may be crucial step towards the database northstar vision.

Review Comment:
   True, but beyond the technical reasons, it seems to matter exactly how the bits are written in storage. For example, Hudi's delete block and Delta's deletion vectors may encode the same bitmaps, but everything from the file name to the container format seems to matter for cloud query engines to integrate more easily.
