This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 0752f62b5f6 [MINOR] Fix RFC index and updating 1.0 RFCs (#12480)
0752f62b5f6 is described below
commit 0752f62b5f619238e6161fb40dc6b07e0037bb1b
Author: vinoth chandar <[email protected]>
AuthorDate: Thu Dec 12 15:12:10 2024 -0800
[MINOR] Fix RFC index and updating 1.0 RFCs (#12480)
- Made a pass to fix status of all RFCs
- Complete RFC-69
- Redo RFC-77 based on new changes as of final RC
---
rfc/README.md | 166 +++++++++++++++++++++++++--------------------------
rfc/rfc-69/rfc-69.md | 4 +-
rfc/rfc-77/rfc-77.md | 148 +++++++++++++--------------------------------
3 files changed, 126 insertions(+), 192 deletions(-)
diff --git a/rfc/README.md b/rfc/README.md
index 9736a84ed0d..56765a620c8 100644
--- a/rfc/README.md
+++ b/rfc/README.md
@@ -34,87 +34,87 @@ The list of all RFCs can be found here.
> Older RFC content is still
> [here](https://cwiki.apache.org/confluence/display/HUDI/RFC+Process).
-| RFC Number | Title | Status |
-|------------|-------|--------|
-| 1 | [CSV Source Support for Delta Streamer](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+01+%3A+CSV+Source+Support+for+Delta+Streamer) | `COMPLETED` |
-| 2 | [ORC Storage in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708439) | `COMPLETED` |
-| 3 | [Timeline Service with Incremental File System View Syncing](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708965) | `COMPLETED` |
-| 4 | [Faster Hive incremental pull queries](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115513622) | `COMPLETED` |
-| 5 | [HUI (Hudi WebUI)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233) | `ABANDONED` |
-| 6 | [Add indexing support to the log file](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+06+%3A+Add+indexing+support+to+the+log+file) | `ABANDONED` |
-| 7 | [Point in time Time-Travel queries on Hudi table](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+07+%3A+Point+in+time+Time-Travel+queries+on+Hudi+table) | `COMPLETED` |
-| 8 | [Metadata based Record Index](./rfc-8/rfc-8.md) | `COMPLETED` |
-| 9 | [Hudi Dataset Snapshot Exporter](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter) | `COMPLETED` |
-| 10 | [Restructuring and auto-generation of docs](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+10+%3A+Restructuring+and+auto-generation+of+docs) | `COMPLETED` |
-| 11 | [Refactor of the configuration framework of hudi project](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+11+%3A+Refactor+of+the+configuration+framework+of+hudi+project) | `ABANDONED` |
-| 12 | [Efficient Migration of Large Parquet Tables to Apache Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi) | `COMPLETED` |
-| 13 | [Integrate Hudi with Flink](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=141724520) | `COMPLETED` |
-| 14 | [JDBC incremental puller](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller) | `COMPLETED` |
-| 15 | [HUDI File Listing Improvements](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements) | `COMPLETED` |
-| 16 | [Abstraction for HoodieInputFormat and RecordReader](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader) | `COMPLETED` |
-| 17 | [Abstract common meta sync module support multiple meta service](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+17+Abstract+common+meta+sync+module+support+multiple+meta+service) | `COMPLETED` |
-| 18 | [Insert Overwrite API](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API) | `COMPLETED` |
-| 19 | [Clustering data for freshness and query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance) | `COMPLETED` |
-| 20 | [handle failed records](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+20+%3A+handle+failed+records) | `ONGOING` |
-| 21 | [Allow HoodieRecordKey to be Virtual](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual) | `COMPLETED` |
+| RFC Number | Title | Status |
+|------------|-------|--------|
+| 1 | [CSV Source Support for Delta Streamer](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+01+%3A+CSV+Source+Support+for+Delta+Streamer) | `COMPLETED` |
+| 2 | [ORC Storage in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708439) | `COMPLETED` |
+| 3 | [Timeline Service with Incremental File System View Syncing](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708965) | `COMPLETED` |
+| 4 | [Faster Hive incremental pull queries](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115513622) | `COMPLETED` |
+| 5 | [HUI (Hudi WebUI)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233) | `ABANDONED` |
+| 6 | [Add indexing support to the log file](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+06+%3A+Add+indexing+support+to+the+log+file) | `ABANDONED` |
+| 7 | [Point in time Time-Travel queries on Hudi table](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+07+%3A+Point+in+time+Time-Travel+queries+on+Hudi+table) | `COMPLETED` |
+| 8 | [Metadata based Record Index](./rfc-8/rfc-8.md) | `COMPLETED` |
+| 9 | [Hudi Dataset Snapshot Exporter](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter) | `COMPLETED` |
+| 10 | [Restructuring and auto-generation of docs](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+10+%3A+Restructuring+and+auto-generation+of+docs) | `COMPLETED` |
+| 11 | [Refactor of the configuration framework of hudi project](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+11+%3A+Refactor+of+the+configuration+framework+of+hudi+project) | `ABANDONED` |
+| 12 | [Efficient Migration of Large Parquet Tables to Apache Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi) | `COMPLETED` |
+| 13 | [Integrate Hudi with Flink](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=141724520) | `COMPLETED` |
+| 14 | [JDBC incremental puller](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller) | `COMPLETED` |
+| 15 | [HUDI File Listing Improvements](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements) | `COMPLETED` |
+| 16 | [Abstraction for HoodieInputFormat and RecordReader](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader) | `COMPLETED` |
+| 17 | [Abstract common meta sync module support multiple meta service](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+17+Abstract+common+meta+sync+module+support+multiple+meta+service) | `COMPLETED` |
+| 18 | [Insert Overwrite API](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API) | `COMPLETED` |
+| 19 | [Clustering data for freshness and query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance) | `COMPLETED` |
+| 20 | [handle failed records](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+20+%3A+handle+failed+records) | `ONGOING` |
+| 21 | [Allow HoodieRecordKey to be Virtual](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual) | `COMPLETED` |
| 22 | [Snapshot Isolation using Optimistic Concurrency Control for multi-writers](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+22+%3A+Snapshot+Isolation+using+Optimistic+Concurrency+Control+for+multi-writers) | `COMPLETED` |
-| 23 | [Hudi Observability metrics collection](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+23+%3A+Hudi+Observability+metrics+collection) | `ABANDONED` |
-| 24 | [Hoodie Flink Writer Proposal](https://cwiki.apache.org/confluence/display/HUDI/RFC-24%3A+Hoodie+Flink+Writer+Proposal) | `COMPLETED` |
-| 25 | [Spark SQL Extension For Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+25%3A+Spark+SQL+Extension+For+Hudi) | `COMPLETED` |
-| 26 | [Optimization For Hudi Table Query](https://cwiki.apache.org/confluence/display/HUDI/RFC-26+Optimization+For+Hudi+Table+Query) | `COMPLETED` |
-| 27 | [Data skipping index to improve query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance) | `COMPLETED` |
-| 28 | [Support Z-order curve](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181307144) | `COMPLETED` |
-| 29 | [Hash Index](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index) | `COMPLETED` |
-| 30 | [Batch operation](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+30%3A+Batch+operation) | `ABANDONED` |
-| 31 | [Hive integration Improvement](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+31%3A+Hive+integration+Improvment) | `ONGOING` |
-| 32 | [Kafka Connect Sink for Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC-32+Kafka+Connect+Sink+for+Hudi) | `ONGOING` |
-| 33 | [Hudi supports more comprehensive Schema Evolution](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution) | `COMPLETED` |
-| 34 | [Hudi BigQuery Integration](./rfc-34/rfc-34.md) | `COMPLETED` |
-| 35 | [Make Flink MOR table writing streaming friendly](https://cwiki.apache.org/confluence/display/HUDI/RFC-35%3A+Make+Flink+MOR+table+writing+streaming+friendly) | `UNDER REVIEW` |
-| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server) | `ONGOING` |
-| 37 | [Hudi Metadata based Bloom Index](rfc-37/rfc-37.md) | `ONGOING` |
-| 38 | [Spark Datasource V2 Integration](./rfc-38/rfc-38.md) | `COMPLETED` |
-| 39 | [Incremental source for Debezium](./rfc-39/rfc-39.md) | `COMPLETED` |
-| 40 | [Hudi Connector for Trino](./rfc-40/rfc-40.md) | `COMPLETED` |
-| 41 | [Hudi Snowflake Integration](./rfc-41/rfc-41.md) | `IN PROGRESS` |
-| 42 | [Consistent Hashing Index](./rfc-42/rfc-42.md) | `ONGOING` |
-| 43 | [Table Management Service](./rfc-43/rfc-43.md) | `IN PROGRESS` |
-| 44 | [Hudi Connector for Presto](./rfc-44/rfc-44.md) | `COMPLETED` |
-| 45 | [Asynchronous Metadata Indexing](./rfc-45/rfc-45.md) | `COMPLETED` |
-| 46 | [Optimizing Record Payload Handling](./rfc-46/rfc-46.md) | `ONGOING` |
-| 47 | [Add Call Produce Command for Spark SQL](./rfc-47/rfc-47.md) | `COMPLETED` |
-| 48 | [LogCompaction for MOR tables](./rfc-48/rfc-48.md) | `ONGOING` |
-| 49 | [Support sync with DataHub](./rfc-49/rfc-49.md) | `COMPLETED` |
-| 50 | [Improve Timeline Server](./rfc-50/rfc-50.md) | `IN PROGRESS` |
-| 51 | [Change Data Capture](./rfc-51/rfc-51.md) | `ONGOING` |
-| 52 | [Introduce Secondary Index to Improve HUDI Query Performance](./rfc-52/rfc-52.md) | `ONGOING` |
-| 53 | [Use Lock-Free Message Queue Improving Hoodie Writing Efficiency](./rfc-53/rfc-53.md) | `COMPLETED` |
-| 54 | [New Table APIs and Streamline Hudi Configs](./rfc-54/rfc-54.md) | `UNDER REVIEW` |
-| 55 | [Improve Hive/Meta sync class design and hierarchies](./rfc-55/rfc-55.md) | `COMPLETED` |
-| 56 | [Early Conflict Detection For Multi-Writer](./rfc-56/rfc-56.md) | `COMPLETED` |
-| 57 | [DeltaStreamer Protobuf Support](./rfc-57/rfc-57.md) | `COMPLETED` |
-| 58 | [Integrate column stats index with all query engines](./rfc-58/rfc-58.md) | `UNDER REVIEW` |
-| 59 | [Multiple event_time Fields Latest Verification in a Single Table](./rfc-59/rfc-59.md) | `UNDER REVIEW` |
-| 60 | [Federated Storage Layer](./rfc-60/rfc-60.md) | `IN PROGRESS` |
-| 61 | [Snapshot view management](./rfc-61/rfc-61.md) | `UNDER REVIEW` |
-| 62 | [Diagnostic Reporter](./rfc-62/rfc-62.md) | `UNDER REVIEW` |
-| 63 | [Expression Indexes](./rfc-63/rfc-63.md) | `UNDER REVIEW` |
-| 64 | [New Hudi Table Spec API for Query Integrations](./rfc-64/rfc-64.md) | `UNDER REVIEW` |
-| 65 | [Partition TTL Management](./rfc-65/rfc-65.md) | `UNDER REVIEW` |
-| 66 | [Lockless Multi-Writer Support](./rfc-66/rfc-66.md) | `UNDER REVIEW` |
-| 67 | [Hudi Bundle Standards](./rfc-67/rfc-67.md) | `UNDER REVIEW` |
-| 68 | [A More Effective HoodieMergeHandler for COW Table with Parquet](./rfc-68/rfc-68.md) | `UNDER REVIEW` |
-| 69 | [Hudi 1.x](./rfc-69/rfc-69.md) | `UNDER REVIEW` |
-| 70 | [Hudi Reverse Streamer](./rfc/rfc-70/rfc-70.md) | `UNDER REVIEW` |
-| 71 | [Enhance OCC conflict detection](./rfc/rfc-71/rfc-71.md) | `UNDER REVIEW` |
-| 72 | [Redesign Hudi-Spark Integration](./rfc/rfc-72/rfc-72.md) | `ONGOING` |
-| 73 | [Multi-Table Transactions](./rfc-73/rfc-73.md) | `UNDER REVIEW` |
-| 74 | [`HoodieStorage`: Hudi Storage Abstraction and APIs](./rfc-74/rfc-74.md) | `UNDER REVIEW` |
-| 75 | [Hudi-Native HFile Reader and Writer](./rfc-75/rfc-75.md) | `UNDER REVIEW` |
-| 76 | [Auto Record key generation](./rfc-76/rfc-76.md) | `IN PROGRESS` |
-| 77 | [Secondary Index](./rfc-77/rfc-77.md) | `UNDER REVIEW` |
-| 78 | [1.0 Migration](./rfc-78/rfc-78.md) | `IN PROGRESS` |
-| 79 | [Robust handling of spark task retries and failures](./rfc-79/rfc-79.md) | `IN PROGRESS` |
-| 80 | [Column Families](./rfc-80/rfc-80.md) | `UNDER REVIEW` |
-| 81 | [Log Compaction with Merge Sort](./rfc-81/rfc-81.md) | `UNDER REVIEW` |
-| 82 | [Concurrent schema evolution detection](./rfc-82/rfc-82.md) | `UNDER REVIEW` |
+| 23 | [Hudi Observability metrics collection](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+23+%3A+Hudi+Observability+metrics+collection) | `ABANDONED` |
+| 24 | [Hoodie Flink Writer Proposal](https://cwiki.apache.org/confluence/display/HUDI/RFC-24%3A+Hoodie+Flink+Writer+Proposal) | `COMPLETED` |
+| 25 | [Spark SQL Extension For Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+25%3A+Spark+SQL+Extension+For+Hudi) | `COMPLETED` |
+| 26 | [Optimization For Hudi Table Query](https://cwiki.apache.org/confluence/display/HUDI/RFC-26+Optimization+For+Hudi+Table+Query) | `COMPLETED` |
+| 27 | [Data skipping index to improve query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance) | `COMPLETED` |
+| 28 | [Support Z-order curve](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181307144) | `COMPLETED` |
+| 29 | [Hash Index](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index) | `COMPLETED` |
+| 30 | [Batch operation](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+30%3A+Batch+operation) | `ABANDONED` |
+| 31 | [Hive integration Improvement](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+31%3A+Hive+integration+Improvment) | `ONGOING` |
+| 32 | [Kafka Connect Sink for Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC-32+Kafka+Connect+Sink+for+Hudi) | `ONGOING` |
+| 33 | [Hudi supports more comprehensive Schema Evolution](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution) | `COMPLETED` |
+| 34 | [Hudi BigQuery Integration](./rfc-34/rfc-34.md) | `COMPLETED` |
+| 35 | [Make Flink MOR table writing streaming friendly](https://cwiki.apache.org/confluence/display/HUDI/RFC-35%3A+Make+Flink+MOR+table+writing+streaming+friendly) | `COMPLETED` |
+| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server) | `ONGOING` |
+| 37 | [Hudi Metadata based Bloom Index](rfc-37/rfc-37.md) | `COMPLETED` |
+| 38 | [Spark Datasource V2 Integration](./rfc-38/rfc-38.md) | `COMPLETED` |
+| 39 | [Incremental source for Debezium](./rfc-39/rfc-39.md) | `COMPLETED` |
+| 40 | [Connector for Trino](./rfc-40/rfc-40.md) | `COMPLETED` |
+| 41 | [Snowflake Integration](./rfc-41/rfc-41.md), supported via [Apache XTable (Incubating)](https://xtable.apache.org/) | `ABANDONED` |
+| 42 | [Consistent Hashing Index](./rfc-42/rfc-42.md) | `ONGOING` |
+| 43 | [Table Management Service](./rfc-43/rfc-43.md) | `ONGOING` |
+| 44 | [Hudi Connector for Presto](./rfc-44/rfc-44.md) | `COMPLETED` |
+| 45 | [Asynchronous Metadata Indexing](./rfc-45/rfc-45.md) | `COMPLETED` |
+| 46 | [Optimizing Record Payload Handling](./rfc-46/rfc-46.md) | `COMPLETED` |
+| 47 | [Add Call Produce Command for Spark SQL](./rfc-47/rfc-47.md) | `COMPLETED` |
+| 48 | [LogCompaction for MOR tables](./rfc-48/rfc-48.md) | `COMPLETED` |
+| 49 | [Support sync with DataHub](./rfc-49/rfc-49.md) | `COMPLETED` |
+| 50 | [Improve Timeline Server](./rfc-50/rfc-50.md) | `IN PROGRESS` |
+| 51 | [Change Data Capture](./rfc-51/rfc-51.md) | `ONGOING` |
+| 52 | [Introduce Secondary Index to Improve HUDI Query Performance](./rfc-52/rfc-52.md) | `ABANDONED` |
+| 53 | [Use Lock-Free Message Queue Improving Hoodie Writing Efficiency](./rfc-53/rfc-53.md) | `COMPLETED` |
+| 54 | [New Table APIs and Streamline Hudi Configs](./rfc-54/rfc-54.md) | `UNDER REVIEW` |
+| 55 | [Improve Hive/Meta sync class design and hierarchies](./rfc-55/rfc-55.md) | `COMPLETED` |
+| 56 | [Early Conflict Detection For Multi-Writer](./rfc-56/rfc-56.md) | `COMPLETED` |
+| 57 | [DeltaStreamer Protobuf Support](./rfc-57/rfc-57.md) | `COMPLETED` |
+| 58 | [Integrate column stats index with all query engines](./rfc-58/rfc-58.md) | `UNDER REVIEW` |
+| 59 | [Multiple event_time Fields Latest Verification in a Single Table](./rfc-59/rfc-59.md) | `UNDER REVIEW` |
+| 60 | [Federated Storage Layer](./rfc-60/rfc-60.md) | `UNDER REVIEW` |
+| 61 | [Snapshot view management](./rfc-61/rfc-61.md) | `UNDER REVIEW` |
+| 62 | [Diagnostic Reporter](./rfc-62/rfc-62.md) | `UNDER REVIEW` |
+| 63 | [Expression Indexes](./rfc-63/rfc-63.md) | `ONGOING` |
+| 64 | [New Hudi Table Spec API for Query Integrations](./rfc-64/rfc-64.md) | `UNDER REVIEW` |
+| 65 | [Partition TTL Management](./rfc-65/rfc-65.md) | `UNDER REVIEW` |
+| 66 | [Non Blocking Concurrency Control](./rfc-66/rfc-66.md) | `UNDER REVIEW` |
+| 67 | [Hudi Bundle Standards](./rfc-67/rfc-67.md) | `UNDER REVIEW` |
+| 68 | [A More Effective HoodieMergeHandler for COW Table with Parquet](./rfc-68/rfc-68.md) | `UNDER REVIEW` |
+| 69 | [Hudi 1.x](./rfc-69/rfc-69.md) | `COMPLETED` |
+| 70 | [Hudi Reverse Streamer](./rfc/rfc-70/rfc-70.md) | `UNDER REVIEW` |
+| 71 | [Enhance OCC conflict detection](./rfc/rfc-71/rfc-71.md) | `UNDER REVIEW` |
+| 72 | [Redesign Hudi-Spark Integration](./rfc/rfc-72/rfc-72.md) | `ONGOING` |
+| 73 | [Multi-Table Transactions](./rfc-73/rfc-73.md) | `UNDER REVIEW` |
+| 74 | [`HoodieStorage`: Hudi Storage Abstraction and APIs](./rfc-74/rfc-74.md) | `ONGOING` |
+| 75 | [Hudi-Native HFile Reader and Writer](./rfc-75/rfc-75.md) | `IN PROGRESS` |
+| 76 | [Auto Record key generation](./rfc-76/rfc-76.md) | `IN PROGRESS` |
+| 77 | [Secondary Index](./rfc-77/rfc-77.md) | `ONGOING` |
+| 78 | [1.0 Migration](./rfc-78/rfc-78.md) | `IN PROGRESS` |
+| 79 | [Robust handling of spark task retries and failures](./rfc-79/rfc-79.md) | `IN PROGRESS` |
+| 80 | [Column Families](./rfc-80/rfc-80.md) | `UNDER REVIEW` |
+| 81 | [Log Compaction with Merge Sort](./rfc-81/rfc-81.md) | `UNDER REVIEW` |
+| 82 | [Concurrent schema evolution detection](./rfc-82/rfc-82.md) | `UNDER REVIEW` |
diff --git a/rfc/rfc-69/rfc-69.md b/rfc/rfc-69/rfc-69.md
index 7e2820fa1b4..4ca3cb95546 100644
--- a/rfc/rfc-69/rfc-69.md
+++ b/rfc/rfc-69/rfc-69.md
@@ -26,7 +26,7 @@
## Status
-Under Review
+Completed
## Abstract
@@ -164,7 +164,7 @@ JIRA Issues Filter for 1.0: [link](https://issues.apache.org/jira/issues/?filter
| Presto/Trino queries | Change Presto/Trino connectors to work with new format, integrate fully with metadata | | [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-4394), [HUDI-4394](https://issues.apache.org/jira/browse/HUDI-4394), [HUDI-4552](https://issues.apache.org/jira/browse/HUDI-4552) |
-## Follow-on/1.1 Release
+## Follow-on releases
The RFC feedback process has generated some awesome new ideas, and we propose to have the following be taken up post 1.0 release,
for easy sequencing of these projects. However, contributors can feel free to drive these JIRAs/designs as they see fit.
diff --git a/rfc/rfc-77/rfc-77.md b/rfc/rfc-77/rfc-77.md
index dd488033ecb..1b69e8c8ec4 100644
--- a/rfc/rfc-77/rfc-77.md
+++ b/rfc/rfc-77/rfc-77.md
@@ -19,7 +19,6 @@
## Proposers
-- @bhat-vinay
- @codope
## Approvers
@@ -36,7 +35,7 @@ JIRA: https://issues.apache.org/jira/browse/HUDI-7146
In this RFC, we propose implementing Secondary Indexes (SI), a new capability in Hudi's metadata table (MDT) based
indexing system. SI are indexes defined on user specified columns of the table. Similar to record level indexes,
-SI will improve query performance when the query predicate contains secondary keys. The number of files
+SI will improve query performance when the query predicate involves those secondary columns. The number of files
that a query needs to scan can be pruned down using secondary indexes.
## Background
@@ -60,6 +59,7 @@ not in the scope of this RFC.
## Design and Implementation
This section discusses briefly the goals, design, implementation details of supporting SI in Hudi. At a high level,
the design principle and goals are as follows:
+
1. User specifies SI to be built on a given column of a table. A given SI can be built on only one column of the table
   (i.e composite keys are not allowed). Any number of SI can be built on a Hudi table. The indexes to be built are
   specified using regular SQL statements.
@@ -72,9 +72,10 @@ indexes.
SI can be created using the regular `CREATE INDEX` SQL statement.
```
-- PROPOSED SYNTAX WITH `secondary_index` as the index type --
-CREATE INDEX [IF NOT EXISTS] index_name ON [TABLE] table_name [USING secondary_index](index_column)
+CREATE INDEX [IF NOT EXISTS] index_name ON [TABLE] table_name [USING index_type](index_column)
-- Examples --
-CREATE INDEX idx_city on hudi_table USING secondary_index(city)
+CREATE INDEX idx_city on hudi_table USING bloom_filters(city)
+-- Default is to create a hash based record level index mapping secondary column to RLI entries.
CREATE INDEX idx_last_name on hudi_table (last_name)
-- NO CHANGE IN DROP INDEX --
@@ -88,7 +89,7 @@ in MDT by prefixing `secondary_index_`. If the `index_name` is `idx_city`, then
The index_type will be `secondary_index`. This will be used to distinguish SI from other Functional Indexes.
### Secondary Index Metadata
-Secondary index metadata will be managed the same way as Functional Index metadata. Since SI will not have any function
+Secondary index metadata will be managed the same way as Expression Index metadata. If the SI does not have any function
to be applied on each row, the `function_name` will be NULL.
### Index in Metadata Table (MDT)
@@ -97,7 +98,7 @@ prefixing `secondary_index_`. Each entry in the SI partition will be a mapping of
`secondary_key -> record_key`. `secondary_key` will form the "record key" for the record of the SI partition. Note that
an important design consideration here is that users may choose to build SI on a non-unique column of the table.
-#### Index Initialisation
+#### Index Initialization
Initial build of the secondary index will scan all file slices (of the base table) to extract
`secondary-key -> record-key` tuple and write it into the secondary index partition in the metadata table.
This is similar to how RLI is initialised.
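The initial build described above can be sketched as follows. This is an editor's illustration only: `BaseRecord` and `buildIndex` are simplified stand-ins for Hudi's file-slice readers and metadata writer, not actual Hudi APIs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: scan base-table rows and emit the SI tuples that
// would be written to the secondary index partition of the MDT.
public class SecondaryIndexBootstrap {

  // Minimal stand-in for a row read from a base-table file slice.
  public record BaseRecord(String recordKey, Map<String, String> columns) {}

  // Emit one `secondary-key -> record-key` tuple per row, analogous to the
  // initial RLI build described above.
  public static List<Map.Entry<String, String>> buildIndex(
      List<BaseRecord> fileSliceRecords, String indexedColumn) {
    return fileSliceRecords.stream()
        .map(r -> new SimpleEntry<String, String>(r.columns().get(indexedColumn), r.recordKey()))
        .collect(Collectors.toList());
  }
}
```

Note that nothing here enforces uniqueness of the secondary key; two rows with the same column value simply produce two tuples, which is exactly the non-unique case the rest of the design must handle.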
@@ -107,10 +108,9 @@ The index needs to be updated on inserts, updates and deletes to the base table.
the base table could be non-unique, this process differs significantly compared to RLI.
##### Inserts (on the base table)
-Newly inserted row's record-key and secondary-key is required to build the secondary-index entry. The record key is
-already stored in the `WriteStatus` and commit metadata has the files touched by that commit. `WriteStatus` will be enhanced to store the secondary-key values (for all
-those columns on which secondary index is defined). The metadata writer will extract this information and write it out
-to the secondary index partition. [1]
+Newly inserted row's record-key and secondary-key are required to build the secondary-index entry. The commit metadata has the files affected by that commit.
+The metadata writer will extract the newly written records based on the commit metadata and generate the secondary-key values (for all
+those columns on which secondary index is defined) to the secondary index partition. [1]
##### Updates (on the base table)
Similar to inserts, the `secondary-key -> record-key` tuples are extracted from the WriteStatus. However, additional
@@ -131,7 +131,7 @@ Another key observation here is that `old-secondary-key` is required to construc
data systems, Hudi does not read the old-image of a row on updates until a merge is executed. It detects that a row is getting updated by simply
reading the index and appending the updates in log files. Hence, there needs to be a mechanism to extract `old-secondary-key`. We propose
`old-secondary-key` to be extracted by scanning the MDT partition (hosting the SI) and doing a reverse lookup based
-on the `record-key` of the row being updated. It should be noted that this is going to be expensive operation as the
+on the `record-key` of the row being updated. It should be noted that this might be an expensive operation as the
base table grows in size (which inherently means that SI will grow in size) in terms of number of rows. One way to
optimize this is to build a reverse mapping `record-key -> secondary-key` in a different MDT partition. This is
left as a TBD (as of this writing).
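The reverse lookup proposed above can be sketched like this. It is an editor's illustration under stated assumptions: the in-memory list stands in for a scan of the MDT partition, and `ReverseLookup`/`SiEntry` are hypothetical names, not Hudi classes.

```java
import java.util.List;
import java.util.Optional;

// Sketch of the proposed reverse lookup: scan SI tuples to recover the old
// secondary key of the record being updated.
public class ReverseLookup {

  // One `secondary-key -> record-key` entry of the SI partition.
  public record SiEntry(String secondaryKey, String recordKey) {}

  // A linear pass over the SI partition; this full scan is exactly the cost
  // the text flags, motivating a record-key -> secondary-key reverse mapping.
  public static Optional<String> oldSecondaryKey(List<SiEntry> siPartition, String recordKey) {
    return siPartition.stream()
        .filter(e -> e.recordKey().equals(recordKey))
        .map(SiEntry::secondaryKey)
        .findFirst();
  }
}
```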
@@ -153,8 +153,8 @@ records is used to identify candidate records that need to be merged. The 'key'
`record-key` and by definition it is unique. But, the keys for secondary index entries are the `secondary-keys` which
can be non-unique. Hence, the merging of SI entries will make use of the payload i.e `record-key` in the
`secondary-key -> record-key` tuple to identify candidate records that need to be merged. It will also be guided by the
-tombstone record emitted during update or deletes. An example is provided here on how the different log files are merged and how the merged log
-records are finally merged with the base file to obtain the merged records (of the MDT partition hosting SI).
+tombstone record emitted during update or deletes. An example is provided here on how the different log files are merged
+and how the merged log records are finally merged with the base file to obtain the merged records (of the MDT partition hosting SI).
Consider the following table, `trips_table`. Note that this table is only used to illustrate the merging logic and not
to be used as a definitive table for other consideration (for example, the performance aspect of some of the algorithm
@@ -215,98 +215,11 @@ on `secondary-key` and second search will be based on `record-key`. Hence, a sin
uses a flat array will not be efficient.
4. Should allow for efficient insertion of records (for inserting merged record and for buffering fresh records).
-The [initial POC](https://github.com/apache/hudi/pull/10625) makes use of an in-memory nested maps - with the first level keyed by `secondary-key`
-and the second level keyed by `record-key`. However, the final design should allow spilling to disk.
-
-Considering the above requirements, the proposal is to introduce a new class hierarchy for handling merge keys in a more
-flexible and decoupled manner. It adds the `HoodieMergeKey` interface, along with two
-implementations: `HoodieSimpleMergeKey` and `HoodieCompositeMergeKey`.
-```java
-public interface HoodieMergeKey extends Serializable {
-
-  /**
-   * Get the partition path.
-   */
-  String getPartitionPath();
-
-  /**
-   * Get the record key.
-   */
-  Serializable getRecordKey();
-
-  /**
-   * Get the hoodie key.
-   * For simple merge keys, this is used to directly fetch the HoodieKey, which is a combination of record key and partition path.
-   */
-  default HoodieKey getHoodieKey() {
-    return new HoodieKey(getRecordKey().toString(), getPartitionPath());
-  }
-}
-```
-
-`HoodieSimpleMergeKey` simply wraps `HoodieKey` for existing scenarios where the key is a
-string. `HoodieCompositeMergeKey` allows for complex types as keys, enhancing flexibility for scenarios where a simple
-string key is not sufficient.
-
-```java
-public class HoodieSimpleMergeKey implements HoodieMergeKey {
-
-  private final HoodieKey simpleKey;
-
-  public HoodieSimpleMergeKey(HoodieKey simpleKey) {
-    this.simpleKey = simpleKey;
-  }
-
-  @Override
-  public String getPartitionPath() {
-    return simpleKey.getPartitionPath();
-  }
-
-  @Override
-  public Serializable getRecordKey() {
-    return simpleKey.getRecordKey();
-  }
-
-  public HoodieKey getHoodieKey() {
-    return simpleKey;
-  }
-}
-
-public class HoodieCompositeMergeKey<K extends Serializable> implements HoodieMergeKey {
-
-  private final K compositeKey;
-  private final String partitionPath;
-
-  public HoodieCompositeMergeKey(K compositeKey, String partitionPath) {
-    this.compositeKey = compositeKey;
-    this.partitionPath = partitionPath;
-  }
-
-  @Override
-  public String getPartitionPath() {
-    return partitionPath;
-  }
-
-  @Override
-  public Serializable getRecordKey() {
-    return compositeKey;
-  }
-}
-```
-
-We also introduce a new `HoodieRecordMerger` implementation based on `HoodieMergeKey`. For other keys, it falls back to
-merge method of parent class. The new record merger will be used in `HoodieMergedLogRecordScanner` to merge records from
-MDT partition hosting SI.
-
-The primary advantage of this approach is that we do not leak any details to the upper layers such as merge handle.
-However, `HoodieMetadataLogRecordReader` should create the `HoodieMergedLogRecordScanner` with the
-correct `HoodieRecordMerger` implementation, instead of
-the [current record merger](https://github.com/apache/hudi/blob/cb6eb6785fdeb88e66016a2b8c0c6e6fa184b309/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataLogRecordReader.java#L156).
-
-These changes do not affect existing functionalities that do not rely on merge keys. It introduces additional classes
-that are used explicitly for new functionalities involving various key types in merging operations. This ensures minimal
-to no risk for existing processes.
+This is achieved by the following efficient encoding of the reverse mapping of secondary values to record keys in the
+MDT partition. We exploit a key observation that it's enough to merge SI entries and tombstones with a tuple key
+`{secondaryKey, recordKey}` through the existing spillable/merge map implementation. We store a flattened version of the
+logical multimap as a key-value format with key `secondaryKey+recordKey` and value `isDeleted: false|true` indicating
+whether this is a tombstone or an insert of the SI entry.
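The flattened encoding and tombstone-guided merge described above can be sketched as follows. This is a minimal editor's sketch: the class name, the separator choice, and the in-memory `TreeMap` (standing in for the spillable/merge map) are illustrative assumptions, not Hudi's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Flattened multimap: key `secondaryKey + SEP + recordKey` -> isDeleted flag.
public class SecondaryIndexMap {
  private static final char SEP = '\u0000'; // separator assumed absent from key values

  private final TreeMap<String, Boolean> entries = new TreeMap<>();

  private static String key(String secondaryKey, String recordKey) {
    return secondaryKey + SEP + recordKey;
  }

  // An insert on the base table adds a live SI entry.
  public void insert(String secondaryKey, String recordKey) {
    entries.put(key(secondaryKey, recordKey), false);
  }

  // An update/delete emits a tombstone for the old secondary key value.
  public void tombstone(String oldSecondaryKey, String recordKey) {
    entries.put(key(oldSecondaryKey, recordKey), true);
  }

  // Merge newer log entries over older ones: the later write wins per tuple
  // key, mirroring how merged log records are applied over the base file.
  public void mergeFrom(SecondaryIndexMap newer) {
    entries.putAll(newer.entries);
  }

  // Lookup: range-scan the flattened keys sharing the secondaryKey prefix,
  // skip tombstones, and strip the prefix to recover the record keys.
  public List<String> lookup(String secondaryKey) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Boolean> e
        : entries.subMap(secondaryKey + SEP, secondaryKey + SEP + Character.MAX_VALUE).entrySet()) {
      if (!e.getValue()) {
        result.add(e.getKey().substring(secondaryKey.length() + 1));
      }
    }
    return result;
  }
}
```

Because the tuple key is unique even for non-unique secondary values, ordinary last-writer-wins merging suffices, which is the observation the paragraph above relies on.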
### Comparing alternate design proposals
@@ -315,8 +228,7 @@ Here are some alternate options that we considered:
1. Extend Hudi's `ExternalSpillableMap` to support multi-map. More significant refactoring is required and it would have
   leaked implementation details to the write handle layer, as the records held by `ExternalSpillableMap` is exposed to
   write handle via `HoodieMergedLogRecordScanner::getRecords`.
-2. Write spillable version
-   of [Guava's multi-map](https://github.com/google/guava/wiki/NewCollectionTypesExplained#multimap). Apart from reason
+2. Write spillable version of [Guava's multi-map](https://github.com/google/guava/wiki/NewCollectionTypesExplained#multimap). Apart from reason
   mentioned above, we did not want to add a third-party dependency on Guava.
3. Use [Chronicle map](https://github.com/OpenHFT/Chronicle-Map). Same reasons as above.
4. Use two different spillable data structures - one is a set of `secondary-key` and the other is map of
@@ -332,6 +244,28 @@ result set.
3. Indexing strategy should be accompanied by performance test results showing its benefits on the query path (and optionally
   overhead on the index maintenance (write) path)
+### SparkSQL Benchmark
+
+Benchmarks of the implementation show some impressive gains.
+
+Table used - web\_sales (from 10 TB tpc-ds dataset)
+Total File groups - 286,603
+Total Records - 7,198,162,544
+Cardinality of Secondary index column ~ 1:150 (not too high, not too low)
+
+| Run | Query latency w/o data skipping (secs) | Query latency w/ Data Skipping (leveraging SI) (secs) | Improvement |
+| --- | --- | --- | --- |
+| 1 | 252 | 31 | ~88% |
+| 2 | 214 | 10 | ~95% |
+| 3 | 204 | 9 | ~95% |
+
+| | Stats w/o data skipping | Stats w/ Data Skipping (leveraging SI) |
+| --- | --- | --- |
+| Number of Files read | 286,603 | 150 |
+| Size of files read | 759 GB | 593 MB |
+| Total Number of Rows read | 7,198,162,544 | 5,811,187 |
+
+
## Future enhancements and roadmap
The feature can evolve to provide additional functionalities.