bhasudha commented on code in PR #9346:
URL: https://github.com/apache/hudi/pull/9346#discussion_r1291685815
##########
website/docs/indexing.md:
##########
@@ -20,34 +24,90 @@ _Figure: Comparison of merge cost for updates (yellow blocks) against base files
 
 ## Index Types in Hudi
 
-Currently, Hudi supports the following indexing options.
-
-- **Bloom Index (default):** Employs bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges.
-- **Simple Index:** Performs a lean join of the incoming update/delete records against keys extracted from the table on storage.
-- **HBase Index:** Manages the index mapping in an external Apache HBase table.
+Currently, Hudi supports the following index types. The default is SIMPLE on the Spark engine, and INMEMORY on the Flink and Java
+engines.
+
+- **BLOOM:** Employs bloom filters built out of the record keys, optionally also pruning candidate files using
+  record key ranges. Key uniqueness is enforced inside partitions.
+- **GLOBAL_BLOOM:** Employs bloom filters built out of the record keys, optionally also pruning candidate files using
+  record key ranges. Key uniqueness is enforced across all partitions in the table.
+- **SIMPLE (default for Spark engines):** Performs a lean join of the incoming update/delete records against keys extracted from the table on
+  storage. Key uniqueness is enforced inside partitions.
+- **GLOBAL_SIMPLE:** Performs a lean join of the incoming update/delete records against keys extracted from the table on
+  storage. Key uniqueness is enforced across all partitions in the table.
+- **HBASE:** Manages the index mapping in an external Apache HBase table.
+- **INMEMORY (default for Flink and Java):** Uses an in-memory hashmap on the Spark and Java engines, and Flink's in-memory state on Flink, for indexing.
+- **BUCKET:** Employs bucket hashing to locate the file group containing the records. Particularly beneficial at
+  large scale. Use `hoodie.index.bucket.engine` to choose the bucket engine type, i.e., how buckets are generated:
+  - `SIMPLE` (default): Uses a fixed number of buckets for file groups, which cannot shrink or expand. This works for both COW and
+    MOR tables.
+  - `CONSISTENT_HASHING`: Supports a dynamic number of buckets with bucket resizing to properly size each bucket. This
+    solves the potential data skew problem in the SIMPLE engine type, where one bucket can be significantly larger than others.

Review Comment:
   rephrased it.
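For readers following along with the docs change: the index type described above is selected through Hudi write options. A minimal sketch (not part of this PR's diff; the table name, path, and bucket count below are placeholder values, while the option keys are standard Hudi configs):

```python
# Sketch of choosing a Hudi index type via Spark datasource write options.
# "hoodie.index.type" accepts the types listed in the docs diff:
# BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, HBASE, INMEMORY, BUCKET.
hudi_options = {
    "hoodie.table.name": "example_table",                # placeholder table name
    "hoodie.index.type": "BUCKET",                       # use bucket hashing to locate file groups
    "hoodie.index.bucket.engine": "CONSISTENT_HASHING",  # or SIMPLE (the default engine)
    "hoodie.bucket.index.num.buckets": "256",            # bucket count (initial count for consistent hashing)
}

# With a SparkSession and DataFrame `df` in scope, the options would be applied as:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/example_table")
```

Under the SIMPLE bucket engine the bucket count above is fixed for the table's lifetime, whereas CONSISTENT_HASHING treats it as a starting point that resizing can adjust.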
