[GitHub] [hudi] bhasudha commented on a diff in pull request #9346: [DOCS] Update Indexing page with all index types and file layout page

via GitHub Wed, 02 Aug 2023 12:14:48 -0700


bhasudha commented on code in PR #9346:
URL: https://github.com/apache/hudi/pull/9346#discussion_r1282317869



##########
website/docs/indexing.md:
##########
@@ -20,34 +22,79 @@ _Figure: Comparison of merge cost for updates (yellow 
blocks) against base files
 
 ## Index Types in Hudi
 
-Currently, Hudi supports the following indexing options.
-
-- **Bloom Index (default):** Employs bloom filters built out of the record 
keys, optionally also pruning candidate files using record key ranges.
-- **Simple Index:** Performs a lean join of the incoming update/delete records 
against keys extracted from the table on storage.
-- **HBase Index:** Manages the index mapping in an external Apache HBase table.
+Currently, Hudi supports the following index types. The default is SIMPLE on 
the Spark engine, and INMEMORY on the Flink and Java 
+engines.
+
+- **BLOOM:** Employs bloom filters built out of the record keys, optionally 
also pruning candidate files using 
+  record key ranges. Key uniqueness is enforced inside partitions.
+- **GLOBAL_BLOOM:** Employs bloom filters built out of the record keys, 
optionally also pruning candidate files using 
+  record key ranges. Key uniqueness is enforced across all partitions in the 
table.
+- **SIMPLE (default for the Spark engine):** Performs a lean join of the 
incoming update/delete records against keys extracted from the table on 
+  storage. Key uniqueness is enforced inside partitions.
+- **GLOBAL_SIMPLE:** Performs a lean join of the incoming update/delete 
records against keys extracted from the table on
+  storage. Key uniqueness is enforced across all partitions in the table.
+- **HBASE:** Manages the index mapping in an external Apache HBase table.
+- **INMEMORY (default for Flink and Java):** Uses an in-memory hashmap in the 
Spark and Java engines, and Flink in-memory state in Flink, for indexing.
+- **BUCKET:** Employs bucket hashing to locate the file group containing the 
records. Particularly beneficial at 
+  large scale. Use `hoodie.index.bucket.engine` to choose the bucket engine 
type, i.e., how buckets are generated:
+  - `SIMPLE` (default): Uses a fixed number of buckets for file groups, which 
cannot shrink or expand. This works for both COW and 
+     MOR tables.
+  - `CONSISTENT_HASHING`: Supports a dynamic number of buckets, with bucket 
resizing to properly size each bucket. This 
+     solves the potential data skew problem where one bucket can be 
significantly larger than others in the SIMPLE engine type. 
+     This only works with MOR tables.
+- **RECORD_INDEX:** An index that saves the record key to location mappings in 
the Hudi metadata table. Record index is a 
+  global index, enforcing key uniqueness across all partitions in the table. 
Supports sharding to achieve very high scale.
 - **Bring your own implementation:** You can extend this [public 
API](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java)
 
 to implement custom indexing.
 
 Writers can pick one of these options using `hoodie.index.type` config option. 
Additionally, a custom index implementation can also be employed
 using `hoodie.index.class` and supplying a subclass of `SparkHoodieIndex` (for 
Apache Spark writers)
 
+### Global and Non-Global Indexes
+
 Another key aspect worth understanding is the difference between global and 
non-global indexes. Both bloom and simple index have
-global options - `hoodie.index.type=GLOBAL_BLOOM` and 
`hoodie.index.type=GLOBAL_SIMPLE` - respectively. HBase index is by nature a 
global index.
+global options - `hoodie.index.type=GLOBAL_BLOOM` and 
`hoodie.index.type=GLOBAL_SIMPLE` - respectively. Record index and 
+HBase index are by nature global indexes.
 
 - **Global index:** Global indexes enforce uniqueness of keys across all 
partitions of a table, i.e., they guarantee that exactly
-  one record exists in the table for a given record key. Global indexes offer 
stronger guarantees, but the update/delete cost grows
-  with size of the table `O(size of table)`, which might still be acceptable 
for smaller tables.
+  one record exists in the table for a given record key. Global indexes offer 
stronger guarantees, but the update/delete 
+  cost can still grow with the size of the table `O(size of table)`, which 
might still be acceptable for smaller tables. For
+  larger tables, a newly added index - Record Level Index (RLI) - can be 
leveraged for fast upsert/delete performance. RLI 

Review Comment:
   @nsivabalan  please review this part for RLI and suggest edits
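
   For anyone checking the option names in this hunk, here is a minimal Python sketch of how the documented settings fit together. It is an illustration only, not Hudi code: `build_index_options`, `is_global`, and the `"COW"`/`"MOR"` labels are hypothetical names introduced here. Only the keys `hoodie.index.type` and `hoodie.index.bucket.engine`, the listed index type values, and the rule that `CONSISTENT_HASHING` works with MOR tables only are taken from the diff above.

```python
# Hypothetical helper names; only the config keys and values quoted in the
# diff above come from the docs under review.
INDEX_TYPES = {
    "BLOOM", "GLOBAL_BLOOM", "SIMPLE", "GLOBAL_SIMPLE",
    "HBASE", "INMEMORY", "BUCKET", "RECORD_INDEX",
}
# Only the types the text explicitly describes as global.
GLOBAL_INDEX_TYPES = {"GLOBAL_BLOOM", "GLOBAL_SIMPLE", "HBASE", "RECORD_INDEX"}
BUCKET_ENGINES = {"SIMPLE", "CONSISTENT_HASHING"}


def build_index_options(index_type, bucket_engine="SIMPLE", table_type="COW"):
    """Return writer options for the chosen index type.

    table_type is a hypothetical label ("COW" or "MOR") used only to check
    the documented restriction on CONSISTENT_HASHING.
    """
    if index_type not in INDEX_TYPES:
        raise ValueError(f"unknown index type: {index_type}")
    options = {"hoodie.index.type": index_type}
    if index_type == "BUCKET":
        if bucket_engine not in BUCKET_ENGINES:
            raise ValueError(f"unknown bucket engine: {bucket_engine}")
        if bucket_engine == "CONSISTENT_HASHING" and table_type != "MOR":
            raise ValueError("CONSISTENT_HASHING only works with MOR tables")
        options["hoodie.index.bucket.engine"] = bucket_engine
    return options


def is_global(index_type):
    """True if the text describes this index as enforcing table-wide key uniqueness."""
    return index_type in GLOBAL_INDEX_TYPES
```

   The returned dictionary could then be passed to a writer's options, for example `df.write.format("hudi").options(**opts)` in PySpark, assuming the remaining required Hudi write options are supplied separately.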



