bibhu107 opened a new issue, #11194:
URL: https://github.com/apache/hudi/issues/11194

   
   We request the community to **benchmark Record Level Indexing (RLI) against 
Simple Indexing**. The blog at 
https://hudi.apache.org/blog/2023/11/01/record-level-index/ provides a great 
comparison between RLI and Global Simple Indexing. However, we also need to 
understand how RLI compares with Simple Indexing, since RLI can stand in for 
Simple Indexing in certain use cases, even though it is primarily designed for 
scenarios where record keys are unique across all partitions.
   
   Our current approach is to hash the `ContractId` (`hoodie_record_key`), take 
the first three characters of the hash as the partition, and apply Simple 
Indexing. However, this approach doesn't scale well because of data skew.
   We need to evaluate whether RLI is suitable for our use case. If it isn't, 
we would like suggestions for a better indexing strategy.
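
To make the partitioning scheme above concrete, here is a minimal sketch. The hash function (MD5, hex-encoded) is an assumption, since the issue does not name the one in use; `partition_for` and `partition_counts` are hypothetical helper names.

```python
import hashlib
from collections import Counter


def partition_for(contract_id: str, prefix_len: int = 3) -> str:
    """Derive a partition value from the first characters of the key's hash.

    Assumption: MD5 with hex encoding; the real job may use another hash.
    """
    digest = hashlib.md5(contract_id.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


def partition_counts(contract_ids):
    """Count records per derived partition, to inspect how evenly they spread."""
    return Counter(partition_for(cid) for cid in contract_ids)
```

With a hex digest and a three-character prefix, this yields at most 16^3 = 4096 distinct partitions. Running `partition_counts` over a sample of keys, repeated once per record, is one way to check how uneven the record counts per partition are.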
   
   Note: We currently use Simple Indexing rather than the costlier Global 
Simple Indexing. We would consider adopting RLI if it offers the same or lower 
cost.
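
For anyone setting up the comparison, the two candidates differ mainly in the index-related write options. The sketch below is not a tuned configuration: the partition column name `partition` is hypothetical, and the RLI options assume Hudi 0.14 or later, where the record level index is available.

```python
# Options shared by both variants. Column names are assumptions:
# "ContractId" is the record key from this issue, "partition" is hypothetical.
common_options = {
    "hoodie.datasource.write.recordkey.field": "ContractId",
    "hoodie.datasource.write.partitionpath.field": "partition",
}

# Variant 1: partition-scoped Simple Indexing (the current setup).
simple_index_options = {
    **common_options,
    "hoodie.index.type": "SIMPLE",
}

# Variant 2: Record Level Index, stored in the metadata table (Hudi 0.14+).
record_index_options = {
    **common_options,
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.metadata.record.index.enable": "true",
}
```

Either dictionary can be passed to a Spark writer, for example `df.write.format("hudi").options(**record_index_options)`, along with the usual table name and write operation settings.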
   
   **Environment Description**
   
   * Hudi version : Currently using 0.7; might migrate to 0.14
   
   * Spark version : 3.3.1
   
   * Hive version : ApacheHive-3.1.3
   
   * Hadoop version : Hadoop-3.3.4
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context** : Running on EMR-EC2
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
