bibhu107 opened a new issue, #11194: URL: https://github.com/apache/hudi/issues/11194
We request that the community **benchmark Record Level Indexing (RLI) against Simple Indexing**.

The blog at https://hudi.apache.org/blog/2023/11/01/record-level-index/ provides a great comparison between RLI and Global Simple Indexing. However, we also need to understand how RLI compares with non-global Simple Indexing, as RLI can be used in place of Simple Indexing in certain use cases, even though it is primarily designed for scenarios where record keys are unique across all partitions.

Our current approach is to hash the `ContractId` (`hoodie_record_key`), take the first three characters of the hash as the partition, and apply Simple Indexing. However, this approach does not scale well because of data skew.

The question is whether RLI is suitable for our use case. If it is not, we would appreciate suggestions for a better indexing strategy.

Note: We currently use Simple Indexing instead of the costly Global Simple Indexing. We can consider adopting RLI if it offers the same or a lower cost.

**Environment Description**

* Hudi version : Currently 0.7; might migrate to 0.14
* Spark version : 3.3.1
* Hive version : ApacheHive-3.1.3
* Hadoop version : Hadoop-3.3.4
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Additional context**

Running on EMR-EC2.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
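For readers comparing the two setups, the partitioning scheme described in the issue (hash the record key, keep a three-character prefix as the partition) might look roughly like the sketch below. The hash function is not stated in the issue, so MD5 hex digests are an assumption, and `partition_for` and the `partition` column name are hypothetical. The option keys are the Hudi 0.14 writer configs for Simple Indexing and RLI; they are shown as plain dictionaries and are not a tested benchmark configuration.

```python
import hashlib


def partition_for(contract_id: str, prefix_len: int = 3) -> str:
    """Hypothetical helper: hash the record key and keep a short prefix.

    Assumes an MD5 hex digest; the issue does not say which hash is used.
    """
    digest = hashlib.md5(contract_id.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


# Writer options for the current setup: Simple Index scoped to each partition.
# "partition" is an assumed name for the derived partition column.
SIMPLE_INDEX_OPTS = {
    "hoodie.datasource.write.recordkey.field": "ContractId",
    "hoodie.datasource.write.partitionpath.field": "partition",
    "hoodie.index.type": "SIMPLE",
}

# Same table layout, switched to the Record Level Index (Hudi 0.14+),
# which is stored in the metadata table.
RLI_OPTS = {
    **SIMPLE_INDEX_OPTS,
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.record.index.enable": "true",
}

if __name__ == "__main__":
    # "C-1001" is a made-up ContractId used only for illustration.
    print(partition_for("C-1001"))
```

Either dictionary could be passed to a Spark writer via `.options(**opts)`, which would allow the same upsert workload to be run against both index types for a cost comparison.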
