[
https://issues.apache.org/jira/browse/HUDI-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raymond Xu updated HUDI-1503:
-----------------------------
Epic Link: HUDI-3000
> Implement a Hash(Bucket)-based Index
> ------------------------------------
>
> Key: HUDI-1503
> URL: https://issues.apache.org/jira/browse/HUDI-1503
> Project: Apache Hudi
> Issue Type: Wish
> Components: index, Performance
> Reporter: Shimin Yang
> Priority: Major
>
> This ticket introduces a new hash-based index, which can improve the
> performance of write operations and speed up queries at the same time
> (by removing the shuffle for Spark/Hive).
> The new hash-based index works with a customized hash-based partitioner,
> which partitions records based on the hash value of the index keys and a
> fixed number of buckets. There is therefore no need to visit the existing
> files to determine which file group each record belongs to.
> Meanwhile, the file group ID, hash mode, and bucket number can be used by
> query engines to eliminate the shuffle introduced by aggregations and joins.
> We have implemented a HoodieIndex based on the Hive hash function, which is
> used in ByteDance's production environment for many very large datasets,
> and we hope to contribute this feature to the community soon.
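
For readers unfamiliar with the idea, the bucket assignment described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the HoodieIndex implementation from the ticket: the names bucket_id, partition_records and NUM_BUCKETS are hypothetical, and CRC32 stands in for the Hive hash function, which is not reproduced here.

```python
import zlib
from collections import defaultdict

# Hypothetical fixed bucket count; the ticket does not state a value.
NUM_BUCKETS = 8


def bucket_id(index_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an index key to a bucket using only the key and the bucket count.

    CRC32 is a stand-in for the Hive hash function mentioned in the ticket.
    Any deterministic hash works for the purpose of this sketch.
    """
    return zlib.crc32(index_key.encode("utf-8")) % num_buckets


def partition_records(records, key_field, num_buckets=NUM_BUCKETS):
    """Group records by bucket without looking at any existing files."""
    buckets = defaultdict(list)
    for record in records:
        key = str(record[key_field])
        buckets[bucket_id(key, num_buckets)].append(record)
    return dict(buckets)


if __name__ == "__main__":
    sample = [
        {"uuid": "a-1", "value": 10},
        {"uuid": "b-2", "value": 20},
        {"uuid": "a-1", "value": 30},  # same key, so same bucket as the first
    ]
    for bucket, rows in sorted(partition_records(sample, "uuid").items()):
        print(bucket, rows)
```

Because the bucket depends only on the key and the fixed bucket count, a writer can route a record without an index lookup. Two datasets bucketed the same way on the same key keep matching keys in same-numbered buckets, which is what would let a query engine skip the shuffle for a join or aggregation on that key.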