[ 
https://issues.apache.org/jira/browse/HUDI-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1503:
-----------------------------
    Epic Link: HUDI-3000

> Implement a Hash(Bucket)-based Index
> ------------------------------------
>
>                 Key: HUDI-1503
>                 URL: https://issues.apache.org/jira/browse/HUDI-1503
>             Project: Apache Hudi
>          Issue Type: Wish
>          Components: index, Performance
>            Reporter: Shimin Yang
>            Priority: Major
>
> This ticket proposes a new hash-based index that can improve the
> performance of write operations and speed up queries at the same time
> (by removing the shuffle for Spark/Hive).
> The new hash-based index works with a customized hash-based partitioner,
> which partitions records based on the hash value of the index keys and a
> fixed number of buckets. There is therefore no need to read the existing
> files to determine which file group each record belongs to.
> Meanwhile, query engines can use the file group ID, hash mode, and bucket
> number to eliminate the shuffle introduced by aggregations and joins.
> We have implemented a HoodieIndex based on the Hive hash function, which is
> used in ByteDance's production environment for many very large datasets,
> and we hope to contribute this feature to the community soon.
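
To make the bucket assignment described above concrete, here is a minimal sketch in Python. It is an illustration only, not the implementation from the ticket: the names `bucket_id`, `partition_records`, and `NUM_BUCKETS` are hypothetical, and `zlib.crc32` is used as a stand-in for the Hive hash function, whose exact behavior is not reproduced here. The point it shows is that the target bucket follows from the index key and the fixed bucket count alone, so no existing files have to be consulted.

```python
import zlib
from collections import defaultdict

# Hypothetical fixed bucket count; the ticket does not specify a value.
NUM_BUCKETS = 8


def bucket_id(index_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an index key to a bucket with a deterministic hash.

    zlib.crc32 is an assumption standing in for the Hive hash function.
    """
    return zlib.crc32(index_key.encode("utf-8")) % num_buckets


def partition_records(records, key_field="id", num_buckets=NUM_BUCKETS):
    """Group records by bucket without looking at any existing files."""
    buckets = defaultdict(list)
    for record in records:
        buckets[bucket_id(str(record[key_field]), num_buckets)].append(record)
    return dict(buckets)


if __name__ == "__main__":
    sample = [{"id": "a", "v": 1}, {"id": "b", "v": 2}, {"id": "a", "v": 3}]
    for bucket, rows in sorted(partition_records(sample).items()):
        print(bucket, rows)
```

Because the same key always hashes to the same bucket, two datasets bucketed on the same key with the same bucket count can in principle be joined bucket by bucket, which is the shuffle elimination the description refers to.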



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
