[
https://issues.apache.org/jira/browse/HUDI-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269828#comment-17269828
]
Mihir Shah commented on HUDI-1503:
----------------------------------
h4. [Shimin Yang|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=dangdangdang]
Hello Mr. Yang,
I would be interested in working on this issue. Is there any documentation
about the index or the project's design that would help me understand the
problem better?
Thank you!
> Implement a Hash(Bucket)-based Index
> ------------------------------------
>
> Key: HUDI-1503
> URL: https://issues.apache.org/jira/browse/HUDI-1503
> Project: Apache Hudi
> Issue Type: Wish
> Components: Index, Performance
> Reporter: Shimin Yang
> Priority: Major
>
> This ticket is to introduce a new hash-based index, which can improve the
> performance of write operations and speed up queries at the same
> time (by removing the shuffle for Spark/Hive).
> The new hash-based index works with a customized hash-based partitioner,
> which partitions records based on the hash value of the index keys and a
> fixed number of buckets. So there's no need to visit the existing files to
> determine which file group each record belongs to.
> Meanwhile, the file group id, hash mode and bucket number can be used by
> query engines to eliminate the shuffle introduced by aggregations and joins.
> We implemented a HoodieIndex based on the Hive hash function, which is used
> in ByteDance's production environment for many very large datasets, and
> we hope this feature can be contributed to the community soon.
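
For readers new to the idea, the bucket assignment described in the ticket can be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not Hudi's implementation: the names `bucket_id`, `partition_records` and `bucketed_join` are hypothetical, and CRC32 stands in for the Hive hash function mentioned above. It shows that a record's bucket follows from its key and the fixed bucket count alone, and that two datasets bucketed the same way can be joined bucket by bucket.

```python
import zlib


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket in [0, num_buckets).

    CRC32 is an assumption made for this sketch; the ticket itself
    refers to a Hive hash function.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # The hash depends only on the key, so no existing file has to be
    # read to decide where a record belongs.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def partition_records(keys, num_buckets):
    """Group keys by bucket, as a hash-based partitioner would."""
    buckets = {b: [] for b in range(num_buckets)}
    for key in keys:
        buckets[bucket_id(key, num_buckets)].append(key)
    return buckets


def bucketed_join(left_keys, right_keys, num_buckets):
    """Join two key sets bucket by bucket.

    Equal keys always land in the same bucket, so each bucket can be
    joined locally without redistributing (shuffling) records.
    """
    left = partition_records(left_keys, num_buckets)
    right = partition_records(right_keys, num_buckets)
    matches = []
    for b in range(num_buckets):
        right_set = set(right[b])
        matches.extend(k for k in left[b] if k in right_set)
    return matches
```

The bucket count has to stay fixed for this to work: changing `num_buckets` changes the bucket of most keys, which is why the ticket speaks of a fixed bucket number.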
--
This message was sent by Atlassian Jira
(v8.3.4#803005)