danny0405 commented on code in PR #12884:
URL: https://github.com/apache/hudi/pull/12884#discussion_r1978402746
##########
rfc/rfc-89/rfc-89.md:
##########
@@ -0,0 +1,345 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-89: Partition Level Bucket Index
+
+## Proposers
+- @zhangyue19921010
+
+## Approvers
+- @danny0405
+- @codope
+- @xiarixiaoyao
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-8990
+
+## Abstract
+
+As we know, Hudi proposed and introduced Bucket Index in RFC-29. Bucket Index can well unify the indexes of Flink and
+Spark, that is, Spark and Flink could upsert the same Hudi table using bucket index.
+
+However, Bucket Index has a limit of fixed number of buckets. In order to solve this problem, RFC-42 proposed the ability
+of consistent hashing achieving bucket resizing by splitting or merging several local buckets dynamically.
+
+But from PRD experience, sometimes a Partition-Level Bucket Index and a offline way to do bucket rescale is good enough
+without introducing additional efforts (multiple writes, clustering, automatic resizing,etc.). Because the more complex
+the Architecture, the more error-prone it is and the greater operation and maintenance pressure.
+
+In this regard, we could upgrade the traditional Bucket Index to implement a Partition-Level Bucket Index, so that users
+can set a specific number of buckets for different partitions through a rule engine (such as regular expression matching).
+On the other hand, for a certain existing partitions, an off-line command is provided to reorganized the data using insert

Review Comment:
`reorganized` -> `reorganize`. Also, is it possible to automate the data rewrite process?
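
For readers following the thread, the rule-engine idea in the quoted abstract (a per-partition bucket count chosen by regular expression matching) could be sketched roughly as below. This is only an illustration under assumptions: `RULES`, `DEFAULT_BUCKETS`, `bucket_count` and `bucket_id` are hypothetical names, not Hudi APIs or configs, and CRC32 merely stands in for whatever hash the real bucket index uses.

```python
import re
import zlib

# Hypothetical rule list: (partition-path regex, bucket count). First match wins.
RULES = [
    (re.compile(r"dt=2024-11-1[01]"), 256),  # e.g. two peak-traffic days
    (re.compile(r"dt=2024-.*"), 64),         # the rest of 2024
]
DEFAULT_BUCKETS = 16  # assumed fallback when no rule matches


def bucket_count(partition_path: str) -> int:
    """Resolve the number of buckets for a partition from the rule list."""
    for pattern, buckets in RULES:
        if pattern.fullmatch(partition_path):
            return buckets
    return DEFAULT_BUCKETS


def bucket_id(record_key: str, partition_path: str) -> int:
    """Map a record key to a bucket inside its partition.

    CRC32 is a stand-in hash chosen because it is deterministic across runs.
    """
    key_hash = zlib.crc32(record_key.encode("utf-8"))
    return key_hash % bucket_count(partition_path)
```

With rules like these, changing the bucket count of an existing partition changes `bucket_id` for most keys, which is why already-written partitions need the data rewrite asked about above.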
