fhan688 opened a new pull request, #3316: URL: https://github.com/apache/fluss/pull/3316
### Purpose

<!-- Linking this pull request to the issue -->
Linked issue: https://github.com/apache/fluss/issues/3274

Introduce Hudi's bucketing strategy into Fluss so that the Fluss server/client can compute the same bucket id as Hudi's BucketIdentifier when tiering data into a Hudi table with bucket index. This is a prerequisite for the upcoming HudiLakeWriter and HudiCompaction PRs, which need to route records to the correct Hudi bucket file.

### Brief change log

fluss-common (production code)

- BucketingFunction.of(...): add DataLakeFormat.HUDI branch that returns HudiBucketingFunction.
- HudiBucketingFunction: implements BucketingFunction. Decodes a 4-byte big-endian int produced by HudiKeyEncoder and computes (hash & Integer.MAX_VALUE) % numBuckets, matching Hudi's BucketIdentifier.getBucketId(List<String>, int). Includes strict input validation (bucketKey must be exactly 4 bytes, numBuckets must be positive).
- KeyEncoder.createKeyEncoder(...): add DataLakeFormat.HUDI branch that returns HudiKeyEncoder.
- HudiKeyEncoder: implements KeyEncoder. Computes List<String>.hashCode() inline (h = 31*h + elementStringHash) over the stringified key fields, avoiding intermediate ArrayList/String.valueOf allocations on the hot path. For common numeric types (int, long, byte, short, boolean) the string hash code is computed without materializing the string. Null fields are encoded as "__null__" placeholder (aligned with Hudi's KeyGenUtils.NULL_RECORDKEY_PLACEHOLDER) to avoid collision with the literal string "null".

fluss-lake-hudi (test & build)

- HudiBucketingFunctionTest: 13 test cases covering:
  1. Single-field types: INT, BIGINT, STRING, DECIMAL, TIMESTAMP_NTZ
  2. Additional types: BOOLEAN, TINYINT, SMALLINT, FLOAT, DATE, TIME, TIMESTAMP_LTZ
  3. Composite (multi-field) bucket keys with and without null fields
  4. Null field uses placeholder (not literal "null"), regression test
  5. Illegal input: bucketKey null / wrong length / numBuckets ≤ 0
  6. Boundary: numBuckets=1, Integer.MIN_VALUE hash, negative hash sign-bit handling
  7. All tests cross-validate against Hudi's BucketIdentifier.getBucketId(List<String>, int)
- pom.xml: add hudi-flink${flink.major.version}-bundle with <scope>test</scope> so it is only available during unit tests and does not leak into the runtime classpath.

### Tests

HudiBucketingFunctionTest (13 test cases, all passing):

- testIntegerHash / testLongHash / testStringHash / testDecimalHash / testTimestampEncodingHash: original single-field coverage
- testNullFieldUsesPlaceholder / testNullFieldDoesNotCollideWithLiteralNullString: null handling
- testBucketingRejectsInvalidBucketKey / testBucketingRejectsNonPositiveNumBuckets: input validation
- testCompositeBucketKeyMatchesHudiFieldValueRecordKey / testCompositeBucketKeyWithNullFieldUsesPlaceholder: multi-field keys
- testBooleanAndIntegralTypes / testDateAndTimeTypes / testTimestampLtzType: type coverage
- testBucketingNumBucketsBoundaryValues: boundary conditions

### API and Format

No API or storage format changes. This PR only adds new implementations behind existing interfaces (BucketingFunction and KeyEncoder) for a new DataLakeFormat.HUDI enum value that was already defined.

### Documentation

No new user-facing documentation required. This is an internal bucketing strategy implementation.
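
For reviewers, the encode-then-bucket flow described in the change log above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `HudiBucketSketch` and the methods `encodeKey` and `bucketId` are hypothetical, and the sketch stringifies fields with `String.valueOf`, which is only an assumption about how each field type is rendered (the PR's encoder avoids materializing strings on the hot path and may format types such as DECIMAL or TIMESTAMP differently).

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; names are illustrative and not taken from the PR.
final class HudiBucketSketch {

    // Placeholder for null fields, as stated in the PR description.
    static final String NULL_PLACEHOLDER = "__null__";

    // Computes List<String>.hashCode() inline over the stringified fields
    // and returns it as a 4-byte big-endian int.
    static byte[] encodeKey(Object... fields) {
        int h = 1; // List.hashCode() starts from 1
        for (Object field : fields) {
            // Assumption: String.valueOf stands in for the real per-type formatting.
            String s = (field == null) ? NULL_PLACEHOLDER : String.valueOf(field);
            h = 31 * h + s.hashCode();
        }
        // ByteBuffer writes big-endian by default.
        return ByteBuffer.allocate(4).putInt(h).array();
    }

    // Decodes the 4-byte key and maps it to a bucket, clearing the sign bit
    // so that negative hashes never produce a negative bucket id.
    static int bucketId(byte[] bucketKey, int numBuckets) {
        if (bucketKey == null || bucketKey.length != 4) {
            throw new IllegalArgumentException("bucketKey must be exactly 4 bytes");
        }
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        int hash = ByteBuffer.wrap(bucketKey).getInt();
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        byte[] key = encodeKey(42, "user-1", null);
        System.out.println("bucket id: " + bucketId(key, 16));
    }
}
```

Masking with `Integer.MAX_VALUE` rather than calling `Math.abs` matters for the `Integer.MIN_VALUE` boundary case listed in the tests: `Math.abs(Integer.MIN_VALUE)` is still negative, while the mask maps it to 0.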
