fhan688 opened a new pull request, #3316:
URL: https://github.com/apache/fluss/pull/3316

   ### Purpose
   
   Linked issue: https://github.com/apache/fluss/issues/3274
   
   Introduce Hudi's bucketing strategy into Fluss so that the Fluss 
server/client can compute the same bucket id as Hudi's BucketIdentifier when 
tiering data into a Hudi table with bucket index. This is a prerequisite for 
the upcoming HudiLakeWriter and HudiCompaction PRs, which need to route records 
to the correct Hudi bucket file.
   
   ### Brief change log
   
   fluss-common (production code)
   
   - BucketingFunction.of(...) — add DataLakeFormat.HUDI branch that returns 
HudiBucketingFunction.
   
   - HudiBucketingFunction — implements BucketingFunction. Decodes a 4-byte 
big-endian int produced by HudiKeyEncoder and computes (hash & 
Integer.MAX_VALUE) % numBuckets, matching Hudi's 
BucketIdentifier.getBucketId(List<String>, int). Includes strict input 
validation (bucketKey must be exactly 4 bytes, numBuckets must be positive).
   
   - KeyEncoder.createKeyEncoder(...) — add DataLakeFormat.HUDI branch that 
returns HudiKeyEncoder.
   
   - HudiKeyEncoder — implements KeyEncoder. Computes List<String>.hashCode() 
inline (h = 31*h + elementStringHash) over the stringified key fields, avoiding 
intermediate ArrayList/String.valueOf allocations on the hot path. For common 
primitive types (int, long, byte, short, boolean) the string hash code is 
computed without materializing the string. Null fields are encoded as the 
"__null__" placeholder (aligned with Hudi's 
KeyGenUtils.NULL_RECORDKEY_PLACEHOLDER) to avoid colliding with the literal 
string "null".
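
For reviewers, the encode-then-bucket flow described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class `HudiBucketingSketch` and its method names are hypothetical, the field handling is simplified to plain Java objects instead of Fluss rows, and the placeholder value is assumed to match Hudi's `KeyGenUtils.NULL_RECORDKEY_PLACEHOLDER`.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch; not the actual HudiKeyEncoder / HudiBucketingFunction. */
final class HudiBucketingSketch {
    // Assumption: same value as Hudi's KeyGenUtils.NULL_RECORDKEY_PLACEHOLDER.
    static final String NULL_PLACEHOLDER = "__null__";
    static final int TRUE_HASH = "true".hashCode();
    static final int FALSE_HASH = "false".hashCode();

    /** Returns Long.toString(value).hashCode() without building the String. */
    static int decimalStringHash(long value) {
        int h = value < 0 ? '-' : 0;
        // Work on the negative side so that Long.MIN_VALUE does not overflow.
        long n = value < 0 ? value : -value;
        long div = 1;
        while (n / div <= -10) {
            div *= 10;
        }
        // Emit digits from most significant to least significant.
        for (; div > 0; div /= 10) {
            int digit = (int) -((n / div) % 10);
            h = 31 * h + ('0' + digit);
        }
        return h;
    }

    /** Hash of the stringified field, with a fast path for integral types and booleans. */
    static int fieldHash(Object field) {
        if (field == null) {
            return NULL_PLACEHOLDER.hashCode();
        }
        if (field instanceof Integer || field instanceof Long
                || field instanceof Short || field instanceof Byte) {
            return decimalStringHash(((Number) field).longValue());
        }
        if (field instanceof Boolean) {
            return ((Boolean) field) ? TRUE_HASH : FALSE_HASH;
        }
        return String.valueOf(field).hashCode();
    }

    /** Same recurrence as List<String>.hashCode() over the stringified fields. */
    static int keyHash(Object... fields) {
        int h = 1;
        for (Object f : fields) {
            h = 31 * h + fieldHash(f);
        }
        return h;
    }

    /** Encodes the key hash as a 4-byte big-endian int (ByteBuffer's default order). */
    static byte[] encode(Object... fields) {
        return ByteBuffer.allocate(4).putInt(keyHash(fields)).array();
    }

    /** Decodes the 4-byte key and maps it to a bucket in [0, numBuckets). */
    static int bucketId(byte[] bucketKey, int numBuckets) {
        if (bucketKey == null || bucketKey.length != 4) {
            throw new IllegalArgumentException("bucketKey must be exactly 4 bytes");
        }
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        int hash = ByteBuffer.wrap(bucketKey).getInt();
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }
}
```

With this sketch, `HudiBucketingSketch.bucketId(HudiBucketingSketch.encode(42, null, "abc"), 16)` yields the same value as `(Arrays.asList("42", "__null__", "abc").hashCode() & Integer.MAX_VALUE) % 16`, which is the property the tests check against Hudi's BucketIdentifier.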
   
   fluss-lake-hudi (test & build)
   
   - HudiBucketingFunctionTest — 13 test cases covering:
   
   1. Single-field types: INT, BIGINT, STRING, DECIMAL, TIMESTAMP_NTZ
   
   2. Additional types: BOOLEAN, TINYINT, SMALLINT, FLOAT, DATE, TIME, 
TIMESTAMP_LTZ
   
   3. Composite (multi-field) bucket keys with and without null fields
   
   4. Null field uses placeholder (not literal "null") — regression test
   
   5. Illegal input: bucketKey null / wrong length / numBuckets ≤ 0
   
   6. Boundary: numBuckets=1, Integer.MIN_VALUE hash, negative hash sign-bit 
handling
   
   7. All tests cross-validate against Hudi's 
BucketIdentifier.getBucketId(List<String>, int)
   
   - pom.xml — add hudi-flink${flink.major.version}-bundle with 
<scope>test</scope> so it is only available during unit tests and does not leak 
into the runtime classpath.
   
   ### Tests
   
   HudiBucketingFunctionTest (13 test cases, all passing):
   
   - testIntegerHash / testLongHash / testStringHash / testDecimalHash / 
testTimestampEncodingHash — original single-field coverage
   
   - testNullFieldUsesPlaceholder / 
testNullFieldDoesNotCollideWithLiteralNullString — null handling
   
   - testBucketingRejectsInvalidBucketKey / 
testBucketingRejectsNonPositiveNumBuckets — input validation
   
   - testCompositeBucketKeyMatchesHudiFieldValueRecordKey / 
testCompositeBucketKeyWithNullFieldUsesPlaceholder — multi-field keys
   
   - testBooleanAndIntegralTypes / testDateAndTimeTypes / testTimestampLtzType 
— type coverage
   
   - testBucketingNumBucketsBoundaryValues — boundary conditions
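
As background for the boundary cases, the snippet below sketches why the sign bit is masked with `& Integer.MAX_VALUE` instead of using `Math.abs`. It is an illustration only: `SignBitSketch` and its methods are hypothetical names, and the `Math.abs` variant is shown purely for contrast and is not something this PR or Hudi uses.

```java
/** Hypothetical illustration of sign-bit handling; not part of this PR. */
final class SignBitSketch {
    /** Clears the sign bit first, so the result is always in [0, numBuckets). */
    static int maskedBucket(int hash, int numBuckets) {
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    /** Contrast only: Math.abs(Integer.MIN_VALUE) is still negative, so this can go below zero. */
    static int absBucket(int hash, int numBuckets) {
        return Math.abs(hash) % numBuckets;
    }
}
```

For example, `SignBitSketch.maskedBucket(Integer.MIN_VALUE, 7)` is 0 because masking leaves no bits set, whereas `SignBitSketch.absBucket(Integer.MIN_VALUE, 7)` is negative. With `numBuckets = 1` the masked form always returns 0.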
   
   ### API and Format
   
   No API or storage format changes. This PR only adds new implementations 
behind the existing BucketingFunction and KeyEncoder interfaces for the 
DataLakeFormat.HUDI enum value, which was already defined.
   
   ### Documentation
   
   No new user-facing documentation required. This is an internal bucketing 
strategy implementation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
