[PR] perf(kafka-connect): memoize file id per partition path in the connect writer [hudi]

via GitHub Tue, 16 Jun 2026 00:25:53 -0700


wombatu-kun opened a new pull request, #19016:
URL: https://github.com/apache/hudi/pull/19016


   ### Describe the issue this Pull Request addresses
   
   `AbstractConnectWriter.writeRecord` computes the Hudi file id per record via 
`KafkaConnectUtils.hashDigest(String.format("%s-%s", record.kafkaPartition(), 
partitionPath))`, which builds a formatted string, runs 
`MessageDigest.getInstance("MD5")`, digests, and hex-encodes on every record. 
The file id depends only on `(kafkaPartition, partitionPath)`, and 
`kafkaPartition` is invariant for a given writer because a participant is bound 
to a single `TopicPartition`, so the same digest is recomputed redundantly for 
every record that shares a partition path.
   
   ### Summary and Changelog
   
   Memoize the file id in a `HashMap` keyed by partition path, computing the 
MD5 digest only on the first record seen for each partition path and reusing it 
afterwards. The hashed input stays `kafkaPartition + "-" + partitionPath` 
(concatenation, byte-identical to the previous `String.format("%s-%s", ...)`), 
so the produced file id, which determines on-disk file grouping, is unchanged. 
The cache lives on the writer instance, which is created per commit and used by 
a single thread, so no synchronization is needed. This change is independent of 
#19015 (AvroConvertor reuse); both touch `writeRecord`, so whichever merges 
second needs a trivial rebase.
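
The memoization described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `FileIdCache`, `fileIdFor`, `md5Hex`, and `digestCount` are hypothetical names, and the UTF-8 encoding and lowercase hex output are assumptions about what `KafkaConnectUtils.hashDigest` produces.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-writer cache: one MD5 digest per distinct partition path. */
final class FileIdCache {
  // Fixed for the lifetime of the writer (one TopicPartition per participant).
  private final int kafkaPartition;
  // Plain HashMap: the sketch assumes single-threaded use, as the PR states.
  private final Map<String, String> fileIdByPartitionPath = new HashMap<>();
  // Counts how many digests were actually computed (for illustration only).
  private int digestCount = 0;

  FileIdCache(int kafkaPartition) {
    this.kafkaPartition = kafkaPartition;
  }

  /** Returns the cached file id, computing the digest only on first use. */
  String fileIdFor(String partitionPath) {
    return fileIdByPartitionPath.computeIfAbsent(
        partitionPath, p -> md5Hex(kafkaPartition + "-" + p));
  }

  int digestCount() {
    return digestCount;
  }

  // Assumed stand-in for KafkaConnectUtils.hashDigest: MD5 over UTF-8 bytes,
  // rendered as lowercase hex.
  private String md5Hex(String input) {
    digestCount++;
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        sb.append(String.format("%02x", b & 0xff));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 unavailable", e);
    }
  }

  public static void main(String[] args) {
    FileIdCache cache = new FileIdCache(3);
    cache.fileIdFor("2026/06/16");
    cache.fileIdFor("2026/06/16"); // cache hit, no new digest
    cache.fileIdFor("2026/06/17");
    System.out.println(cache.digestCount()); // prints 2
  }
}
```

Keying the map on the partition path alone is enough only because `kafkaPartition` never changes within one writer; a cache shared across partitions would need both values in the key.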
   
   ### Impact
   
   Performance only; no public API or behavior change. Local JMH 
micro-benchmark of `writeRecord` (AverageTime mode, gc profiler, caching schema 
provider), measured independently against master:
   
   | Path | Baseline ns/op | After ns/op | Baseline B/op | After B/op |
   |------|---------------:|------------:|--------------:|-----------:|
   | AvroConverter | 2574 | 1807 (-30%) | 10602 | 9698 (-904 B) |
   | StringConverter (JSON) | 18419 | 17279 (-6%) | 33040 | 32136 (-904 B) |
   
   The change removes about 904 B per record on both paths, and between 
767 ns (Avro) and 1140 ns (JSON) of measured time per record: the cost of the 
MD5 digest, the hex string, and the `String.format` temporaries. With a typical 
low-cardinality partition layout the digest collapses to one computation per 
partition path per writer instance. Benchmark code is not included in this PR.
   
   ### Risk Level
   
   low
   
   The file id string is byte-identical to before, and the cache is scoped to a 
single-threaded, per-commit writer so it needs no synchronization. Covered by 
the existing `hudi-kafka-connect` unit tests; `TestAbstractConnectWriter` 
validates the produced record keys for both the Avro and JSON paths and passes.
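
The byte-identical claim rests on `String.format("%s-%s", ...)` and plain concatenation producing the same string. A minimal sketch of that check is below; `FileIdInputCheck`, `viaFormat`, and `viaConcat` are hypothetical names that are not part of the PR or its tests, and the partition is assumed to be an `Integer`.

```java
/** Hypothetical check that both ways of building the hashed input agree. */
final class FileIdInputCheck {
  // The previous form: formatted string built on every record.
  static String viaFormat(Integer kafkaPartition, String partitionPath) {
    return String.format("%s-%s", kafkaPartition, partitionPath);
  }

  // The new form: plain concatenation.
  static String viaConcat(Integer kafkaPartition, String partitionPath) {
    return kafkaPartition + "-" + partitionPath;
  }

  public static void main(String[] args) {
    String[] paths = {"2026/06/16", "", "region=eu/day=1"};
    for (String path : paths) {
      // Equal strings yield equal bytes, and therefore equal digests.
      System.out.println(viaFormat(7, path).equals(viaConcat(7, path)));
    }
  }
}
```

Both forms render a `null` argument as the text `null`, so they agree in that edge case as well.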
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
