wombatu-kun opened a new pull request, #19016:
URL: https://github.com/apache/hudi/pull/19016
### Describe the issue this Pull Request addresses
`AbstractConnectWriter.writeRecord` computes the Hudi file id per record via
`KafkaConnectUtils.hashDigest(String.format("%s-%s", record.kafkaPartition(),
partitionPath))`, which builds a formatted string, runs
`MessageDigest.getInstance("MD5")`, digests, and hex-encodes on every record.
The file id depends only on `(kafkaPartition, partitionPath)`, and
`kafkaPartition` is invariant for a given writer because a participant is bound
to a single `TopicPartition`, so the same digest is recomputed redundantly for
every record that shares a partition path.
### Summary and Changelog
Memoize the file id in a `HashMap` keyed by partition path, computing the
MD5 digest only on the first record seen for each partition path and reusing it
afterwards. The hashed input stays `kafkaPartition + "-" + partitionPath`
(concatenation, byte-identical to the previous `String.format("%s-%s", ...)`),
so the produced file id, which determines on-disk file grouping, is unchanged.
The cache lives on the writer instance, which is created per commit and used by
a single thread, so no synchronization is needed. This change is independent of
#19015 (AvroConvertor reuse); both touch `writeRecord`, so whichever merges
second needs a trivial rebase.
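
The memoization described above can be sketched roughly as follows. This is a
minimal, self-contained illustration and not the code in this PR: the class
`FileIdCacheSketch`, the helper `md5Hex` (a stand-in for
`KafkaConnectUtils.hashDigest`, whose exact hex formatting is assumed here), and
the `digestCount` counter are all hypothetical names introduced for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a per-writer file id cache keyed by partition path. */
public class FileIdCacheSketch {
  // Invariant for the lifetime of the writer (one TopicPartition per participant).
  private final int kafkaPartition;
  // Plain HashMap: the sketch assumes single-threaded use, as the PR describes.
  private final Map<String, String> fileIdByPartitionPath = new HashMap<>();
  // Counts how many times the digest was actually computed (illustration only).
  private int digestCount = 0;

  public FileIdCacheSketch(int kafkaPartition) {
    this.kafkaPartition = kafkaPartition;
  }

  /** Assumed stand-in for KafkaConnectUtils.hashDigest: MD5, hex-encoded. */
  static String md5Hex(String input) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        sb.append(String.format("%02X", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available", e);
    }
  }

  /** Returns the cached file id, hashing only on the first call per partition path. */
  public String fileIdFor(String partitionPath) {
    return fileIdByPartitionPath.computeIfAbsent(partitionPath, path -> {
      digestCount++;
      // Same characters as String.format("%s-%s", kafkaPartition, path).
      return md5Hex(kafkaPartition + "-" + path);
    });
  }

  public int digestCount() {
    return digestCount;
  }

  public static void main(String[] args) {
    FileIdCacheSketch cache = new FileIdCacheSketch(3);
    String[] partitionPaths = {"2024/01/01", "2024/01/01", "2024/01/02", "2024/01/01"};
    for (String partitionPath : partitionPaths) {
      System.out.println(partitionPath + " -> " + cache.fileIdFor(partitionPath));
    }
    // Four records, two distinct partition paths, so two digests.
    System.out.println("digests computed: " + cache.digestCount());
  }
}
```

In this sketch, four records spread over two partition paths trigger two digest
computations, and the cached value equals what hashing the
`String.format("%s-%s", ...)` input would produce.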
### Impact
Performance only; no public API or behavior change. Local JMH
micro-benchmark of `writeRecord` (AverageTime mode, gc profiler, caching schema
provider), measured independently against master:
| Path | Baseline ns/op | After ns/op | Baseline B/op | After B/op |
|------|---------------:|------------:|--------------:|-----------:|
| AvroConverter | 2574 | 1807 (-30%) | 10602 | 9698 (-904 B) |
| StringConverter (JSON) | 18419 | 17279 (-6%) | 33040 | 32136 (-904 B) |
The change removes ~904 B per record on both paths and, per the table above,
about 767 ns per record on the Avro path and 1140 ns on the JSON path (the MD5
digest, the hex string, and the `String.format` temporaries). With a typical
low-cardinality partition layout, the digest is computed once per partition
path instead of once per record. Benchmark code is not included in this PR.
### Risk Level
low
The file id string is byte-identical to before, and the cache is scoped to a
single-threaded, per-commit writer so it needs no synchronization. Covered by
the existing `hudi-kafka-connect` unit tests; `TestAbstractConnectWriter`
validates the produced record keys for both the Avro and JSON paths and passes.
### Documentation Update
none
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable