[PR] perf(spark): restore Catalyst fast-paths for SimpleKeyGen/Nonpartitioned bulk insert [hudi]

via GitHub Fri, 12 Jun 2026 14:10:30 -0700


nsivabalan opened a new pull request, #18990:
URL: https://github.com/apache/hudi/pull/18990


   ### Describe the issue this Pull Request addresses
   
   Closes #18989.
   
   ### Summary and Changelog
   
   `HoodieDatasetBulkInsertHelper.prepareForBulkInsert` was routing every 
key-generator through `df.queryExecution.toRdd.mapPartitions(...)`, forcing an 
RDD round-trip and per-row reflection-based keygen invocation even for the 
common keygens where the record-key and partition-path values can be sourced 
directly from input columns.
   
   This patch restores tiered dispatch:
   
   - **Tier 1 — `NonpartitionedKeyGenerator`** (single record-key field): emits 
`col(rk).cast(String)` + `lit("")` as Catalyst columns. No UDF, no toRdd 
round-trip.
   - **Tier 2 — `SimpleKeyGenerator`** (single record-key + single 
partition-path field, URL-encoding off, slash-separated dates off): emits 
`col(rk).cast(String)` and a partition-path expression mirroring 
`PartitionPathFormatterBase#combine`, including the `handleEmpty -> 
__HIVE_DEFAULT_PARTITION__` substitution and hive-style `<field>=` prefixing.
   - **Tier 3 — everything else** (multi-field keys, `ComplexKeyGenerator`, 
`TimestampBasedKeyGenerator`, `CustomKeyGenerator`, `SimpleKeyGenerator` with 
URL-encode or slash-separated dates): anonymous `functions.udf(...)` over a 
struct of input columns calling the canonical 
`BuiltinKeyGenerator.getRecordKey(Row)` / `getPartitionPath(Row)`. The UDFs are 
not registered against the `SparkSession`, so nothing leaks across writes.
   - **Auto-record-key generation** keeps the existing RDD path; it needs 
`TaskContext.partitionId` and a stateful per-task counter, which can't be 
expressed cleanly as a driver-side closure.
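
The dispatch rules above can be sketched as a pure decision function. This is a minimal illustration and not the code in the patch: the class `BulkInsertTierSketch`, the `Tier` enum, the `choose` method and its parameters are all hypothetical names, and checking auto-record-key generation first is an assumption about ordering.

```java
// Hypothetical sketch of the tier selection described above; none of these
// names come from the Hudi code base.
final class BulkInsertTierSketch {

    enum Tier {
        CATALYST_NONPARTITIONED, // Tier 1: plain column expressions, no UDF
        CATALYST_SIMPLE,         // Tier 2: column expressions plus partition-path formatting
        UDF_FALLBACK,            // Tier 3: anonymous UDF calling the Row-based keygen API
        RDD_AUTO_KEY             // auto-record-key generation stays on the RDD path
    }

    static Tier choose(String keyGenSimpleName,
                       int recordKeyFields,
                       int partitionPathFields,
                       boolean urlEncode,
                       boolean slashSeparatedDates,
                       boolean autoRecordKey) {
        // Assumption: the auto-record-key check takes precedence over the fast paths.
        if (autoRecordKey) {
            return Tier.RDD_AUTO_KEY;
        }
        boolean singleKey = recordKeyFields == 1;
        if (singleKey && "NonpartitionedKeyGenerator".equals(keyGenSimpleName)) {
            return Tier.CATALYST_NONPARTITIONED;
        }
        if (singleKey
                && partitionPathFields == 1
                && "SimpleKeyGenerator".equals(keyGenSimpleName)
                && !urlEncode
                && !slashSeparatedDates) {
            return Tier.CATALYST_SIMPLE;
        }
        // Multi-field keys, other keygen classes, and SimpleKeyGenerator with
        // URL encoding or slash-separated dates all fall through to the UDF.
        return Tier.UDF_FALLBACK;
    }
}
```

Under these assumptions, `choose("SimpleKeyGenerator", 1, 1, true, false, false)` yields `UDF_FALLBACK`, because URL encoding disqualifies the Tier 2 fast path.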
   
   The Tier 3 UDF goes through the `Row`-overload keygen API, which uses the 
canonical `String` formatter, so all three partition-formatter flags 
(hive-style, URL encode, slash-separated dates) remain honored for the keygens 
that fall through. The Tier 2 fast-path encodes only the default and hive-style 
flag subset (URL encoding has no efficient pure-Catalyst equivalent; the 1.2.0+ 
slash-separated branch exercises a separate code path we'd rather not encode 
twice).
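
For reference, the string semantics that the Tier 2 expression has to reproduce can be written as a plain function. This is a sketch of the behaviour described above, not the Catalyst expression from the patch: `PartitionPathSketch` and `format` are hypothetical names, and treating `null` like an empty value and keeping the `<field>=` prefix on a substituted value are assumptions.

```java
// Hypothetical plain-Java model of the default and hive-style partition-path
// formatting; URL encoding and slash-separated dates are deliberately absent,
// matching the flag subset the Tier 2 fast path covers.
final class PartitionPathSketch {

    static final String DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__";

    static String format(String field, String value, boolean hiveStyle) {
        // Assumption: null is handled the same way as an empty string.
        String resolved = (value == null || value.isEmpty()) ? DEFAULT_PARTITION : value;
        // Hive-style partitioning prefixes the value with "<field>=".
        return hiveStyle ? field + "=" + resolved : resolved;
    }
}
```

With this model, `format("region", "emea", true)` gives `region=emea`, and `format("region", "", false)` gives `__HIVE_DEFAULT_PARTITION__`.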
   
   New tests in `TestHoodieDatasetBulkInsertHelper`:
   
   - `testKeyGenParityAgainstAvroGroundTruth` (parameterized, 11 cases) — every 
supported keygen class plus the `SimpleKeyGen` flag combos (default / hive / 
slash / hive+slash / URL / hive+URL / Complex single+multi / TimestampBased / 
Custom). Each case asserts the helper's record-key and partition-path output 
matches `BuiltinKeyGenerator`'s Avro path byte-for-byte.
   - `testFastPathCastsNonStringRecordKey` — Tier 1/2 must materialize the 
string form of a non-string record-key column (uses `ts: long`).
   - `testFastPathAvoidsUdf` — Tier 1/2 analyzed logical plans must not contain 
a `ScalaUDF` node (i.e. they actually benefit from Catalyst codegen).
   - `testTier2EmptyPartitionValueSubstitutedWithHiveDefault` — empty partition 
values resolve to `__HIVE_DEFAULT_PARTITION__` under both default and 
hive-style flags.
   - `testUdfPathRespectsDriverSessionTimezone` — Tier 3 UDF picks up the 
driver's `spark.sql.session.timeZone` (guards against executor JVM default 
leakage on `TimestampBasedKeyGenerator`).
   
   ### Impact
   
   Performance: restores per-row Catalyst codegen for bulk inserts that use 
`NonpartitionedKeyGenerator` or `SimpleKeyGenerator` (with default or 
hive-style partitioning) — the most common configurations in practice. No 
behaviour change for the keygens that fall through to Tier 3; their output is 
byte-identical to the prior RDD path (and to the Avro ground truth, which the 
parity test now enforces).
   
   No public API change. No config change. No storage format change.
   
   ### Risk Level
   
   Low. The change is contained to 
`HoodieDatasetBulkInsertHelper.prepareForBulkInsert` (Scala helper, no public 
API surface), and the parity test checks each supported keygen class and the 
`SimpleKeyGenerator` formatter-flag combinations against the canonical Avro 
keygen output. The Tier 3 fallback is the UDF path that replaces the prior RDD 
path; it calls the canonical `BuiltinKeyGenerator` `Row` API, so any keygen the 
fast paths don't claim continues to use the canonical formatter.
   
   ### Documentation Update
   
   None.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


