[PR] [SPARK-56837][PYTHON][TESTS] Pass ArrowBatchedUDF benchmark input_type via EvalConf [spark]

via GitHub Tue, 12 May 2026 11:58:03 -0700


Yicong-Huang opened a new pull request, #55834:
URL: https://github.com/apache/spark/pull/55834


   ### What changes were proposed in this pull request?
   
   `_ArrowBatchedBenchMixin._write_scenario` in 
`python/benchmarks/bench_eval_type.py` wrote the `input_type` schema JSON as a 
length-prefixed UTF-8 string before the UDF payload, which is the old 
wire-protocol shape. Since 
[SPARK-56340](https://issues.apache.org/jira/browse/SPARK-56340) (move 
input_type schema to eval conf), the worker reads `input_type` from `EvalConf` 
instead, so the extra prefix is parsed as the UDF count and every later read 
is misaligned; the worker then exits with `UnicodeDecodeError` while reading 
subsequent UTF-8 fields.
   
   This PR moves the schema to `eval_conf={"input_type": schema.json()}`, 
matching the pattern already used by the `_ArrowTableUDFBenchMixin`.
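
As a rough illustration of the failure mode, the sketch below mimics a length-prefixed stream. The helpers `write_int`, `write_utf8`, and `read_int` are hypothetical stand-ins written for this example, not the actual PySpark serializer functions, and the placeholder schema is invented. It assumes a 4-byte big-endian length prefix and a reader that expects the UDF count first; the real worker protocol carries more fields than shown here.

```python
import io
import json
import struct


def write_int(value, stream):
    # Hypothetical helper: 4-byte big-endian signed integer.
    stream.write(struct.pack("!i", value))


def write_utf8(text, stream):
    # Hypothetical helper: length prefix followed by the UTF-8 bytes.
    data = text.encode("utf-8")
    write_int(len(data), stream)
    stream.write(data)


def read_int(stream):
    return struct.unpack("!i", stream.read(4))[0]


# Placeholder schema JSON, invented for this sketch.
schema_json = json.dumps({"type": "struct", "fields": []})

# Old shape: the schema is written inline before the UDF count.
old_stream = io.BytesIO()
write_utf8(schema_json, old_stream)
write_int(1, old_stream)  # the real UDF count
old_stream.seek(0)
# A reader that expects the UDF count first consumes the length prefix.
misread_count = read_int(old_stream)

# New shape: the schema travels in the eval conf (how the conf itself is
# serialized is out of scope here), so the count comes first on the stream.
eval_conf = {"input_type": schema_json}
new_stream = io.BytesIO()
write_int(1, new_stream)
new_stream.seek(0)
udf_count = read_int(new_stream)

print(misread_count, udf_count)
```

In the old shape the reader takes the schema's byte length as the UDF count, and everything after it is read from the wrong offset. In the new shape the count is read as written.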
   
   ### Why are the changes needed?
   
   Running any `ArrowBatchedUDFTimeBench` / `ArrowBatchedUDFPeakmemBench` ASV 
benchmark currently fails with:
   
   ```
     File "pyspark/worker.py", line 3581, in main
       init_info = WorkerInitInfo.from_stream(infile)
     ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 353: invalid start byte
   ```
   
   The bench file is the only `SQL_ARROW_BATCHED_UDF` mock writer in the tree 
and was missed when the worker protocol changed.
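
For readers unfamiliar with this error, here is a minimal sketch of how decoding non-text bytes as UTF-8 produces an "invalid start byte" message. The byte string is made up for the example and is not the benchmark's actual payload; the only assumption is that a misaligned read hands binary data to a UTF-8 decoder.

```python
# 0xA3 is a UTF-8 continuation byte, so it cannot begin a character.
# The payload below is invented for illustration.
payload = b"ok" + bytes([0xA3])

try:
    payload.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The reported position is simply the offset of the first undecodable byte within whatever buffer the reader happened to slice out.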
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Test-only change.
   
   ### How was this patch tested?
   
   Running both bench classes locally now succeeds. Numbers from one run:
   
   ```text
   === bench_eval_type.ArrowBatchedUDFTimeBench.time_worker ===
   scenario             identity_udf   stringify_udf   nullcheck_udf
   sm_batch_few_col      44.3+/-0.3ms    46.9+/-0.3ms    45.0+/-0.4ms
   sm_batch_many_col     112+/-0.7ms     113+/-1ms       112+/-0.5ms
   lg_batch_few_col      106+/-0.7ms     113+/-2ms       106+/-0.4ms
   lg_batch_many_col     448+/-1ms       449+/-0.3ms     447+/-3ms
   pure_ints             157+/-1ms       162+/-1ms       156+/-2ms
   pure_floats           148+/-0.2ms     170+/-1ms       149+/-2ms
   pure_strings          302+/-0.5ms     305+/-3ms       295+/-0.7ms
   mixed_types           226+/-0.9ms     230+/-1ms       222+/-0.9ms
   
   === bench_eval_type.ArrowBatchedUDFPeakmemBench.peakmem_worker ===
   scenario             identity_udf   stringify_udf   nullcheck_udf
   sm_batch_few_col      464M           464M            464M
   sm_batch_many_col     469M           469M            469M
   lg_batch_few_col      469M           470M            469M
   lg_batch_many_col     509M           510M            509M
   pure_ints             469M           470M            469M
   pure_floats           469M           470M            469M
   pure_strings          473M           473M            473M
   mixed_types           471M           471M            470M
   ```
   
   Run commands:
   
   ```bash
   COLUMNS=120 asv run --bench ArrowBatchedUDFTimeBench    -a repeat=3 --python=same
   COLUMNS=120 asv run --bench ArrowBatchedUDFPeakmemBench -a repeat=3 --python=same
   ```
   
   Smoke-tested all 40 benchmark classes in the file (every other class still 
passes; only the two ArrowBatched classes were broken).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

