Yicong-Huang opened a new pull request, #55601:
URL: https://github.com/apache/spark/pull/55601
### What changes were proposed in this pull request?
Adds ASV microbenchmark coverage for `SQL_MAP_PANDAS_ITER_UDF` (the eval
type backing `DataFrame.mapInPandas`). Two new classes are added to
`python/benchmarks/bench_eval_type.py`:
- `MapPandasIterUDFTimeBench`
- `MapPandasIterUDFPeakmemBench`

The benchmark mirrors the `SQL_MAP_ARROW_ITER_UDF` setup: input batches are
wrapped in a single struct column to match the JVM-side wire format produced by
`MapInBatchEvaluatorFactory` (which wraps each row in `InternalRow(_)` and
sends a `StructField("struct", inputSchema)`). On the Python side,
`ArrowStreamPandasUDFSerializer` with `df_for_struct=True` deserializes that
struct column back into `pandas.DataFrame` batches for the UDF iterator.
Three UDFs (`identity_udf`, `sort_udf`, `filter_udf`) are exercised across 9
scenarios spanning small/large batches with few/many columns, pure-type pools
(ints/floats/strings/timestamps), and mixed types. Row counts are scaled down
~5x relative to the `SQL_MAP_ARROW_ITER_UDF` benchmark to account for per-batch
Arrow<->pandas conversion overhead.
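As a rough sketch of what such a benchmark can look like, the snippet below defines three iterator-style UDFs and an ASV-style class parameterized over scenario and UDF. The UDF bodies are assumptions (the description does not say what `sort_udf` and `filter_udf` sort or filter on), and the class name, scenario names, and row counts are hypothetical rather than taken from `bench_eval_type.py`.

```python
import pandas as pd


def identity_udf(batches):
    # Pass every batch through unchanged.
    for pdf in batches:
        yield pdf


def sort_udf(batches):
    # Assumption: sort each batch by its first column.
    for pdf in batches:
        yield pdf.sort_values(by=pdf.columns[0])


def filter_udf(batches):
    # Assumption: keep rows whose first column is positive.
    for pdf in batches:
        yield pdf[pdf.iloc[:, 0] > 0]


UDFS = {"identity_udf": identity_udf, "sort_udf": sort_udf, "filter_udf": filter_udf}
ROWS = {"small": 10, "large": 1000}  # hypothetical scenario sizes


class ExampleMapIterTimeBench:
    # ASV runs each time_* method once per combination of these parameters.
    params = [list(ROWS), list(UDFS)]
    param_names = ["scenario", "udf"]

    def setup(self, scenario, udf):
        rows = ROWS[scenario]
        # Four identical batches with a descending integer column "a".
        self.batches = [
            pd.DataFrame({"a": list(range(rows, 0, -1)), "b": [str(v) for v in range(rows)]})
            for _ in range(4)
        ]
        self.udf = UDFS[udf]

    def time_udf(self, scenario, udf):
        # Drain the generator so the UDF actually runs on every batch.
        for _ in self.udf(iter(self.batches)):
            pass
```

A peak-memory variant would follow the same shape with a `peakmem_`-prefixed method instead of `time_`, which is how ASV distinguishes the two benchmark kinds.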
### Why are the changes needed?
This is part of the PySpark eval-type micro-benchmark coverage effort under
SPARK-55724. A baseline benchmark for `SQL_MAP_PANDAS_ITER_UDF` is required
before the upcoming refactor that consolidates per-eval-type serializer logic
into `read_udfs()` (so we can demonstrate no regression on the refactor PR).
### Does this PR introduce _any_ user-facing change?
No. This is a benchmark-only change.
### How was this patch tested?
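Ran the new benchmark twice locally with `COLUMNS=120 asv run --bench
MapPandasIter --quick --python=same`; numbers were stable across runs. The
first table below shows the timing results (`MapPandasIterUDFTimeBench`) and
the second shows peak memory (`MapPandasIterUDFPeakmemBench`).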
```text
=================== ============== ========== ============
--                                   udf
------------------- --------------------------------------
scenario            identity_udf   sort_udf   filter_udf
=================== ============== ========== ============
sm_batch_few_col    355+/-0ms      398+/-0ms  337+/-0ms
sm_batch_many_col   223+/-0ms      265+/-0ms  246+/-0ms
lg_batch_few_col    704+/-0ms      852+/-0ms  807+/-0ms
lg_batch_many_col   782+/-0ms      1.20+/-0s  831+/-0ms
pure_ints           164+/-0ms      177+/-0ms  166+/-0ms
pure_floats         164+/-0ms      187+/-0ms  173+/-0ms
pure_strings        576+/-0ms      862+/-0ms  647+/-0ms
pure_ts             240+/-0ms      248+/-0ms  237+/-0ms
mixed_types         420+/-0ms      494+/-0ms  446+/-0ms
=================== ============== ========== ============
```
```text
=================== ============== ========== ============
--                                   udf
------------------- --------------------------------------
scenario            identity_udf   sort_udf   filter_udf
=================== ============== ========== ============
sm_batch_few_col    473M           473M       473M
sm_batch_many_col   477M           478M       477M
lg_batch_few_col    522M           523M       518M
lg_batch_many_col   564M           563M       567M
pure_ints           484M           485M       484M
pure_floats         497M           498M       497M
pure_strings        511M           512M       511M
pure_ts             501M           501M       501M
mixed_types         494M           496M       494M
=================== ============== ========== ============
```
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]