[PR] [SPARK-56657][PYTHON][TESTS] Add ASV microbenchmark for SQL_MAP_PANDAS_ITER_UDF [spark]

via GitHub Wed, 29 Apr 2026 00:14:23 -0700


Yicong-Huang opened a new pull request, #55601:
URL: https://github.com/apache/spark/pull/55601


   ### What changes were proposed in this pull request?
   
   Adds ASV microbenchmark coverage for `SQL_MAP_PANDAS_ITER_UDF` (the eval 
type backing `DataFrame.mapInPandas`). Two new classes are added to 
`python/benchmarks/bench_eval_type.py`:
   
   - `MapPandasIterUDFTimeBench`
   - `MapPandasIterUDFPeakmemBench`
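
For readers unfamiliar with ASV, a time/peakmem class pair can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code added by this PR: the class names, parameter values, and the pure-Python workload are all hypothetical. Only the ASV conventions (`params`, `param_names`, `setup`, and the `time_`/`peakmem_` method prefixes) are real.

```python
# Hypothetical sketch of an ASV time/peakmem benchmark pair. Names and
# workload are illustrative and are NOT taken from bench_eval_type.py.


class _ExampleBenchMixin:
    # ASV runs each benchmark once per combination of these parameters.
    params = [["small", "large"], ["identity", "sort"]]
    param_names = ["scenario", "udf"]

    def setup(self, scenario, udf):
        # Input is built here so that it stays outside the measured region.
        n = 1_000 if scenario == "small" else 10_000
        self.rows = list(range(n, 0, -1))
        self.func = sorted if udf == "sort" else list

    def _run(self):
        return self.func(self.rows)


class ExampleTimeBench(_ExampleBenchMixin):
    # Methods prefixed with `time_` are timed by ASV.
    def time_udf(self, scenario, udf):
        self._run()


class ExamplePeakmemBench(_ExampleBenchMixin):
    # Methods prefixed with `peakmem_` have their peak memory recorded by ASV.
    def peakmem_udf(self, scenario, udf):
        self._run()
```

Sharing the setup through a mixin keeps the time and peak-memory variants on identical inputs, so the two result tables stay comparable.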
   
   The benchmark mirrors the SQL_MAP_ARROW_ITER_UDF setup: input batches are 
wrapped in a single struct column to match the JVM-side wire format produced by 
`MapInBatchEvaluatorFactory` (which wraps each row in `InternalRow(_)` and 
sends a `StructField("struct", inputSchema)`). On the Python side, 
`ArrowStreamPandasUDFSerializer` with `df_for_struct=True` deserializes that 
struct column back into `pandas.DataFrame` batches for the UDF iterator.
   
   Three UDFs (`identity_udf`, `sort_udf`, `filter_udf`) are exercised across 9 
scenarios: small/large batches with few/many columns, pure-type pools 
(ints/floats/strings/timestamps), and mixed types. Row counts are scaled down 
~5x relative to the `SQL_MAP_ARROW_ITER_UDF` benchmark to account for per-batch 
Arrow<->pandas conversion overhead.
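
As a rough guide to what the three UDFs look like, the sketch below follows the `mapInPandas` contract (an iterator of `pandas.DataFrame` in, an iterator of `pandas.DataFrame` out). The function names match the PR, but the bodies are assumptions: the sort key (first column) and the filter predicate (non-null first column) are hypothetical choices, not the benchmark's actual logic.

```python
from typing import Iterator

import pandas as pd


def identity_udf(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Pass every batch through unchanged; isolates serialization overhead.
    for df in batches:
        yield df


def sort_udf(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Assumption: sort each batch by its first column.
    for df in batches:
        yield df.sort_values(by=df.columns[0])


def filter_udf(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Assumption: keep rows whose first column is non-null.
    for df in batches:
        yield df[df.iloc[:, 0].notna()]
```

With Spark available, functions of this shape would be passed as `df.mapInPandas(sort_udf, schema=df.schema)`; the benchmark instead drives the worker-side code path directly.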
   
   ### Why are the changes needed?
   
   This is part of the PySpark eval-type microbenchmark coverage effort under 
SPARK-55724. A baseline benchmark for `SQL_MAP_PANDAS_ITER_UDF` is required 
before the upcoming refactor that consolidates per-eval-type serializer logic 
into `read_udfs()`, so that the refactor PR can demonstrate no regression.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is a benchmark-only change.
   
   ### How was this patch tested?
   
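   Ran the new benchmarks twice locally with `COLUMNS=120 asv run --bench 
MapPandasIter --quick --python=same`; numbers were stable across runs. The 
first table below shows timings (`MapPandasIterUDFTimeBench`) and the second 
shows peak memory (`MapPandasIterUDFPeakmemBench`).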
   
   ```text
   =================== ============== ========== ============
   --                                   udf
   ------------------- --------------------------------------
         scenario       identity_udf   sort_udf   filter_udf
   =================== ============== ========== ============
     sm_batch_few_col     355+/-0ms    398+/-0ms    337+/-0ms
    sm_batch_many_col     223+/-0ms    265+/-0ms    246+/-0ms
     lg_batch_few_col     704+/-0ms    852+/-0ms    807+/-0ms
    lg_batch_many_col     782+/-0ms    1.20+/-0s    831+/-0ms
        pure_ints         164+/-0ms    177+/-0ms    166+/-0ms
       pure_floats        164+/-0ms    187+/-0ms    173+/-0ms
      pure_strings        576+/-0ms    862+/-0ms    647+/-0ms
         pure_ts          240+/-0ms    248+/-0ms    237+/-0ms
       mixed_types        420+/-0ms    494+/-0ms    446+/-0ms
   =================== ============== ========== ============
   ```
   
   ```text
   =================== ============== ========== ============
   --                                   udf
   ------------------- --------------------------------------
         scenario       identity_udf   sort_udf   filter_udf
   =================== ============== ========== ============
     sm_batch_few_col       473M         473M        473M
    sm_batch_many_col       477M         478M        477M
     lg_batch_few_col       522M         523M        518M
    lg_batch_many_col       564M         563M        567M
        pure_ints           484M         485M        484M
       pure_floats          497M         498M        497M
      pure_strings          511M         512M        511M
         pure_ts            501M         501M        501M
       mixed_types          494M         496M        494M
   =================== ============== ========== ============
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

