This is an automated email from the ASF dual-hosted git repository.

viirya pushed a commit to branch branch-4.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-4.2 by this push:
     new f9a3a9bf0318 [SPARK-55754][PYTHON][TEST][FOLLOWUP] Fix pure_ints type 
mismatch in bench
f9a3a9bf0318 is described below

commit f9a3a9bf0318a21dfee2e3c6ed8a6e43739a273d
Author: Liang-Chi Hsieh <[email protected]>
AuthorDate: Wed May 27 23:52:57 2026 -0700

    [SPARK-55754][PYTHON][TEST][FOLLOWUP] Fix pure_ints type mismatch in bench
    
    ### What changes were proposed in this pull request?
    
    Refactor `MockDataFactory.NAMED_TYPE_POOLS` in 
`python/benchmarks/bench_eval_type.py` so the `pure_ints`, `pure_floats`, and 
`pure_strings` entries reuse the corresponding `TYPE_REGISTRY` entries instead 
of duplicating their factory lambdas.
    
    ### Why are the changes needed?
    
    `NAMED_TYPE_POOLS["pure_ints"]` declared the column as `IntegerType()` 
(32-bit) but generated data with `np.int64`. Because every benchmark that uses 
this pool runs through serializers with `arrow_cast=True`, the mismatch was 
silently corrected by a 64-to-32 narrowing cast inside the pandas/arrow 
conversion path -- meaning the `pure_ints` scenario in seven mixins 
(`ArrowBatchedUDF`, `ArrowUDTF`, `ArrowTableUDF`, `MapArrowIterUDF`, 
`MapPandasIterUDF`, `ScalarArrowUDF`, `ScalarPandasU [...]
    
    `pure_floats` and `pure_strings` had no such mismatch but duplicated the 
same lambdas as `TYPE_REGISTRY["double"]` / `TYPE_REGISTRY["string"]`, 
risking drift in future edits. Reusing the registry entries eliminates the 
duplication. `pure_ts` is left as-is because no matching `TYPE_REGISTRY` entry 
exists.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. Test-only change in the benchmark module.
    
    ### How was this patch tested?
    
    - Confirmed `NAMED_TYPE_POOLS["pure_ints"][0]` now produces a 
`pa.int32()` array matching its `IntegerType()` declaration (was `pa.int64()`).
    - Confirmed `pure_floats` and `pure_strings` still produce `pa.float64()` 
and `pa.string()` arrays after the refactor.
    - Ran `setup` + `time_worker` for the `pure_ints` scenario across all seven 
affected `*TimeBench` classes; all passed.
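    
    The reuse pattern behind these checks can be sketched in plain Python. The 
registry contents here are hypothetical stand-ins (list-producing factories and 
string type tags instead of the real Arrow factories and Spark types); only the 
dictionary keys and the pool layout follow the diff.

```python
from typing import Callable

# Hypothetical stand-in registry: each entry is a (factory, declared type) pair.
TYPE_REGISTRY: dict[str, tuple[Callable[[int], list], str]] = {
    "int": (lambda r: [i % 1000 for i in range(r)], "int32"),
    "double": (lambda r: [i / r for i in range(r)], "float64"),
    "string": (lambda r: [f"s{j}" for j in range(r)], "string"),
}

# Pools reference the registry entries instead of duplicating the lambdas,
# so the factory and its declared type cannot drift apart.
NAMED_TYPE_POOLS: dict[str, list[tuple[Callable[[int], list], str]]] = {
    "pure_ints": [TYPE_REGISTRY["int"]],
    "pure_floats": [TYPE_REGISTRY["double"]],
    "pure_strings": [TYPE_REGISTRY["string"]],
}

factory, declared = NAMED_TYPE_POOLS["pure_ints"][0]
sample = factory(3)
```

    Because each pool holds the same tuple object as the registry, a later edit 
to a registry entry is picked up by the pool automatically.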
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Yes. Generated-by: Claude Code (claude-opus-4-7)
    
    Closes #56169 from viirya/SPARK-55724-pure-ints-followup.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    (cherry picked from commit fc5abd63c107e41a145239c28b3524176b94013f)
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
---
 python/benchmarks/bench_eval_type.py | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/python/benchmarks/bench_eval_type.py b/python/benchmarks/bench_eval_type.py
index 845b54021775..af6189a4560e 100644
--- a/python/benchmarks/bench_eval_type.py
+++ b/python/benchmarks/bench_eval_type.py
@@ -200,11 +200,9 @@ class MockDataFactory:
 
     NAMED_TYPE_POOLS: dict[str, list[tuple[Callable, Any]]] = {
         "mixed": MIXED_TYPES,
-        "pure_ints": [
-            (lambda r: pa.array(np.random.randint(0, 1000, r, dtype=np.int64)), IntegerType())
-        ],
-        "pure_floats": [(lambda r: pa.array(np.random.rand(r)), DoubleType())],
-        "pure_strings": [(lambda r: pa.array([f"s{j}" for j in range(r)]), StringType())],
+        "pure_ints": [TYPE_REGISTRY["int"]],
+        "pure_floats": [TYPE_REGISTRY["double"]],
+        "pure_strings": [TYPE_REGISTRY["string"]],
         "pure_ts": [
             (
                 lambda r: pa.array(


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
