Yicong-Huang opened a new pull request, #54293: URL: https://github.com/apache/spark/pull/54293
### What changes were proposed in this pull request?

Pass an explicit Spark schema (derived from each Arrow table's schema via `from_arrow_schema`) to `ArrowBatchTransformer.to_pandas()` in `CogroupPandasUDFSerializer.load_stream()`, instead of passing `None` (the inherited `_input_type`).

### Why are the changes needed?

`CogroupPandasUDFSerializer` is constructed without `input_type`, so `_input_type` defaults to `None`. When `to_pandas()` receives `schema=None`, it infers the Spark schema from the Arrow batch internally via `from_arrow_type()`. This works, but:

1. The same `None` is used for both the left and right DataFrames, which is conceptually wrong since they can have different schemas.
2. The schema inference is implicit rather than explicit.
3. Other serializers, such as `ArrowBatchUDFSerializer`, receive and pass explicit schemas.

This was raised in https://github.com/apache/spark/pull/53963#discussion_r2167770076.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests in `test_pandas_cogrouped_map.py`.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
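For readers who want the idea in isolation, here is a minimal sketch of the difference between implicit and explicit schema passing in a cogroup. It is not the Spark code: `Field`, `Table`, `infer_schema`, `to_rows`, and `load_cogroup` are hypothetical stand-ins written only for this illustration, and a plain list of dicts stands in for a pandas DataFrame.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Field:
    """Hypothetical column descriptor: a name and a type label."""
    name: str
    type_name: str


@dataclass
class Table:
    """Hypothetical stand-in for an Arrow table: a schema plus column data."""
    schema: Tuple[Field, ...]
    columns: Dict[str, List[Any]]


def infer_schema(table: Table) -> Tuple[Field, ...]:
    """Stand-in for implicit inference performed inside the converter."""
    return table.schema


def to_rows(table: Table, schema: Optional[Tuple[Field, ...]] = None) -> List[Dict[str, Any]]:
    """Stand-in for a conversion step that accepts an optional schema."""
    if schema is None:
        # Implicit path: the converter has to work the schema out itself.
        schema = infer_schema(table)
    names = [f.name for f in schema]
    return [dict(zip(names, values)) for values in zip(*(table.columns[n] for n in names))]


def load_cogroup(left: Table, right: Table):
    """Explicit path: each side is converted with a schema derived from its own table."""
    return to_rows(left, schema=left.schema), to_rows(right, schema=right.schema)


# The two sides of a cogroup can have different columns.
left = Table(
    schema=(Field("id", "long"), Field("v", "double")),
    columns={"id": [1, 2], "v": [1.0, 2.0]},
)
right = Table(
    schema=(Field("id", "long"), Field("name", "string")),
    columns={"id": [1], "name": ["a"]},
)
left_rows, right_rows = load_cogroup(left, right)
```

In this toy version both paths produce the same rows, which mirrors the PR's claim that there is no user-facing change: the point is only that the caller states which schema belongs to which side instead of sharing a single `None`.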
