Yicong Huang created SPARK-55821:
------------------------------------
Summary: [PYTHON] Enforce keyword-only arguments in serializer __init__ methods
Key: SPARK-55821
URL: https://issues.apache.org/jira/browse/SPARK-55821
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.0.0
Reporter: Yicong Huang
The serializer classes in `pyspark.sql.pandas.serializers` accept many
positional arguments in their `__init__` methods, making call sites error-prone
and hard to read.
For example, `ArrowStreamPandasUDFSerializer.__init__` takes 12 parameters and
`ApplyInPandasWithStateSerializer.__init__` takes 7. When these are called with
positional arguments, it is easy to transpose two of them, and Python raises no
error if the swapped values are of compatible types.
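A minimal sketch of the failure mode described above. `PositionalSerializer` and
its three parameters are hypothetical stand-ins chosen for illustration; they
are not the actual signature of any class in `pyspark.sql.pandas.serializers`.

```python
class PositionalSerializer:
    """Hypothetical serializer; parameter names are illustrative only."""

    def __init__(self, timezone, safecheck, assign_cols_by_name):
        self.timezone = timezone
        self.safecheck = safecheck
        self.assign_cols_by_name = assign_cols_by_name


# Intended call: safecheck=True, assign_cols_by_name=False.
intended = PositionalSerializer("UTC", True, False)

# The two booleans are transposed. Python accepts the call without complaint,
# so the mistake only shows up later as wrong behavior.
swapped = PositionalSerializer("UTC", False, True)
```

Nothing at the call site indicates which boolean is which, so a reviewer has to
open the class definition to spot the swap.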
We should enforce keyword-only arguments (using a bare `*` separator after
`self`) in serializer `__init__` methods to improve readability and prevent
positional argument mistakes.
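As a rough illustration of the proposed change, the sketch below puts a bare
`*` after `self` so that every following parameter must be passed by name.
`KeywordOnlySerializer` and its parameters are hypothetical, not the real
PySpark signatures.

```python
class KeywordOnlySerializer:
    """Hypothetical serializer; parameter names are illustrative only."""

    # Everything after the bare `*` is keyword-only.
    def __init__(self, *, timezone, safecheck, assign_cols_by_name=False):
        self.timezone = timezone
        self.safecheck = safecheck
        self.assign_cols_by_name = assign_cols_by_name


# Keyword call: each value is labeled at the call site.
ser = KeywordOnlySerializer(timezone="UTC", safecheck=True)

# Positional call: rejected with a TypeError instead of being silently accepted.
try:
    KeywordOnlySerializer("UTC", True)
except TypeError as exc:
    positional_error = exc
else:
    positional_error = None
```

With this signature a transposed pair of arguments cannot occur, because the
order of keyword arguments does not matter.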
Classes to update:
- `ArrowStreamPandasSerializer`
- `ArrowStreamPandasUDFSerializer`
- `ArrowStreamArrowUDFSerializer`
- `ArrowBatchUDFSerializer`
- `ArrowStreamPandasUDTFSerializer`
- `ArrowStreamAggPandasUDFSerializer`
- `GroupPandasUDFSerializer`
- `CogroupPandasUDFSerializer`
- `ApplyInPandasWithStateSerializer`
- `TransformWithStateInPandasSerializer`
- `TransformWithStateInPandasInitStateSerializer`
- `ArrowStreamGroupUDFSerializer`
- `CogroupArrowUDFSerializer`
All call sites in `worker.py` and within `serializers.py` (subclass
`super().__init__` calls) must also be updated to use keyword arguments.
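The subclass case might look roughly like the following. `BaseSerializer`,
`DerivedSerializer`, and the `arrow_cast` parameter are hypothetical names used
only to show the pattern; the real classes and their parameters may differ.

```python
class BaseSerializer:
    """Hypothetical base class with a keyword-only constructor."""

    def __init__(self, *, timezone, safecheck):
        self.timezone = timezone
        self.safecheck = safecheck


class DerivedSerializer(BaseSerializer):
    """Hypothetical subclass that forwards its arguments by name."""

    def __init__(self, *, timezone, safecheck, arrow_cast=False):
        # The parent constructor is keyword-only, so a positional
        # super().__init__(timezone, safecheck) would raise TypeError.
        super().__init__(timezone=timezone, safecheck=safecheck)
        self.arrow_cast = arrow_cast


# A call site such as one in worker.py would then name every argument.
ser = DerivedSerializer(timezone="UTC", safecheck=False, arrow_cast=True)
```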