Lavan Vivekanandasarma created SPARK-56854:
----------------------------------------------
Summary: Filter None values in PySpark DataFrame[Stream]Reader/Writer .option(s) for parity with Spark Connect
Key: SPARK-56854
URL: https://issues.apache.org/jira/browse/SPARK-56854
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 5.0.0
Reporter: Lavan Vivekanandasarma
Classic PySpark's DataFrame[Stream]Reader/Writer.option(key, None) and
.options(**{k: None}) forward None to the JVM as Java null. This diverges
from the Spark Connect Python client (which has filtered None since
SPARK-49263) and from OptionUtils._set_opts at
python/pyspark/sql/readwriter.py:41-53, which already filters None.
Example:
    spark.read.options(nullValue=None).schema("a STRING, b STRING").csv(path)
For a row '"",val', Classic returns [Row(a='', b='val')] while Connect
returns [Row(a=None, b='val')].
Proposal: filter None from the public option and options methods on
DataFrameReader, DataFrameWriter, DataFrameWriterV2, DataStreamReader,
and DataStreamWriter, so Classic matches Connect and _set_opts. After
the change, option(k, None) is a no-op and options(**{k: None}) drops
None entries before forwarding.