Alex Khakhlyuk created SPARK-50537:
--------------------------------------
Summary: Fix compression option being overwritten in
df.write.parquet in SparkConnect Python
Key: SPARK-50537
URL: https://issues.apache.org/jira/browse/SPARK-50537
Project: Spark
Issue Type: Bug
Components: Connect
Affects Versions: 3.4.0, 3.4.1, 3.4.2, 3.4.3, 3.4.4, 3.5.0, 3.5.1, 3.5.2,
3.5.3, 3.5.4, 4.0.0
Reporter: Alex Khakhlyuk
There is a small bug in Spark Connect's {{{}DataFrameWriter{}}}.
df.write.option("compression", "gzip").parquet(path)
When this code runs, the specified "gzip" compression gets overwritten by
None. This happens because the {{parquet()}} function has a default
{{compression=None}} parameter that is passed straight to
{{{}self.option("compression", compression){}}}, clobbering any value set
earlier in the chain.
The Spark Connect server then receives a request without a specified
compression option and uses "snappy" compression by default instead.
The fix is to call {{{}self._set_opts(compression=compression){}}} instead,
which filters out parameters whose value is None. Most other
{{DataFrameWriter}} APIs already set options this way.
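The failure mode can be sketched without Spark at all. The toy {{Writer}} class below is a hypothetical stand-in (not the actual Spark Connect code): {{option()}} stores a value unconditionally, while a {{_set_opts()}}-style helper skips None, mirroring the proposed fix.

```python
# Minimal sketch of the bug pattern; Writer is a made-up class,
# not PySpark's DataFrameWriter.
class Writer:
    def __init__(self):
        self._options = {}

    def option(self, key, value):
        # Stores the value unconditionally -- even when it is None.
        self._options[key] = value
        return self

    def _set_opts(self, **opts):
        # Only stores options that were actually provided (non-None),
        # mirroring the fix proposed in this report.
        for key, value in opts.items():
            if value is not None:
                self._options[key] = value

    def parquet_buggy(self, compression=None):
        # Buggy pattern: the None default overwrites "gzip" set earlier,
        # so the server never sees a compression option.
        self.option("compression", compression)
        return self._options

    def parquet_fixed(self, compression=None):
        # Fixed pattern: None is filtered out, so "gzip" survives.
        self._set_opts(compression=compression)
        return self._options


buggy = Writer().option("compression", "gzip").parquet_buggy()
fixed = Writer().option("compression", "gzip").parquet_fixed()
print(buggy["compression"])  # None -- server falls back to its default
print(fixed["compression"])  # gzip
```

With the buggy pattern the request carries {{compression=None}}, so the server silently applies its own default ("snappy"); with the fixed pattern the user's "gzip" choice reaches the server.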
--
This message was sent by Atlassian Jira
(v8.20.10#820010)