Philip Kahn created SPARK-53662:
-----------------------------------
Summary: PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE should not trigger
on materializing results
Key: SPARK-53662
URL: https://issues.apache.org/jira/browse/SPARK-53662
Project: Spark
Issue Type: Bug
Components: Pandas API on Spark
Affects Versions: 4.0.0
Environment: Databricks Runtime v17.1
Reporter: Philip Kahn
With RDDs unavailable under Spark Connect, the pandas API on Spark is the only
practical way to pull records into Python. However, as of Spark 4, pandas API
access raises `PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE` even for read-only
operations on a Spark DataFrame.
Read-only operations, such as `loc` getters, DataFrame subsets, etc., should not
raise that error.
For example:
```
# [Spark operations that produce the Spark DataFrame `df`]
# Read the first column
for myValue in df.pandas_api().iloc[:, 0].to_numpy():
    # Do pure Python transformations ...
    pass
```
should not raise an exception. Requiring a user to turn ANSI mode off and back
on around each of these operations is unreasonable and will just result in ANSI
mode being turned off altogether.
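To make the cost of that toggling concrete, here is a minimal sketch of the per-operation workaround, assuming ANSI mode is controlled by the `spark.sql.ansi.enabled` setting and that the session's `conf` object exposes `get(key)` and `set(key, value)`. The helper name `ansi_disabled` is hypothetical and not part of PySpark.

```python
from contextlib import contextmanager


@contextmanager
def ansi_disabled(conf, key="spark.sql.ansi.enabled"):
    """Temporarily set the ANSI flag to "false", then restore the old value.

    `conf` is assumed to behave like `spark.conf`: it must provide
    `get(key)` and `set(key, value)`. This helper is illustrative only.
    """
    previous = conf.get(key)
    conf.set(key, "false")
    try:
        yield
    finally:
        # Restore the previous value even if the wrapped code raises.
        conf.set(key, previous)


# Hypothetical usage, assuming an existing session `spark` and DataFrame `df`:
#
# with ansi_disabled(spark.conf):
#     values = df.pandas_api().iloc[:, 0].to_numpy()
```

Every read of results would need to be wrapped this way, which is the boilerplate this report argues should not be necessary.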
If that's not feasible, there needs to be a Spark Connect-compatible way to load
all or a subset of the results into driver memory without twisting oneself into
knots.