Philip Kahn created SPARK-53662:
-----------------------------------

             Summary: PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE should not trigger on materializing results
                 Key: SPARK-53662
                 URL: https://issues.apache.org/jira/browse/SPARK-53662
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 4.0.0
         Environment: Databricks Runtime v17.1
            Reporter: Philip Kahn


With RDDs unavailable under Spark Connect, the Pandas API is the only real way 
to pull records into Python. However, as of Spark 4, Pandas API access raises 
`PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE` even for read-only operations on a 
Spark DataFrame.

Read-only operations, such as `loc` getters, DataFrame subsets, etc., should 
not raise this error.

For example:


```
# [Spark operations that produce df]
# Read the first column
for myValue in df.pandas_api().iloc[:, 0].to_numpy():
    # Do pure Python transformations ...
    pass
```

should not raise an exception. Requiring a user to turn ANSI mode off and back 
on around each of these operations is unreasonable and will just result in 
ANSI mode being turned off altogether.
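
To make the per-operation toggling concrete, here is a rough sketch of what 
each read would need to be wrapped in today. `ansi_disabled` is a hypothetical 
helper, not part of PySpark. It assumes `spark` is a session object exposing 
`conf.get` / `conf.set`, and that `spark.sql.ansi.enabled` is the setting that 
triggers the error.

```python
from contextlib import contextmanager

# Assumed to be the SQL conf that controls ANSI mode.
ANSI_KEY = "spark.sql.ansi.enabled"


@contextmanager
def ansi_disabled(spark):
    """Hypothetical helper: turn ANSI mode off for one block, then restore it.

    Only `spark.conf.get` and `spark.conf.set` are used, so any session-like
    object with those two methods works.
    """
    previous = spark.conf.get(ANSI_KEY)
    spark.conf.set(ANSI_KEY, "false")
    try:
        yield spark
    finally:
        # Restore the earlier value even if the wrapped code raises.
        spark.conf.set(ANSI_KEY, previous)


# Intended usage (needs a live session and DataFrame, so left as a comment):
# with ansi_disabled(spark):
#     values = df.pandas_api().iloc[:, 0].to_numpy()
```

Every read-only access would need this wrapper, which is the boilerplate this 
report is objecting to.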

If changing this behavior is not feasible, a Spark Connect-compatible way to 
load all or a subset of results into driver memory needs to be available 
without twisting oneself into knots.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)