[ 
https://issues.apache.org/jira/browse/SPARK-53662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Kahn updated SPARK-53662:
--------------------------------
    Description: 
With RDDs unavailable under Spark Connect, the Pandas API is the only practical 
way to pull records into Python. However, as of Spark 4, Pandas API access 
raises {{PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE}} even for read-only 
operations on a Spark DataFrame.

Read-only operations, such as {{loc}} getters, DataFrame subsets, etc., should 
not raise those errors.

For example:

{code:python}
# [Spark operations that produce df]
# Read the first column
for my_value in df.pandas_api().iloc[:, 0].to_numpy():
    ...  # Do pure Python transformations
{code}

This should not raise an exception. Requiring users to toggle ANSI mode around 
each of these operations is unreasonable and will just result in ANSI mode 
being turned off altogether.
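
For reference, a minimal sketch of the per-operation toggle described above. It assumes {{spark}} is an active SparkSession and that {{spark.sql.ansi.enabled}} is the setting being flipped; the helper name {{ansi_disabled}} is hypothetical and not part of PySpark.

```python
from contextlib import contextmanager

# Assumption: this is the session setting that controls ANSI mode.
ANSI_KEY = "spark.sql.ansi.enabled"


@contextmanager
def ansi_disabled(spark):
    """Hypothetical helper: turn ANSI mode off for one block, then restore it."""
    previous = spark.conf.get(ANSI_KEY)
    spark.conf.set(ANSI_KEY, "false")
    try:
        yield spark
    finally:
        # Restore the earlier value even if the wrapped code raised.
        spark.conf.set(ANSI_KEY, previous)


# Intended usage (needs a live session and DataFrame, so left as a comment):
# with ansi_disabled(spark):
#     values = df.pandas_api().iloc[:, 0].to_numpy()
```

Every read would have to be wrapped this way, which is the overhead this issue objects to.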

If changing this behavior is not feasible, a Spark Connect compatible way to 
load all or a subset of results into driver memory needs to be available 
without twisting oneself into knots.
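
As a point of comparison, rows can also be streamed to the driver through the DataFrame API itself. The following is only a rough sketch, assuming {{DataFrame.toLocalIterator}} is available and acceptable for the workload in the target environment; {{iter_first_column}} is a hypothetical helper name.

```python
def iter_first_column(df, limit=None):
    """Hypothetical helper: yield first-column values on the driver, row by row."""
    first = df.select(df.columns[0])
    if limit is not None:
        # Restrict how many rows are brought back to the driver.
        first = first.limit(limit)
    for row in first.toLocalIterator():
        # Each row holds one field because only one column was selected.
        yield row[0]


# Intended usage (needs a live DataFrame, so left as a comment):
# for my_value in iter_first_column(df, limit=1000):
#     ...  # Do pure Python transformations
```

This avoids the Pandas API but gives up its indexing conveniences such as {{iloc}} and {{loc}}, so it is not a substitute for the fix requested here.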

  was:
With RDDs unavailable under Spark Connect, the Pandas API is the only real way 
to exfiltrate records into Python. However, as of Spark 4, Pandas API access 
raises `PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE` for even read-only operations on 
a Spark DataFrame. 

Read-only operations, such as `loc` getters, dataframe subsets, etc, should not 
raise those errors.

For example:


```
# [ Spark Operations]
# Read the first column
for myValue in df.pandas_api().iloc[:, 0].to_numpy():
    # Do pure python transformations ....
```

should not raise an exception. Requiring a user to turn ANSI on and off for 
each of these operations is unreasonable and just will result in ANSI being 
turned off altogether. 

If that's unreasonable, a Spark Connect compatible way to load all/subset of 
results into driver memory needs to be available without twisting oneself into 
knots.


> PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE should not trigger on materializing 
> results
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-53662
>                 URL: https://issues.apache.org/jira/browse/SPARK-53662
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 4.0.0
>         Environment: Databricks Runtime v17.1
>            Reporter: Philip Kahn
>            Priority: Major


