[
https://issues.apache.org/jira/browse/SPARK-53662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Philip Kahn updated SPARK-53662:
--------------------------------
Description:
With RDDs unavailable under Spark Connect, the pandas API on Spark is the only
practical way to pull records out into plain Python. However, as of Spark 4,
pandas API access raises {{PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE}} even for
read-only operations on a Spark DataFrame.
Read-only operations, such as {{loc}} getters, DataFrame subsets, etc., should
not raise this error.
For example:
{code:python}
# [ Spark Operations]
# Read the first column
for myValue in df.pandas_api().iloc[:, 0].to_numpy():
    pass  # do pure-Python transformations here
{code}
should not raise an exception. Requiring the user to toggle ANSI mode on and
off around each of these operations is unreasonable and will simply result in
ANSI mode being turned off altogether.
Failing that, a Spark Connect-compatible way to load all or a subset of
results into driver memory needs to be available without twisting oneself into
knots.
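For context, the workaround available today looks roughly like the sketch below. This assumes a live Spark session; the {{compute.fail_on_ansi_mode}} option name is taken from the error message itself, and {{toPandas()}} is suggested only as a Connect-friendly alternative for materializing results, not as a fix for this issue.
{code:python}
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Workaround: disable the pandas-on-Spark ANSI check around the read-only
# access, then restore it. The option name comes from the error message;
# behavior may differ across runtimes.
ps.set_option("compute.fail_on_ansi_mode", False)
try:
    values = df.pandas_api().iloc[:, 0].to_numpy()
finally:
    ps.reset_option("compute.fail_on_ansi_mode")

for my_value in values:
    pass  # pure-Python transformations on each value

# Alternative that skips the pandas-on-Spark layer entirely:
# DataFrame.toPandas() is supported under Spark Connect.
first_col = df.select(df.columns[0]).toPandas()
{code}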
> PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE should not trigger on materializing
> results
> ---------------------------------------------------------------------------------
>
> Key: SPARK-53662
> URL: https://issues.apache.org/jira/browse/SPARK-53662
> Project: Spark
> Issue Type: Bug
> Components: Pandas API on Spark
> Affects Versions: 4.0.0
> Environment: Databricks Runtime v17.1
> Reporter: Philip Kahn
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)