[
https://issues.apache.org/jira/browse/SPARK-41379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-41379:
------------------------------------
Assignee: Apache Spark
> Inconsistency of spark session in DataFrame in user function for foreachBatch
> sink in PySpark
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-41379
> URL: https://issues.apache.org/jira/browse/SPARK-41379
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Structured Streaming
> Affects Versions: 3.3.2, 3.4.0
> Reporter: Jungtaek Lim
> Assignee: Apache Spark
> Priority: Major
>
> [https://docs.databricks.com/_static/notebooks/merge-in-streaming.html]
> Manual testing of the code example above in PySpark suggests that the
> sparkSession property of the DataFrame passed to the user function is not
> the same as the cloned session used by the streaming query. In other words,
> {{df.sparkSession}} does not appear to be the same as the cloned Spark
> session, which you can access via {{df._jdf.sparkSession()}}.
> So which session gets picked depends on how each method in the PySpark
> DataFrame is implemented, which users cannot be expected to know. If a
> different session than expected is picked, this opens a backdoor for
> bypassing restrictions (e.g. AQE), makes session-scoped resources (e.g. temp
> views) invisible, etc.
> So it is quite critical to sync the two so that they refer to the same session.
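
One way to observe the mismatch described above is to compare, inside the foreachBatch user function, the JVM session behind {{df.sparkSession}} with the one returned by {{df._jdf.sparkSession()}}. The following is only a minimal sketch: the helper names sessions_match and log_session_mismatch are hypothetical, and it assumes a PySpark version in which the Python session exposes its JVM counterpart through the private _jsparkSession attribute.

```python
def sessions_match(batch_df):
    """Return True if df.sparkSession wraps the same JVM session as the DataFrame itself."""
    # JVM session behind the Python-side session (private attribute, assumed available)
    python_side = batch_df.sparkSession._jsparkSession
    # JVM session the underlying Java DataFrame belongs to (the cloned session in a streaming query)
    jvm_side = batch_df._jdf.sparkSession()
    # Both are Java objects, so compare them with Java's Object.equals
    return bool(python_side.equals(jvm_side))


def log_session_mismatch(batch_df, batch_id):
    """foreachBatch callback (hypothetical) that reports when the two sessions disagree."""
    if not sessions_match(batch_df):
        print(f"batch {batch_id}: df.sparkSession differs from df._jdf.sparkSession()")
```

With a streaming DataFrame stream_df (assumed here), the callback could be attached via stream_df.writeStream.foreachBatch(log_session_mismatch).start(); a message per micro-batch would indicate that the two sessions are out of sync.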
--
This message was sent by Atlassian Jira
(v8.20.10#820010)