[ 
https://issues.apache.org/jira/browse/SPARK-41379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41379:
------------------------------------

    Assignee: Apache Spark

> Inconsistency of spark session in DataFrame in user function for foreachBatch 
> sink in PySpark
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41379
>                 URL: https://issues.apache.org/jira/browse/SPARK-41379
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Structured Streaming
>    Affects Versions: 3.3.2, 3.4.0
>            Reporter: Jungtaek Lim
>            Assignee: Apache Spark
>            Priority: Major
>
> [https://docs.databricks.com/_static/notebooks/merge-in-streaming.html]
> According to manual testing against the code example linked above in 
> PySpark, the sparkSession property of the DataFrame passed to the user 
> function does not appear to be the cloned session used by the streaming 
> query. In other words, {{df.sparkSession}} does not seem to be the same as 
> the cloned Spark session that you can access via {{df._jdf.sparkSession()}}.
> Which session gets used therefore depends on how each PySpark DataFrame 
> method is implemented, which users have no way of knowing. If a method picks 
> a different session than expected, it can open a backdoor around 
> restrictions (e.g. AQE), fail to see session-scoped resources (e.g. temp 
> views), etc.
> So it is critical to sync the two so that they refer to the same session.
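
The mismatch described above could be checked from inside a foreachBatch user function with something like the sketch below. The helper name sessions_match and the function process_batch are hypothetical, and the check relies on the private attributes _jdf and _jsparkSession, which are PySpark internals and may change between versions.

```python
def sessions_match(df):
    """Return True if df.sparkSession wraps the same JVM session as the
    underlying Java Dataset.

    Hypothetical helper; `_jdf` and `_jsparkSession` are PySpark internals.
    """
    # JVM session object behind the Python-side df.sparkSession
    python_side = df.sparkSession._jsparkSession
    # JVM session that the underlying Java Dataset belongs to
    jvm_side = df._jdf.sparkSession()
    return python_side == jvm_side


def process_batch(batch_df, batch_id):
    # User function intended for foreachBatch; reports a mismatch if one exists.
    if not sessions_match(batch_df):
        print(f"batch {batch_id}: df.sparkSession differs from "
              "df._jdf.sparkSession()")
```

Assuming a streaming DataFrame named stream_df (not part of the report), the function would be attached with stream_df.writeStream.foreachBatch(process_batch).start().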



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
