Jungtaek Lim created SPARK-41379:
------------------------------------
Summary: Inconsistency of spark session in DataFrame in user
function for foreachBatch sink in PySpark
Key: SPARK-41379
URL: https://issues.apache.org/jira/browse/SPARK-41379
Project: Spark
Issue Type: Bug
Components: PySpark, Structured Streaming
Affects Versions: 3.3.2, 3.4.0
Reporter: Jungtaek Lim
[https://docs.databricks.com/_static/notebooks/merge-in-streaming.html]
Manual testing against the code example above in PySpark suggests that the
sparkSession property of the DataFrame passed to the user function is not the
same as the cloned session used by the streaming query. In other words,
{{df.sparkSession}} does not seem to be the same as the cloned Spark session,
which you can access via {{df._jdf.sparkSession()}}.
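A minimal sketch of how the mismatch could be checked from inside the user function. The helper names {{sessions_match}} and {{run_demo}} are hypothetical, and {{_jdf}} / {{_jsparkSession}} are internal PySpark attributes that may change between versions, so treat this as an illustration rather than a supported API:

```python
def sessions_match(batch_df):
    """Return True if both routes to a session reach the same JVM SparkSession.

    Relies on PySpark internals (_jdf, _jsparkSession); these are not public API.
    """
    # JVM session behind the Python-side property df.sparkSession
    jvm_session_of_python_side = batch_df.sparkSession._jsparkSession
    # JVM session attached to the underlying Dataset (the cloned session
    # of the streaming query, according to the report above)
    jvm_session_of_dataset = batch_df._jdf.sparkSession()
    # Java Object.equals(), invoked through py4j
    return bool(jvm_session_of_dataset.equals(jvm_session_of_python_side))


def run_demo(seconds=10):
    """Hypothetical driver: runs a local 'rate' stream and prints the check per batch."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    def on_batch(batch_df, batch_id):
        print(batch_id, sessions_match(batch_df))

    query = stream.writeStream.foreachBatch(on_batch).start()
    query.awaitTermination(seconds)
    query.stop()
```

Calling {{run_demo()}} on an affected version would be expected to print False for each batch if the behaviour described above holds, but this has not been verified here.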
So which session gets picked depends on how each method of the PySpark
DataFrame is implemented, which users have no way of knowing. If a method
picks a different session than expected, this opens a backdoor for bypassing
restrictions (e.g. AQE), makes session-scoped resources (e.g. temp views)
invisible, etc.
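To illustrate the temp view case, the sketch below registers a view from the batch DataFrame and asks each session's catalog whether it can see it. The helper name {{temp_view_visibility}} and the view name "updates" are hypothetical, and the code again leans on internal attributes ({{_jdf}}, {{_jsparkSession}}), so it is an assumption-laden probe, not a recommended pattern:

```python
def temp_view_visibility(batch_df, view_name="updates"):
    """Register a temp view from batch_df and report which session can see it.

    Uses PySpark internals (_jdf, _jsparkSession); not public API.
    """
    # Which session receives the view depends on how PySpark implements this method.
    batch_df.createOrReplaceTempView(view_name)

    jvm_session_of_python_side = batch_df.sparkSession._jsparkSession
    jvm_session_of_dataset = batch_df._jdf.sparkSession()

    # Catalog.tableExists covers temporary views as well as tables.
    return {
        "df.sparkSession": bool(
            jvm_session_of_python_side.catalog().tableExists(view_name)
        ),
        "df._jdf.sparkSession()": bool(
            jvm_session_of_dataset.catalog().tableExists(view_name)
        ),
    }
```

Calling this from a foreachBatch user function and printing the result would show whether the two sessions disagree about the view; if they do, SQL issued through the "wrong" session would fail to resolve it.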
So it’s quite critical to sync the two so that they refer to the same session.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)