ueshin opened a new pull request, #52616:
URL: https://github.com/apache/spark/pull/52616
### What changes were proposed in this pull request?
Fixes observations on Spark Connect with plan cache enabled.
### Why are the changes needed?
Observations on Spark Connect can fail to return the observed
values when the plan cache is enabled.
```py
>>> from pyspark.sql.observation import Observation
>>> from pyspark.sql import functions as F
>>>
>>> df = spark.range(10)
>>> observation = Observation()
>>> observed_df = df.observe(observation, F.count(F.lit(1)).alias("cnt"))
>>>
>>> observed_df.schema  # causes the plan to be cached
StructType([StructField('id', LongType(), False)])
>>>
>>> observed_df.show()
+---+
| id|
+---+
...
+---+
>>>
>>> observation.get
{}
```
This should be:
```py
>>> observation.get
{'cnt': 10}
```
This happens because the plan cached by the Analyze request for
`observed_df.schema` does not register the new `Observation`, so the later
execution reuses the cached plan without it.
The observations should instead be registered at execution time.
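The failure mode can be illustrated with a small self-contained sketch (this is a conceptual model, not the actual Spark Connect implementation; all class and function names here are hypothetical). Anything bundled into a plan at analyze time is frozen into the cache, so observations must be attached per execution instead:

```python
# Conceptual sketch of the plan-cache issue. Names are hypothetical,
# not Spark Connect internals.

class PlanCache:
    """Caches analyzed plans keyed by the logical plan."""

    def __init__(self):
        self._cache = {}

    def get_or_analyze(self, plan_key, analyze):
        # Whatever state is captured during analysis (e.g. observation
        # registrations) is reused verbatim on every later request.
        if plan_key not in self._cache:
            self._cache[plan_key] = analyze()
        return self._cache[plan_key]


class Observation:
    """Minimal stand-in for pyspark.sql.observation.Observation."""

    def __init__(self):
        self.value = None

    def update(self, value):
        self.value = value


def execute(cached_plan, observations, metric_results):
    # Fixed behavior: observations are registered at execution time,
    # so they receive metrics even when the plan came from the cache.
    for name, observation in observations.items():
        observation.update(metric_results[name])


cache = PlanCache()
# First request (schema): the plan gets cached without any observation.
plan = cache.get_or_analyze("df.observe(...)", lambda: "analyzed-plan")

# Later request (show): the same cached plan is executed, but the
# observation is registered now, so its value is populated.
observation = Observation()
execute(plan, {"cnt": observation}, {"cnt": 10})
print(observation.value)  # 10
```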
### Does this PR introduce _any_ user-facing change?
Yes, the observed values will now be available on Spark Connect when the
plan cache is enabled.
### How was this patch tested?
Modified the related tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]