This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new a77bb37b411 [SPARK-42722][CONNECT][PYTHON] Python Connect def schema() should not cache the schema
a77bb37b411 is described below
commit a77bb37b4112543fcd77a7f6091e465daeb3f8ae
Author: Rui Wang <[email protected]>
AuthorDate: Thu Mar 9 09:41:08 2023 +0900
[SPARK-42722][CONNECT][PYTHON] Python Connect def schema() should not cache the schema
### What changes were proposed in this pull request?
As of now, the Connect Python client caches the schema when `def schema()` is called. However, this can cause a stale-data issue. For example:
```
1. Create table
2. table.schema
3. drop table and recreate the table with different schema
4. table.schema // now this is incorrect
```
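The stale-read scenario above can be sketched with a toy stand-in. This is a minimal illustration, not Spark code: `FakeClient`, `CachingFrame`, and `FreshFrame` are hypothetical names, and the string schemas stand in for real `StructType` results.

```python
class FakeClient:
    """Stands in for the Connect client; the schema lives server-side."""

    def __init__(self) -> None:
        self.server_schema = "id INT"

    def schema(self) -> str:
        return self.server_schema


class CachingFrame:
    """Old behavior: cache the first schema seen and never refresh it."""

    def __init__(self, client: FakeClient) -> None:
        self._client = client
        self._schema = None

    @property
    def schema(self) -> str:
        if self._schema is None:
            self._schema = self._client.schema()
        return self._schema


class FreshFrame:
    """New behavior: ask the server on every access."""

    def __init__(self, client: FakeClient) -> None:
        self._client = client

    @property
    def schema(self) -> str:
        return self._client.schema()


client = FakeClient()
cached, fresh = CachingFrame(client), FreshFrame(client)
print(cached.schema, "|", fresh.schema)  # id INT | id INT

# Table is dropped and recreated with a different schema on the server.
client.server_schema = "id INT, name STRING"
print(cached.schema)  # id INT  (stale -- step 4 in the scenario above)
print(fresh.schema)   # id INT, name STRING  (always current)
```

The trade-off is one extra analysis round trip per `schema` access, in exchange for never returning a schema that no longer matches the server's catalog.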
### Why are the changes needed?
Fixes the behavior where the cached schema could become stale.
### Does this PR introduce _any_ user-facing change?
Yes. This is a fix: users can now always see the most up-to-date schema.
### How was this patch tested?
Existing UT
Closes #40343 from amaliujia/disable_cache.
Authored-by: Rui Wang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/sql/connect/dataframe.py | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/python/pyspark/sql/connect/dataframe.py
b/python/pyspark/sql/connect/dataframe.py
index f8b92cdc7ae..504b83d1165 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1367,17 +1367,13 @@ class DataFrame:
 
     @property
     def schema(self) -> StructType:
-        if self._schema is None:
-            if self._plan is not None:
-                query = self._plan.to_proto(self._session.client)
-                if self._session is None:
-                    raise Exception("Cannot analyze without SparkSession.")
-                self._schema = self._session.client.schema(query)
-                return self._schema
-            else:
-                raise Exception("Empty plan.")
+        if self._plan is not None:
+            query = self._plan.to_proto(self._session.client)
+            if self._session is None:
+                raise Exception("Cannot analyze without SparkSession.")
+            return self._session.client.schema(query)
         else:
-            return self._schema
+            raise Exception("Empty plan.")
 
     schema.__doc__ = PySparkDataFrame.schema.__doc__
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]