This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new a77bb37b411 [SPARK-42722][CONNECT][PYTHON] Python Connect def schema() should not cache the schema
a77bb37b411 is described below

commit a77bb37b4112543fcd77a7f6091e465daeb3f8ae
Author: Rui Wang <[email protected]>
AuthorDate: Thu Mar 9 09:41:08 2023 +0900

    [SPARK-42722][CONNECT][PYTHON] Python Connect def schema() should not cache the schema
    
    ### What changes were proposed in this pull request?
    
    Currently, the Connect Python client caches the schema when `def schema()` is called. However, this can return stale data. For example:
    ```
    1. Create table
    2. table.schema
    3. drop table and recreate the table with different schema
    4. table.schema // now this is incorrect
    ```
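    The staleness in the steps above can be illustrated with a minimal, self-contained sketch. Note that `CachingTable`, `FreshTable`, and the dict-based catalog below are hypothetical stand-ins for illustration only, not the actual Spark Connect client classes:

    ```python
    # Hypothetical illustration of the caching bug; not the real Connect client.

    class CachingTable:
        """Caches the schema on first access (the old behavior)."""

        def __init__(self, catalog, name):
            self._catalog = catalog
            self._name = name
            self._schema = None

        @property
        def schema(self):
            if self._schema is None:
                self._schema = self._catalog[self._name]
            return self._schema

    class FreshTable(CachingTable):
        """Re-fetches the schema on every access (the patched behavior)."""

        @property
        def schema(self):
            return self._catalog[self._name]

    catalog = {"t": "id INT"}
    cached = CachingTable(catalog, "t")
    fresh = FreshTable(catalog, "t")

    # Step 2: both see the original schema.
    assert cached.schema == "id INT"
    assert fresh.schema == "id INT"

    # Step 3: drop and recreate the table with a different schema.
    catalog["t"] = "id INT, name STRING"

    # Step 4: the cached table is now stale; the fresh one is correct.
    print(cached.schema)  # id INT
    print(fresh.schema)   # id INT, name STRING
    ```

    Dropping the cache trades a repeated analyze round-trip for correctness, which matches what the diff below does.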
    
    ### Why are the changes needed?
    
    Fixes the behavior where the cached schema could become stale.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. This is a fix: users will now always see the most up-to-date schema.
    
    ### How was this patch tested?
    
    Existing UT
    
    Closes #40343 from amaliujia/disable_cache.
    
    Authored-by: Rui Wang <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
---
 python/pyspark/sql/connect/dataframe.py | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py
index f8b92cdc7ae..504b83d1165 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1367,17 +1367,13 @@ class DataFrame:
 
     @property
     def schema(self) -> StructType:
-        if self._schema is None:
-            if self._plan is not None:
-                query = self._plan.to_proto(self._session.client)
-                if self._session is None:
-                    raise Exception("Cannot analyze without SparkSession.")
-                self._schema = self._session.client.schema(query)
-                return self._schema
-            else:
-                raise Exception("Empty plan.")
+        if self._plan is not None:
+            query = self._plan.to_proto(self._session.client)
+            if self._session is None:
+                raise Exception("Cannot analyze without SparkSession.")
+            return self._session.client.schema(query)
         else:
-            return self._schema
+            raise Exception("Empty plan.")
 
     schema.__doc__ = PySparkDataFrame.schema.__doc__
 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
