WweiL commented on code in PR #46853:
URL: https://github.com/apache/spark/pull/46853#discussion_r1630009086


##########
python/pyspark/sql/tests/pandas/test_pandas_grouped_map_with_state.py:
##########
@@ -141,7 +138,7 @@ def func(key, pdf_iter, state):
             yield pd.DataFrame({"key": [], "countAsString": []})
 
         def check_results(batch_df, _):
-            self.assertTrue(len(set(batch_df.sort("key").collect())) == 0)
+            assert len(set(batch_df.sort("key").collect())) == 0

Review Comment:
   Yes, the test failure before was a pickle serialization error:
   ```
   ======================================================================
   ERROR [0.038s]: test_apply_in_pandas_with_state_basic_no_state_no_data (pyspark.sql.tests.connect.test_parity_pandas_grouped_map_with_state.GroupedApplyInPandasWithStateTests)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/__w/oss-spark/oss-spark/python/pyspark/sql/tests/pandas/test_pandas_grouped_map_with_state.py", line 146, in test_apply_in_pandas_with_state_basic_no_state_no_data
       self._test_apply_in_pandas_with_state_basic(func, check_results)
     File "/__w/oss-spark/oss-spark/python/pyspark/sql/tests/pandas/test_pandas_grouped_map_with_state.py", line 85, in _test_apply_in_pandas_with_state_basic
       df.groupBy(df["value"])
     File "/__w/oss-spark/oss-spark/python/pyspark/sql/connect/streaming/readwriter.py", line 500, in foreachBatch
       self._write_proto.foreach_batch.python_function.command = CloudPickleSerializer().dumps(
     File "/__w/oss-spark/oss-spark/python/pyspark/serializers.py", line 459, in dumps
       return cloudpickle.dumps(obj, pickle_protocol)
     File "/__w/oss-spark/oss-spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
       cp.dump(obj)
     File "/__w/oss-spark/oss-spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 632, in dump
       return Pickler.dump(self, obj)
     File "/__w/oss-spark/oss-spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 343, in _file_reduce
       raise pickle.PicklingError(
   _pickle.PicklingError: Cannot pickle files that are not opened for reading: w
   ```
   
   I don't know exactly what happens inside the serialization, but the reference to `self` in the callback definitely caused the issue: it pulls the whole test class instance into the closure, and that instance isn't in a good state to be serialized. So I'm switching it to a plain `assert`, which has no free variables.
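
   To illustrate the difference, here is a minimal, Spark-free sketch (the `FakeTest` class and its methods are hypothetical, standing in for a `unittest.TestCase`): referencing `self` inside the nested function puts the entire test instance into the function's `__closure__`, which cloudpickle would then try to serialize along with the function, while a plain `assert` leaves the closure empty.

```python
class FakeTest:
    """Hypothetical stand-in for a unittest.TestCase subclass."""

    def assertTrue(self, cond):
        assert cond

    def make_checkers(self):
        def check_with_self(batch):
            # References `self` -> the whole test instance lands in the closure
            # and gets dragged into pickling of this function.
            self.assertTrue(len(batch) == 0)

        def check_plain(batch):
            # No free variables -> nothing extra for cloudpickle to serialize.
            assert len(batch) == 0

        return check_with_self, check_plain


with_self, plain = FakeTest().make_checkers()
print(with_self.__closure__)  # a tuple with one cell holding the FakeTest instance
print(plain.__closure__)      # None
```

   This is why the failure surfaced only on Spark Connect: `foreachBatch` there ships the callback to the server via `CloudPickleSerializer`, so whatever the closure captures must itself be picklable.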
   
   This reminds me that I should also add a check that there is no `query.exception()`: when that plain assert throws inside the batch callback, the query could silently stop with an error, so let me add that check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

