[GitHub] [spark] itholic commented on a diff in pull request #42956: [SPARK-43654][CONNECT][PS][TESTS] Enable `InternalFrameParityTests.test_from_pandas`

via GitHub Sun, 17 Sep 2023 20:00:03 -0700


itholic commented on code in PR #42956:
URL: https://github.com/apache/spark/pull/42956#discussion_r1328210599



##########
python/pyspark/pandas/tests/connect/test_parity_internal.py:
##########
@@ -15,18 +15,86 @@
 # limitations under the License.
 #
 import unittest
+import pandas as pd
 
 from pyspark.pandas.tests.test_internal import InternalFrameTestsMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
 from pyspark.testing.pandasutils import PandasOnSparkTestUtils
+from pyspark.pandas.internal import (
+    InternalFrame,
+    SPARK_DEFAULT_INDEX_NAME,
+    SPARK_INDEX_NAME_FORMAT,
+)
+from pyspark.pandas.utils import spark_column_equals
 
 
 class InternalFrameParityTests(
     InternalFrameTestsMixin, PandasOnSparkTestUtils, ReusedConnectTestCase
 ):
-    @unittest.skip("TODO(SPARK-43654): Enable InternalFrameParityTests.test_from_pandas.")
     def test_from_pandas(self):
-        super().test_from_pandas()
+        pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

Review Comment:
   I copied this from [test_internal.py](https://github.com/apache/spark/blob/master/python/pyspark/pandas/tests/test_internal.py#L31-L107), excluding the assertions that rely on `spark_column_equals`, e.g. `self.assertTrue(spark_column_equals(internal.spark_column_for(("a",)), sdf["a"]))`.
   
   The reason is that `spark_column_equals` currently works differently under Spark Connect than in Non-Connect mode: we can't compare the underlying Column objects directly, as shown below:
   
   **Non-Connect**
   ```python
   >>> sdf = spark.range(10)
   >>> sdf.id._jc.equals(sdf.id._jc)
   True
   ```
   
   **Connect**
   ```python
   >>> sdf = spark.range(10)
   >>> sdf.id._jc.equals(sdf.id._jc)
   # [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jc` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.
   ```
   
   On second thought, given the comments, maybe we should find a proper way to compare the Column objects instead of separating the tests.
   
   Our current way of comparing two Column objects in Spark Connect relies on comparing the `repr` of each Column. That is a bit hacky, so we should fix it even though it works properly in our current code base.
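
To illustrate why a `repr`-based comparison is fragile, here is a minimal sketch using a stand-in class rather than the real pyspark Column. `FakeColumn`, `plan_id`, and `columns_equal_by_repr` are hypothetical names invented for this example, and the assumption that the `repr` omits the originating DataFrame belongs to the stand-in only; the sketch does not establish how the actual Connect `repr` behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class FakeColumn:
    """Hypothetical stand-in for a Spark Connect Column (not the pyspark class)."""

    name: str
    plan_id: int  # assumption: identifies the DataFrame the column came from

    def __repr__(self) -> str:
        # Only the name appears in the repr, so plan_id is invisible here.
        return f"Column<'{self.name}'>"


def columns_equal_by_repr(left: FakeColumn, right: FakeColumn) -> bool:
    """Treat two columns as equal when their repr strings match."""
    return repr(left) == repr(right)


# Same name, same (hypothetical) DataFrame: equal, as expected.
same = columns_equal_by_repr(FakeColumn("id", 1), FakeColumn("id", 1))

# Different names: not equal, as expected.
different_name = columns_equal_by_repr(FakeColumn("id", 1), FakeColumn("a", 1))

# Same name but different (hypothetical) DataFrames: still reported equal,
# because the information that distinguishes them never reaches the repr.
different_plan = columns_equal_by_repr(FakeColumn("id", 1), FakeColumn("id", 2))
```

In this sketch the check is only as precise as whatever the `repr` happens to print, which is the motivation for looking for a structural comparison instead.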
   
   @zhengruifeng May I ask your thoughts on this? In Non-Connect mode we compare equality by directly accessing the Java object; do you think an equivalent operation is possible in Spark Connect?



