ueshin commented on a change in pull request #32592:
URL: https://github.com/apache/spark/pull/32592#discussion_r645108695



##########
File path: python/pyspark/pandas/data_type_ops/base.py
##########
@@ -166,3 +175,11 @@ def rmod(self, left, right) -> Union["Series", "Index"]:
 
     def rpow(self, left, right) -> Union["Series", "Index"]:
         raise TypeError("Exponentiation can not be applied to %s." % self.pretty_name)
+
+    def restore(self, col):
+        """Restore column when to_pandas."""
+        return col
+
+    def prepare(self, col):

Review comment:
       ditto.

##########
File path: python/pyspark/pandas/data_type_ops/base.py
##########
@@ -166,3 +175,11 @@ def rmod(self, left, right) -> Union["Series", "Index"]:
 
     def rpow(self, left, right) -> Union["Series", "Index"]:
         raise TypeError("Exponentiation can not be applied to %s." % self.pretty_name)
+
+    def restore(self, col):

Review comment:
       Could you add type annotations?
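    A minimal sketch of what the annotated hooks could look like, assuming both `restore` and `prepare` take and return a pandas `Series` (the `prepare` body below is a placeholder, since the hunk above cuts it off):

    ```py
    import pandas as pd


    class DataTypeOps:
        def restore(self, col: pd.Series) -> pd.Series:
            """Restore column when to_pandas."""
            return col

        def prepare(self, col: pd.Series) -> pd.Series:
            """Prepare column when from_pandas."""
            return col  # placeholder; the diff hunk cuts off the real body
    ```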

##########
File path: python/pyspark/pandas/data_type_ops/categorical_ops.py
##########
@@ -26,3 +29,13 @@ class CategoricalOps(DataTypeOps):
     @property
     def pretty_name(self) -> str:
         return 'categoricals'
+
+    def restore(self, col):
+        """Restore column when to_pandas."""
+        return pd.Categorical.from_codes(
+            col, categories=self.dtype.categories, ordered=self.dtype.ordered
+        )
+
+    def prepare(self, col):
+        """Prepare column when from_pandas."""
+        return col.cat.codes.replace({np.nan: None})

Review comment:
       I guess we don't need `.replace({np.nan: None})`? 
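    For reference, `cat.codes` encodes missing categories as the integer `-1` rather than `NaN`, which is why the `.replace` looks redundant; a quick standalone check:

    ```py
    import pandas as pd

    s = pd.Series(pd.Categorical(["a", None, "b"]))
    print(s.cat.codes.tolist())  # [0, -1, 1] -- the missing value is coded as -1
    print(s.cat.codes.dtype)     # int8, so no np.nan can appear in the codes
    ```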

##########
File path: python/pyspark/pandas/tests/indexes/test_base.py
##########
@@ -1254,31 +1254,32 @@ def test_monotonic(self):
             self.assert_eq(psmidx.is_monotonic_decreasing, False)
 
         else:
-            [(-5, None), (-4, None), (-3, None), (-2, None), (-1, None)]
+            # [(-5, None), (-4, None), (-3, None), (-2, None), (-1, None)]

Review comment:
       Ah, these were originally comments. Maybe we should add a note so it's clear they are comments, e.g.:
   
   ```py
   # For [(-5, None), ...
   ```

##########
File path: python/pyspark/pandas/tests/test_internal.py
##########
@@ -58,6 +58,20 @@ def test_from_pandas(self):
 
         self.assert_eq(internal.to_pandas_frame, pdf1)
 
+        # categorical column
+        pdf2 = pd.DataFrame({0: [1, 2, 3], 1: pd.Categorical([4, 5, 6])})
+        internal = InternalFrame.from_pandas(pdf2)
+        sdf = internal.spark_frame
+
+        self.assert_eq(internal.index_spark_column_names, [SPARK_DEFAULT_INDEX_NAME])
+        self.assert_eq(internal.index_names, [None])
+        self.assert_eq(internal.column_labels, [(0,), (1,)])
+        self.assert_eq(internal.data_spark_column_names, ["0", "1"])
+        self.assertTrue(internal.spark_column_for((0,))._jc.equals(sdf["0"]._jc))
+        self.assertTrue(internal.spark_column_for((1,))._jc.equals(sdf["1"]._jc))

Review comment:
       You can use a `spark_column_equals` util function. #32680
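    For example, assuming the helper from #32680 is importable as `pyspark.pandas.utils.spark_column_equals`, the two assertions could become:

    ```py
    from pyspark.pandas.utils import spark_column_equals  # helper from #32680

    self.assertTrue(spark_column_equals(internal.spark_column_for((0,)), sdf["0"]))
    self.assertTrue(spark_column_equals(internal.spark_column_for((1,)), sdf["1"]))
    ```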



