[GitHub] [spark] Yikun commented on a diff in pull request #36981: [SPARK-39574][PS] Better error message when `ps.Index` is used for DataFrame/Series creation

GitBox Fri, 24 Jun 2022 18:16:23 -0700


Yikun commented on code in PR #36981:
URL: https://github.com/apache/spark/pull/36981#discussion_r906613529



##########
python/pyspark/pandas/tests/test_series.py:
##########
@@ -54,6 +54,13 @@ def pser(self):
     def psser(self):
         return ps.from_pandas(self.pser)
 
+    def test_creation_index(self):
+        with self.assertRaisesRegex(
+            TypeError,
+            "The given index cannot be a pandas-on-Spark index. Try pandas.Index or array-like.",
+        ):
+            ps.Series([1, 2], index=ps.Index([1, 2]))

Review Comment:
   ```
   ps.Series([1, 2, 3], index=ps.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "z")]))
   ```
   
   nit: Maybe we also want to test subclasses of ps.Index?
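
   A minimal sketch of why such a test should pass. It uses stand-in classes, not the real pyspark.pandas types. The names `Index`, `MultiIndex`, `check_index`, and `raises_type_error` are hypothetical and only mirror the shape of the guard in this PR. Because the guard uses `isinstance`, any subclass of the base index class is rejected as well:

```python
# Stand-in classes only; these are NOT the real pyspark.pandas types.
class Index:
    """Hypothetical stand-in for pyspark.pandas.indexes.base.Index."""


class MultiIndex(Index):
    """Hypothetical stand-in for a subclass such as ps.MultiIndex."""


def check_index(index):
    # Mirrors the guard added in this PR: isinstance() is True for
    # instances of Index and for instances of any Index subclass.
    if isinstance(index, Index):
        raise TypeError(
            "The given index cannot be a pandas-on-Spark index. "
            "Try pandas.Index or array-like."
        )
    return index


def raises_type_error(index):
    # Small helper for the illustration: report whether the guard fires.
    try:
        check_index(index)
    except TypeError:
        return True
    return False
```

   With these stand-ins, `raises_type_error(MultiIndex())` is true just like `raises_type_error(Index())`, while a plain list passes through unchanged.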



##########
python/pyspark/pandas/series.py:
##########
@@ -405,6 +405,14 @@ def __init__(  # type: ignore[no-untyped-def]
                 assert not fastpath
                 s = data
             else:
+                from pyspark.pandas.indexes.base import Index
+
+                if isinstance(index, Index):
+                    raise TypeError(

Review Comment:
   Before this change, ps raised a `ValueError` (the same behavior as pandas; see the example below).
   ```
   pd.Series([1, 2, 3], index=ps.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "z")]))
   ValueError: The truth value of a MultiIndex is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
   ```
   
   This is a user-facing behavior change. I think it's reasonable, but we might want to mention it in the migration guide?
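
   To make the migration concern concrete, here is a rough sketch with stand-in code, not the real pyspark.pandas implementation. `FakePsIndex`, `create_before`, `create_after`, and `error_name` are hypothetical names. The "before" path assumes the old failure came from a truthiness check on the index, which is what the `ValueError` message above suggests:

```python
class FakePsIndex:
    """Hypothetical stand-in for a pandas-on-Spark index object."""

    def __bool__(self):
        # Assumption: the old error surfaced from a truthiness test.
        raise ValueError("The truth value of a MultiIndex is ambiguous.")


def create_before(index):
    # Assumed old behavior: no explicit guard, so the index is
    # eventually evaluated in a boolean context and fails there.
    if index:
        pass
    return "created"


def create_after(index):
    # New behavior in this PR: an explicit type check fails first.
    if isinstance(index, FakePsIndex):
        raise TypeError(
            "The given index cannot be a pandas-on-Spark index. "
            "Try pandas.Index or array-like."
        )
    return "created"


def error_name(func, index):
    # Return the name of the exception class raised, or None.
    try:
        func(index)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None
```

   User code that wraps the constructor in `except ValueError` would stop catching the failure after this change, which is the case a migration note would cover.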



##########
python/pyspark/pandas/tests/test_dataframe.py:
##########
@@ -96,6 +96,13 @@ def test_dataframe(self):
         index_cols = pdf.columns[column_mask]
         self.assert_eq(psdf[index_cols], pdf[index_cols])
 
+    def test_creation_index(self):
+        with self.assertRaisesRegex(
+            TypeError,
+            "The given index cannot be a pandas-on-Spark index. Try pandas.Index or array-like.",
+        ):
+            ps.DataFrame([1, 2], index=ps.Index([1, 2]))

Review Comment:
   ditto



##########
python/pyspark/pandas/series.py:
##########
@@ -405,6 +405,14 @@ def __init__(  # type: ignore[no-untyped-def]
                 assert not fastpath
                 s = data
             else:
+                from pyspark.pandas.indexes.base import Index
+
+                if isinstance(index, Index):
+                    raise TypeError(
+                        "The given index cannot be a pandas-on-Spark index. "
+                        "Try pandas.Index or array-like."

Review Comment:
   ```suggestion
                           "Try pandas index or array-like."
   ```
   
   nit: Consider Index, MultiIndex, CategoricalIndex....so, maybe just
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
