itholic commented on code in PR #41149:
URL: https://github.com/apache/spark/pull/41149#discussion_r1193369528


##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -186,6 +186,65 @@ def arrow_to_pandas(self, arrow_column):
         else:
             return s
 
+    def _create_array(self, series, arrow_type):
+        """
+        Create an Arrow Array from the given pandas.Series and optional type.
+
+        Parameters
+        ----------
+        series : pandas.Series
+            A single series
+        arrow_type : pyarrow.DataType, optional
+            If None, pyarrow's inferred type will be used
+
+        Returns
+        -------
+        pyarrow.Array
+        """
+        import pyarrow as pa
+        from pyspark.sql.pandas.types import (
+            _check_series_convert_timestamps_internal,
+            _convert_dict_to_map_items,
+        )
+        from pandas.api.types import is_categorical_dtype
+
+        if hasattr(series.array, "__arrow_array__"):
+            mask = None
+        else:
+            mask = series.isnull()
+        # Ensure timestamp series are in expected form for Spark internal representation
+        if (
+            arrow_type is not None
+            and pa.types.is_timestamp(arrow_type)
+            and arrow_type.tz is not None
+        ):
+            series = _check_series_convert_timestamps_internal(series, self._timezone)
+        elif arrow_type is not None and pa.types.is_map(arrow_type):
+            series = _convert_dict_to_map_items(series)
+        elif arrow_type is None and is_categorical_dtype(series.dtype):
+            series = series.astype(series.dtypes.categories.dtype)
+        try:
+            return pa.Array.from_pandas(series, mask=mask, type=arrow_type, safe=self._safecheck)
+        except TypeError as e:
+            error_msg = (
+                "Exception thrown when converting pandas.Series (%s) "
+                "with name '%s' to Arrow Array (%s)."
+            )
+            raise TypeError(error_msg % (series.dtype, series.name, arrow_type)) from e
+        except ValueError as e:
+            error_msg = (
+                "Exception thrown when converting pandas.Series (%s) "
+                "with name '%s' to Arrow Array (%s)."
+            )
+            if self._safecheck:
+                error_msg = error_msg + (
+                    " It can be caused by overflows or other "
+                    "unsafe conversions warned by Arrow. Arrow safe type check "
+                    "can be disabled by using SQL config "
+                    "`spark.sql.execution.pandas.convertToArrowArraySafely`."
+                )
+            raise ValueError(error_msg % (series.dtype, series.name, arrow_type)) from e
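
A quick standalone illustration of the `safe=` flag that this error message refers to (an editor's sketch, not part of the patch; the sample values are arbitrary). PyArrow's `ArrowInvalid` subclasses `ValueError`, which is why the `except ValueError` branch above catches unsafe conversions:

```python
import pandas as pd
import pyarrow as pa

s = pd.Series([1.5, 300.0])

# With safe=False, 1.5 is silently truncated and 300 wraps around in int8.
print(pa.Array.from_pandas(s, type=pa.int8(), safe=False))

# With safe=True (the default), pyarrow rejects the lossy conversion by
# raising ArrowInvalid, a ValueError subclass.
try:
    pa.Array.from_pandas(s, type=pa.int8(), safe=True)
except ValueError as e:
    print("rejected:", e)
```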

Review Comment:
   Yeah, at least here I think we should raise `PySparkValueError`.
   The errors above seem to be generated by PyArrow itself, so I guess we may not be able to replace those with a `PySparkxxxError`.
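
   For reference, a minimal sketch of what that could look like: a hypothetical standalone analogue of `_create_array`'s `except ValueError` branch, assuming `PySparkValueError` accepts a plain message string like its `PySparkException` base does. The exact error-class plumbing is up to this PR:

   ```python
   import pandas as pd
   import pyarrow as pa
   from pyspark.errors import PySparkValueError

   def to_arrow(series: pd.Series, arrow_type: pa.DataType, safecheck: bool = True) -> pa.Array:
       # Hypothetical helper mirroring _create_array's conversion and error handling.
       try:
           return pa.Array.from_pandas(series, type=arrow_type, safe=safecheck)
       except ValueError as e:
           # Re-raise as a PySpark-specific error so callers can handle it uniformly.
           raise PySparkValueError(
               "Exception thrown when converting pandas.Series (%s) "
               "with name '%s' to Arrow Array (%s)." % (series.dtype, series.name, arrow_type)
           ) from e
   ```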



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

