[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

GitBox Tue, 09 Jun 2020 23:52:03 -0700


viirya commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437896443




##########
File path: python/pyspark/sql/pandas/serializers.py
##########
@@ -150,15 +151,22 @@ def _create_batch(self, series):
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s 
in series)
 
         def create_array(s, t):
-            mask = s.isnull()
+            # Create with __arrow_array__ if the series' backing array 
implements it
+            series_array = getattr(s, 'array', s._values)
+            if hasattr(series_array, "__arrow_array__"):
+                return series_array.__arrow_array__(type=t)
+
             # Ensure timestamp series are in expected form for Spark internal 
representation
             if t is not None and pa.types.is_timestamp(t):
                 s = _check_series_convert_timestamps_internal(s, 
self._timezone)
-            elif type(s.dtype) == pd.CategoricalDtype:
+            elif is_categorical_dtype(s.dtype):
                 # Note: This can be removed once minimum pyarrow version is >= 
0.16.1
                 s = s.astype(s.dtypes.categories.dtype)
             try:
-                array = pa.Array.from_pandas(s, mask=mask, type=t, 
safe=self._safecheck)
+                mask = s.isnull()
+                # pass _ndarray_values to avoid potential failed type checks 
from pandas array types

Review comment:
       Is there any test case for this?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

Reply via email to