[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephen Caraher updated SPARK-31920:
------------------------------------
    Summary: Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__  (was: Failure in converting pandas DataFrames with columns via Arrow)

> Failure in converting pandas DataFrames with Arrow when columns implement
> __arrow_array__
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-31920
>                 URL: https://issues.apache.org/jira/browse/SPARK-31920
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5, 3.0.0, 3.1.0
>         Environment: pandas: 1.0.0 - 1.0.4
>                      pyarrow: 0.15.1 - 0.17.1
>            Reporter: Stephen Caraher
>            Priority: Major
>
> When calling {{createDataFrame}} on a pandas DataFrame in which any of the
> columns are backed by an array implementing {{\_\_arrow_array\_\_}}
> ({{StringArray}}, {{IntegerArray}}, etc.), the conversion will fail.
> With pyarrow >= 0.17.0, the following exception occurs:
> {noformat}
> Traceback (most recent call last):
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", line 470, in test_createDataFrame_from_integer_extension_dtype
>     df_from_integer_ext_dtype = self.spark.createDataFrame(pdf_integer_ext_dtype)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", line 601, in createDataFrame
>     data, schema, samplingRatio, verifySchema)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 277, in createDataFrame
>     return self._create_from_pandas_with_arrow(data, schema, timezone)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 435, in _create_from_pandas_with_arrow
>     jrdd = self._sc._serialize_to_jvm(arrow_data, ser, reader_func, create_RDD_server)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/context.py", line 570, in _serialize_to_jvm
>     serializer.dump_stream(data, tempFile)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 204, in dump_stream
>     super(ArrowStreamPandasSerializer, self).dump_stream(batches, stream)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 88, in dump_stream
>     for batch in iterator:
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 203, in <genexpr>
>     batches = (self._create_batch(series) for series in iterator)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 194, in _create_batch
>     arrs.append(create_array(s, t))
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 161, in create_array
>     array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
>   File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 215, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 104, in pyarrow.lib._handle_arrow_array_protocol
> ValueError: Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
> {noformat}
> With pyarrow < 0.17.0, the conversion will fail earlier in the process,
> during schema extraction:
> {noformat}
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", line 470, in test_createDataFrame_from_integer_extension_dtype
>     df_from_integer_ext_dtype = self.spark.createDataFrame(pdf_integer_ext_dtype)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", line 601, in createDataFrame
>     data, schema, samplingRatio, verifySchema)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 277, in createDataFrame
>     return self._create_from_pandas_with_arrow(data, schema, timezone)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 397, in _create_from_pandas_with_arrow
>     arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
>   File "pyarrow/types.pxi", line 1078, in pyarrow.lib.Schema.from_pandas
>   File "/Users/stephen/opt/miniconda3/envs/spark-dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 519, in dataframe_to_types
>     type_ = pa.lib._ndarray_to_arrow_type(values, type_)
>   File "pyarrow/array.pxi", line 53, in pyarrow.lib._ndarray_to_arrow_type
>   File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type
>   File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
> pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
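The pyarrow >= 0.17.0 failure follows from the __arrow_array__ protocol's dispatch rule: once a column's backing array implements the protocol, conversion is delegated entirely to that object, so the explicit {{mask=}} that Spark's {{create_array}} passes can no longer be honored. The sketch below illustrates that rule in pure Python; {{IntegerArrayStub}} and {{from_pandas}} are hypothetical stand-ins, not pyarrow's actual implementation (only the error message is taken from the traceback above).

```python
class IntegerArrayStub:
    """Stand-in for a pandas extension array such as pandas.arrays.IntegerArray."""

    def __init__(self, values, missing):
        self.values = values
        self.missing = missing  # set of positions that are NA

    def __arrow_array__(self, type=None):
        # The extension array converts itself and owns its own null mask.
        return [None if i in self.missing else v
                for i, v in enumerate(self.values)]


def from_pandas(obj, mask=None, type=None):
    """Illustrative dispatch, mimicking pyarrow.lib._handle_arrow_array_protocol."""
    if hasattr(obj, "__arrow_array__"):
        if mask is not None:
            # This is the conflict Spark hits: it always passes a mask.
            raise ValueError(
                "Cannot specify a mask or a size when passing an object "
                "that is converted with the __arrow_array__ protocol.")
        return obj.__arrow_array__(type=type)
    # NumPy-backed columns take the ordinary path, where mask= is applied.
    return [None if m else v
            for v, m in zip(obj, mask or [False] * len(obj))]


arr = IntegerArrayStub([1, 2, 3], missing={1})
print(from_pandas(arr))  # delegated to the array itself: [1, None, 3]
try:
    from_pandas(arr, mask=[False, True, False])
except ValueError as e:
    print("ValueError:", e)
```

Under this rule, a fix on the Spark side would be to skip the explicit mask for columns whose values implement the protocol and let the extension array handle its own nulls.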
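The pre-0.17 failure happens earlier, during schema extraction: {{pandas_compat.dataframe_to_types}} hands each column's dtype to a converter that only accepts numpy dtypes, and an extension dtype such as pandas' {{Int64Dtype}} is not a {{numpy.dtype}}, so it is rejected before __arrow_array__ is ever consulted. A hedged sketch of that check, where {{ndarray_to_arrow_type}} and {{Int64DtypeStub}} are illustrative stand-ins rather than pyarrow's or pandas' real code:

```python
import numpy as np


class Int64DtypeStub:
    """Stand-in for pandas.Int64Dtype: a dtype object that is NOT a numpy.dtype."""
    name = "Int64"


def ndarray_to_arrow_type(dtype):
    """Illustrative version of the check inside pyarrow.lib._ndarray_to_arrow_type."""
    if not isinstance(dtype, np.dtype):
        # Corresponds to the ArrowTypeError in the traceback above.
        raise TypeError("Did not pass numpy.dtype object")
    return "ok: " + dtype.name


print(ndarray_to_arrow_type(np.dtype("int64")))  # numpy-backed column: accepted
try:
    ndarray_to_arrow_type(Int64DtypeStub())      # extension dtype: rejected
except TypeError as e:
    print("TypeError:", e)
```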