[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephen Caraher updated SPARK-31920:
------------------------------------
    Summary: Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__  (was: Failure in converting pandas DataFrames with columns via Arrow)

> Failure in converting pandas DataFrames with Arrow when columns implement
> __arrow_array__
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-31920
>                 URL: https://issues.apache.org/jira/browse/SPARK-31920
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5, 3.0.0, 3.1.0
>         Environment: pandas: 1.0.0 - 1.0.4
>                      pyarrow: 0.15.1 - 0.17.1
>            Reporter: Stephen Caraher
>            Priority: Major
>
> When calling {{createDataFrame}} on a pandas DataFrame in which any of the
> columns are backed by an array implementing {{\_\_arrow_array\_\_}}
> ({{StringArray}}, {{IntegerArray}}, etc.), the conversion will fail.
> With pyarrow >= 0.17.0, the following exception occurs:
> {noformat}
> Traceback (most recent call last):
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", line 470, in test_createDataFrame_from_integer_extension_dtype
>     df_from_integer_ext_dtype = self.spark.createDataFrame(pdf_integer_ext_dtype)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", line 601, in createDataFrame
>     data, schema, samplingRatio, verifySchema)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 277, in createDataFrame
>     return self._create_from_pandas_with_arrow(data, schema, timezone)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 435, in _create_from_pandas_with_arrow
>     jrdd = self._sc._serialize_to_jvm(arrow_data, ser, reader_func, create_RDD_server)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/context.py", line 570, in _serialize_to_jvm
>     serializer.dump_stream(data, tempFile)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 204, in dump_stream
>     super(ArrowStreamPandasSerializer, self).dump_stream(batches, stream)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 88, in dump_stream
>     for batch in iterator:
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 203, in <genexpr>
>     batches = (self._create_batch(series) for series in iterator)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 194, in _create_batch
>     arrs.append(create_array(s, t))
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 161, in create_array
>     array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
>   File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 215, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 104, in pyarrow.lib._handle_arrow_array_protocol
> ValueError: Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
> {noformat}
> With pyarrow < 0.17.0, the conversion will fail earlier in the process,
> during schema extraction:
> {noformat}
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", line 470, in test_createDataFrame_from_integer_extension_dtype
>     df_from_integer_ext_dtype = self.spark.createDataFrame(pdf_integer_ext_dtype)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", line 601, in createDataFrame
>     data, schema, samplingRatio, verifySchema)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 277, in createDataFrame
>     return self._create_from_pandas_with_arrow(data, schema, timezone)
>   File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 397, in _create_from_pandas_with_arrow
>     arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
>   File "pyarrow/types.pxi", line 1078, in pyarrow.lib.Schema.from_pandas
>   File "/Users/stephen/opt/miniconda3/envs/spark-dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 519, in dataframe_to_types
>     type_ = pa.lib._ndarray_to_arrow_type(values, type_)
>   File "pyarrow/array.pxi", line 53, in pyarrow.lib._ndarray_to_arrow_type
>   File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type
>   File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
> pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
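The pyarrow >= 0.17.0 failure follows from the __arrow_array__ protocol's dispatch rule: once a column's backing array implements the protocol, conversion is delegated entirely to that object, so the explicit {{mask=}} that Spark's {{create_array}} passes can no longer be honored. The sketch below illustrates that rule in pure Python; {{IntegerArrayStub}} and {{from_pandas}} are hypothetical stand-ins, not pyarrow's actual implementation (only the error message is taken from the traceback above).

```python
class IntegerArrayStub:
    """Stand-in for a pandas extension array such as pandas.arrays.IntegerArray."""

    def __init__(self, values, missing):
        self.values = values
        self.missing = missing  # set of positions that are NA

    def __arrow_array__(self, type=None):
        # The extension array converts itself and owns its own null mask.
        return [None if i in self.missing else v
                for i, v in enumerate(self.values)]


def from_pandas(obj, mask=None, type=None):
    """Illustrative dispatch, mimicking pyarrow.lib._handle_arrow_array_protocol."""
    if hasattr(obj, "__arrow_array__"):
        if mask is not None:
            # This is the conflict Spark hits: it always passes a mask.
            raise ValueError(
                "Cannot specify a mask or a size when passing an object "
                "that is converted with the __arrow_array__ protocol.")
        return obj.__arrow_array__(type=type)
    # NumPy-backed columns take the ordinary path, where mask= is applied.
    return [None if m else v
            for v, m in zip(obj, mask or [False] * len(obj))]


arr = IntegerArrayStub([1, 2, 3], missing={1})
print(from_pandas(arr))  # delegated to the array itself: [1, None, 3]
try:
    from_pandas(arr, mask=[False, True, False])
except ValueError as e:
    print("ValueError:", e)
```

Under this rule, a fix on the Spark side would be to skip the explicit mask for columns whose values implement the protocol and let the extension array handle its own nulls.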
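The pre-0.17 failure happens earlier, during schema extraction: {{pandas_compat.dataframe_to_types}} hands each column's dtype to a converter that only accepts numpy dtypes, and an extension dtype such as pandas' {{Int64Dtype}} is not a {{numpy.dtype}}, so it is rejected before __arrow_array__ is ever consulted. A hedged sketch of that check, where {{ndarray_to_arrow_type}} and {{Int64DtypeStub}} are illustrative stand-ins rather than pyarrow's or pandas' real code:

```python
import numpy as np


class Int64DtypeStub:
    """Stand-in for pandas.Int64Dtype: a dtype object that is NOT a numpy.dtype."""
    name = "Int64"


def ndarray_to_arrow_type(dtype):
    """Illustrative version of the check inside pyarrow.lib._ndarray_to_arrow_type."""
    if not isinstance(dtype, np.dtype):
        # Corresponds to the ArrowTypeError in the traceback above.
        raise TypeError("Did not pass numpy.dtype object")
    return "ok: " + dtype.name


print(ndarray_to_arrow_type(np.dtype("int64")))  # numpy-backed column: accepted
try:
    ndarray_to_arrow_type(Int64DtypeStub())      # extension dtype: rejected
except TypeError as e:
    print("TypeError:", e)
```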