zhengruifeng commented on PR #39462:
URL: https://github.com/apache/spark/pull/39462#issuecomment-1378183223
Duplicate field names are also not supported in PySpark with Arrow optimization enabled.
```
In [22]: query = "select STRUCT(1 v, 1 v)"
In [23]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
In [24]: spark.sql(query).toPandas()
Out[24]:
struct(1 AS v, 1 AS v)
0 (1, 1)
In [25]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
In [26]: spark.sql(query).toPandas()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204:
UserWarning: toPandas attempted Arrow optimization because
'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the
error below and can not continue. Note that
'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on
failures in the middle of computation.
Ran out of field metadata, likely malformed
warn(msg)
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[26], line 1
----> 1 spark.sql(query).toPandas()
File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in PandasConversionMixin.toPandas(self)
141 tmp_column_names = ["col_{}".format(i) for i in range(len(self.columns))]
142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
--> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
144 split_batches=self_destruct
145 )
146 if len(batches) > 0:
147 table = pyarrow.Table.from_batches(batches)
File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in PandasConversionMixin._collect_as_arrow(self, split_batches)
356 results.append(batch_or_indices)
357 else:
--> 358 results = list(batch_stream)
359 finally:
360 # Join serving thread and raise any exceptions from collectAsArrowToPython
361 jsocket_auth_server.getResult()
File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in ArrowCollectSerializer.load_stream(self, stream)
50 """
51 Load a stream of un-ordered Arrow RecordBatches, where the last iteration yields
52 a list of indices that can be used to put the RecordBatches in the correct order.
53 """
54 # load the batches
---> 55 for batch in self.serializer.load_stream(stream):
56 yield batch
58 # load the batch order indices or propagate any error that occurred in the JVM
File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in ArrowStreamSerializer.load_stream(self, stream)
95 import pyarrow as pa
97 reader = pa.ipc.open_stream(stream)
---> 98 for batch in reader:
99 yield batch
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638, in __iter__()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674, in pyarrow.lib.RecordBatchReader.read_next_batch()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
ArrowInvalid: Ran out of field metadata, likely malformed
```
The reader cannot read the Arrow batches. Since I verified [above](https://github.com/apache/spark/pull/39462#issuecomment-1375619011) that Arrow data with duplicate field names can be read by `pa.ipc.open_stream`, I suspect something is wrong in `ArrowConverters` on the server side.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]