zhengruifeng commented on PR #39462:
URL: https://github.com/apache/spark/pull/39462#issuecomment-1378183223
Duplicate field names are also not supported in PySpark with Arrow optimization enabled.
```
In [22]: query = "select STRUCT(1 v, 1 v)"
In [23]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
In [24]: spark.sql(query).toPandas()
Out[24]:
struct(1 AS v, 1 AS v)
0 (1, 1)
In [25]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
In [26]: spark.sql(query).toPandas()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204:
UserWarning: toPandas attempted Arrow optimization because
'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the
error below and can not continue. Note that
'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on
failures in the middle of computation.
Ran out of field metadata, likely malformed
warn(msg)
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[26], line 1
----> 1 spark.sql(query).toPandas()
File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in PandasConversionMixin.toPandas(self)
141 tmp_column_names = ["col_{}".format(i) for i in range(len(self.columns))]
142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
--> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
144 split_batches=self_destruct
145 )
146 if len(batches) > 0:
147 table = pyarrow.Table.from_batches(batches)
File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in PandasConversionMixin._collect_as_arrow(self, split_batches)
356 results.append(batch_or_indices)
357 else:
--> 358 results = list(batch_stream)
359 finally:
360 # Join serving thread and raise any exceptions from collectAsArrowToPython
361 jsocket_auth_server.getResult()
File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in ArrowCollectSerializer.load_stream(self, stream)
50 """
51 Load a stream of un-ordered Arrow RecordBatches, where the last iteration yields
52 a list of indices that can be used to put the RecordBatches in the correct order.
53 """
54 # load the batches
---> 55 for batch in self.serializer.load_stream(stream):
56 yield batch
58 # load the batch order indices or propagate any error that occurred in the JVM
File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in ArrowStreamSerializer.load_stream(self, stream)
95 import pyarrow as pa
97 reader = pa.ipc.open_stream(stream)
---> 98 for batch in reader:
99 yield batch
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638, in __iter__()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674, in pyarrow.lib.RecordBatchReader.read_next_batch()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
ArrowInvalid: Ran out of field metadata, likely malformed
```
The reader cannot read the Arrow batches. Since I verified [above](https://github.com/apache/spark/pull/39462#issuecomment-1375619011) that Arrow data with duplicate field names can be read by `pa.ipc.open_stream`, I suspect something is wrong in `ArrowConverters` on the server side.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]