zhengruifeng commented on PR #39462:
URL: https://github.com/apache/spark/pull/39462#issuecomment-1375619011

   Update on duplicate field names in struct types:
   
   ```
   In [4]: 
      ...: xs = pa.array([5, 6, 7], type=pa.int16())
      ...: ys = pa.array([False, True, True])
      ...: arr = pa.StructArray.from_arrays((xs, xs), names=('x', 'x'))
      ...: table = pa.Table.from_arrays([arr], names=["s"])
      ...: table
   Out[4]: 
   pyarrow.Table
   s: struct<x: int16, x: int16>
     child 0, x: int16
     child 1, x: int16
   ----
   s: [
     -- is_valid: all not null
     -- child 0 type: int16
   [5,6,7]
     -- child 1 type: int16
   [5,6,7]]
   
   In [5]: 
      ...: table.schema
   Out[5]: 
   s: struct<x: int16, x: int16>
     child 0, x: int16
     child 1, x: int16
   
   In [6]: 
      ...: sink = pa.BufferOutputStream()
      ...: 
      ...: with pa.ipc.new_stream(sink, table.schema) as writer:
      ...:     for b in table.to_batches():
      ...:         writer.write_batch(b)
      ...: 
      ...: 
      ...: table_bytes = sink.getvalue().to_pybytes()
      ...: 
      ...: 
      ...: batches = []
      ...: 
      ...: with pa.ipc.open_stream(table_bytes) as reader:
      ...:     for batch in reader:
      ...:         batches.append(batch)
      ...: 
      ...: table2 = pa.Table.from_batches(batches=batches)
   
   In [7]: table2
   Out[7]: 
   pyarrow.Table
   s: struct<x: int16, x: int16>
     child 0, x: int16
     child 1, x: int16
   ----
   s: [
     -- is_valid: all not null
     -- child 0 type: int16
   [5,6,7]
     -- child 1 type: int16
   [5,6,7]]
   
   In [8]: table == table2
   Out[8]: True
   
   In [18]: table2.to_pydict()
   ---------------------------------------------------------------------------
   ArrowInvalid                              Traceback (most recent call last)
    File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/scalar.pxi:692, in pyarrow.lib.StructScalar.__getitem__()

    File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

    File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

    ArrowInvalid: Multiple matches for FieldRef.Name(x) in struct<x: int16, x: int16>
   
   The above exception was the direct cause of the following exception:
   
   KeyError                                  Traceback (most recent call last)
    File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/scalar.pxi:705, in pyarrow.lib.StructScalar.as_py()

    File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/scalar.pxi:697, in pyarrow.lib.StructScalar.__getitem__()
   
   KeyError: 'x'
   
   During handling of the above exception, another exception occurred:
   
   ValueError                                Traceback (most recent call last)
   Cell In[18], line 1
   ----> 1 table2.to_pydict()
   
    File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/table.pxi:3940, in pyarrow.lib.Table.to_pydict()

    File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/table.pxi:1261, in pyarrow.lib.ChunkedArray.to_pylist()

    File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/array.pxi:1475, in pyarrow.lib.Array.to_pylist()

    File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/scalar.pxi:707, in pyarrow.lib.StructScalar.as_py()

    ValueError: Converting to Python dictionary is not supported when duplicate field names are present
   
   In [23]: col0 = table2.column(0)
   
   In [24]: col0.to_pydict()
   ---------------------------------------------------------------------------
   AttributeError                            Traceback (most recent call last)
   Cell In[24], line 1
   ----> 1 col0.to_pydict()
   
    AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'to_pydict'
   
   In [28]: col0.data
    <ipython-input-28-d4c1dc071e26>:1: FutureWarning: Calling .data on ChunkedArray is provided for compatibility after Column was removed, simply drop this attribute
      col0.data
   Out[28]: 
   <pyarrow.lib.ChunkedArray object at 0x10ab28ae0>
   [
     -- is_valid: all not null
     -- child 0 type: int16
       [
         5,
         6,
         7
       ]
     -- child 1 type: int16
       [
         5,
         6,
         7
       ]
   ]
   ```
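
   As a sanity check, lookup by position (rather than by name) still works with duplicate field names. A minimal sketch, using the same toy arrays as above:

   ```python
   import pyarrow as pa

   # Same toy data as in the transcript above.
   xs = pa.array([5, 6, 7], type=pa.int16())
   arr = pa.StructArray.from_arrays((xs, xs), names=("x", "x"))

   # Name-based lookup is ambiguous ("Multiple matches for FieldRef.Name(x)"),
   # but positional lookup is still well defined.
   first = arr.field(0)   # first "x" child
   second = arr.field(1)  # second "x" child
   print(first.to_pylist())   # [5, 6, 7]
   print(second.to_pylist())  # [5, 6, 7]
   ```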
   
   It seems that PyArrow itself supports duplicate field names in construction and IPC round trips, although `to_pydict()` rejects them. However, the following Spark Connect test still fails:
   
   ```
           self.assertEqual(
               cdf.select(CF.struct("a", "a")).collect(),
               sdf.select(SF.struct("a", "a")).collect(),
           )
   ```
   
   ```
   ======================================================================
    ERROR [0.628s]: test_collect_nested_type (pyspark.sql.tests.connect.test_connect_basic.SparkConnectBasicTests)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
      File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/tests/connect/test_connect_basic.py", line 2104, in test_collect_nested_type
        cdf.select(CF.struct("a", "a")).collect(),
      File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", line 1228, in collect
        table = self._session.client.to_table(query)
      File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/client.py", line 415, in to_table
        table, _ = self._execute_and_fetch(req)
      File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/client.py", line 589, in _execute_and_fetch
        for batch in reader:
      File "pyarrow/ipc.pxi", line 638, in __iter__
      File "pyarrow/ipc.pxi", line 674, in pyarrow.lib.RecordBatchReader.read_next_batch
      File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
    pyarrow.lib.ArrowInvalid: Ran out of field metadata, likely malformed
   
   ----------------------------------------------------------------------
   ```
   
   This needs further investigation.
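
   Independent of the IPC issue above, the `to_pydict()` `ValueError` on duplicate names can be worked around by rebuilding the struct with positionally deduplicated child names. A sketch, assuming the toy struct from the transcript (the `x_0`/`x_1` names are made up for illustration):

   ```python
   import pyarrow as pa

   # Toy duplicate-name struct, as in the transcript above.
   xs = pa.array([5, 6, 7], type=pa.int16())
   arr = pa.StructArray.from_arrays((xs, xs), names=("x", "x"))

   # as_py()/to_pydict() raise on duplicate names, so rebuild the struct
   # with positionally deduplicated names first.
   children = arr.flatten()
   renamed = pa.StructArray.from_arrays(
       children, names=[f"x_{i}" for i in range(len(children))]
   )
   print(renamed.to_pylist())
   # [{'x_0': 5, 'x_1': 5}, {'x_0': 6, 'x_1': 6}, {'x_0': 7, 'x_1': 7}]
   ```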
   

