zhengruifeng commented on PR #39462:
URL: https://github.com/apache/spark/pull/39462#issuecomment-1375619011
update on duplicated field names in struct type:
```
In [4]:
...: xs = pa.array([5, 6, 7], type=pa.int16())
...: ys = pa.array([False, True, True])
...: arr = pa.StructArray.from_arrays((xs, xs), names=('x', 'x'))
...: table = pa.Table.from_arrays([arr], names=["s"])
...: table
Out[4]:
pyarrow.Table
s: struct<x: int16, x: int16>
child 0, x: int16
child 1, x: int16
----
s: [
-- is_valid: all not null
-- child 0 type: int16
[5,6,7]
-- child 1 type: int16
[5,6,7]]
In [5]:
...: table.schema
Out[5]:
s: struct<x: int16, x: int16>
child 0, x: int16
child 1, x: int16
In [6]:
...: sink = pa.BufferOutputStream()
...:
...: with pa.ipc.new_stream(sink, table.schema) as writer:
...: for b in table.to_batches():
...: writer.write_batch(b)
...:
...:
...: table_bytes = sink.getvalue().to_pybytes()
...:
...:
...: batches = []
...:
...: with pa.ipc.open_stream(table_bytes) as reader:
...: for batch in reader:
...: batches.append(batch)
...:
...: table2 = pa.Table.from_batches(batches=batches)
In [7]: table2
Out[7]:
pyarrow.Table
s: struct<x: int16, x: int16>
child 0, x: int16
child 1, x: int16
----
s: [
-- is_valid: all not null
-- child 0 type: int16
[5,6,7]
-- child 1 type: int16
[5,6,7]]
In [8]: table == table2
Out[8]: True
In [18]: table2.to_pydict()
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/scalar.pxi:692, in pyarrow.lib.StructScalar.__getitem__()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
ArrowInvalid: Multiple matches for FieldRef.Name(x) in struct<x: int16, x: int16>
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/scalar.pxi:705, in pyarrow.lib.StructScalar.as_py()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/scalar.pxi:697, in pyarrow.lib.StructScalar.__getitem__()
KeyError: 'x'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[18], line 1
----> 1 table2.to_pydict()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/table.pxi:3940, in pyarrow.lib.Table.to_pydict()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/table.pxi:1261, in pyarrow.lib.ChunkedArray.to_pylist()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/array.pxi:1475, in pyarrow.lib.Array.to_pylist()
File ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/scalar.pxi:707, in pyarrow.lib.StructScalar.as_py()
ValueError: Converting to Python dictionary is not supported when duplicate field names are present
In [23]: col0 = table2.column(0)
In [24]: col0.to_pydict()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[24], line 1
----> 1 col0.to_pydict()
AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'to_pydict'
In [28]: col0.data
<ipython-input-28-d4c1dc071e26>:1: FutureWarning: Calling .data on ChunkedArray is provided for compatibility after Column was removed, simply drop this attribute
col0.data
Out[28]:
<pyarrow.lib.ChunkedArray object at 0x10ab28ae0>
[
-- is_valid: all not null
-- child 0 type: int16
[
5,
6,
7
]
-- child 1 type: int16
[
5,
6,
7
]
]
```
It seems that PyArrow itself supports duplicate field names in struct types: the IPC round-trip above succeeds, and only the name-based `to_pydict` conversion fails. However, the following Spark Connect test still fails:
```
self.assertEqual(
cdf.select(CF.struct("a", "a")).collect(),
sdf.select(SF.struct("a", "a")).collect(),
)
```
```
======================================================================
ERROR [0.628s]: test_collect_nested_type
(pyspark.sql.tests.connect.test_connect_basic.SparkConnectBasicTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/tests/connect/test_connect_basic.py", line 2104, in test_collect_nested_type
  cdf.select(CF.struct("a", "a")).collect(),
File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py", line 1228, in collect
  table = self._session.client.to_table(query)
File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/client.py", line 415, in to_table
  table, _ = self._execute_and_fetch(req)
File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/client.py", line 589, in _execute_and_fetch
  for batch in reader:
File "pyarrow/ipc.pxi", line 638, in __iter__
File "pyarrow/ipc.pxi", line 674, in pyarrow.lib.RecordBatchReader.read_next_batch
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Ran out of field metadata, likely malformed
----------------------------------------------------------------------
```
This needs further investigation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]