ueshin opened a new pull request, #40829:
URL: https://github.com/apache/spark/pull/40829
### What changes were proposed in this pull request?
Uses deduplicated field names when creating Arrow `RecordBatch`.
The result pandas DataFrame will contain `dict` with suffix `_0`, `_1`, etc.
if there are duplicated field names.
For example:
```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("values (1, struct(1 as a, 2 as a, 3 as b)) as t(x, y)").toPandas()
   x                             y
0  1  {'a_0': 1, 'a_1': 2, 'b': 3}
```
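The renaming rule above can be sketched as a small helper: only names that occur more than once in the struct get positional suffixes, while unique names are left untouched. This is a minimal illustration of the behavior described in this PR, not the actual implementation; the function name is hypothetical.

```python
from collections import Counter


def dedup_field_names(names):
    """Append ``_0``, ``_1``, ... to field names that appear more than once.

    Unique names are returned unchanged, matching the behavior shown in the
    example output above (hypothetical helper, for illustration only).
    """
    counts = Counter(names)  # how many times each name occurs
    seen = {}                # running index per duplicated name
    result = []
    for name in names:
        if counts[name] > 1:
            i = seen.get(name, 0)
            seen[name] = i + 1
            result.append(f"{name}_{i}")
        else:
            result.append(name)
    return result


print(dedup_field_names(["a", "a", "b"]))  # ['a_0', 'a_1', 'b']
```

With the struct fields `a`, `a`, `b` from the example, this yields `a_0`, `a_1`, `b`, which is why the resulting `dict` keys are `'a_0'`, `'a_1'`, and `'b'`.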
### Why are the changes needed?
Currently `df.toPandas()` with Arrow enabled fails when there are duplicated
field names.
```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("values (1, struct(1 as a, 2 as a, 3 as b)) as t(x, y)").toPandas()
Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Ran out of field metadata, likely malformed
```
### Does this PR introduce _any_ user-facing change?
Yes. `df.toPandas()` with Arrow enabled, which currently fails when there are
duplicated field names, will now work.
### How was this patch tested?
Added a test.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]