ueshin opened a new pull request, #41190:
URL: https://github.com/apache/spark/pull/41190
### What changes were proposed in this pull request?
Support duplicated field names in `createDataFrame` with pandas DataFrame.
This works with Arrow, without Arrow, and in Spark Connect:
```py
>>> spark.createDataFrame(pdf, schema).show()
+--------+---------------+
|struct_0|       struct_1|
+--------+---------------+
|  {a, 1}|{2, 3, b, 4, c}|
|  {x, 6}|{7, 8, y, 9, z}|
+--------+---------------+
```
### Why are the changes needed?
If the schema contains duplicated field names, `createDataFrame` with a pandas DataFrame
falls back to the non-Arrow path when Arrow is enabled, and fails in Spark Connect.
```py
>>> import pandas as pd
>>> from pyspark.sql.types import *
>>>
>>> schema = (
...     StructType()
...     .add("struct_0", StructType().add("x", StringType()).add("x", IntegerType()))
...     .add(
...         "struct_1",
...         StructType()
...         .add("a", IntegerType())
...         .add("x", IntegerType())
...         .add("x", StringType())
...         .add("y", IntegerType())
...         .add("y", StringType()),
...     )
... )
>>>
>>> data = [Row(Row("a", 1), Row(2, 3, "b", 4, "c")), Row(Row("x", 6), Row(7, 8, "y", 9, "z"))]
>>> pdf = pd.DataFrame.from_records(data, columns=schema.names)
```
- Without Arrow:
Works fine.
```py
>>> spark.createDataFrame(pdf, schema).show()
+--------+---------------+
|struct_0|       struct_1|
+--------+---------------+
|  {a, 1}|{2, 3, b, 4, c}|
|  {x, 6}|{7, 8, y, 9, z}|
+--------+---------------+
```
- With Arrow:
Works, but only by falling back to the non-Arrow path.
```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.createDataFrame(pdf, schema).show()
/.../pyspark/sql/pandas/conversion.py:347: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  [DUPLICATED_FIELD_NAME_IN_ARROW_STRUCT] Duplicated field names in Arrow Struct are not allowed, got [x, x].
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
+--------+---------------+
|struct_0|       struct_1|
+--------+---------------+
|  {a, 1}|{2, 3, b, 4, c}|
|  {x, 6}|{7, 8, y, 9, z}|
+--------+---------------+
```
- Spark Connect:
Fails.
```py
>>> spark.createDataFrame(pdf, schema).show()
...
Traceback (most recent call last):
...
pyspark.errors.exceptions.connect.IllegalArgumentException: not all nodes and buffers were consumed. ...
```
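To make the failure condition concrete, here is a minimal pure-Python sketch of how duplicated field names in a struct can be detected and made unique by position. It is an illustration only and is not taken from this PR's diff: the helper names `duplicated_names` and `dedup_names` and the `_<index>` suffix scheme are assumptions chosen for the example.

```py
from collections import Counter
from itertools import count


def duplicated_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


def dedup_names(names):
    """Suffix each duplicated name with its occurrence index; keep unique names as-is.

    Hypothetical helper for illustration; not a PySpark API.
    """
    dups = set(duplicated_names(names))
    counters = {name: count() for name in dups}
    return [f"{name}_{next(counters[name])}" if name in dups else name for name in names]


# Field names of struct_0 and struct_1 from the schema above.
struct_0 = ["x", "x"]
struct_1 = ["a", "x", "x", "y", "y"]

print(duplicated_names(struct_1))  # ['x', 'y']
print(dedup_names(struct_0))       # ['x_0', 'x_1']
print(dedup_names(struct_1))       # ['a', 'x_0', 'x_1', 'y_0', 'y_1']
```

Note that this simple suffix scheme could still collide if a struct already contains a field literally named `x_0`; a real implementation would need to account for that.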
### Does this PR introduce _any_ user-facing change?
Yes. Duplicated field names now work in `createDataFrame` with a pandas DataFrame.
### How was this patch tested?
Added the related test.