[GitHub] [spark] ueshin opened a new pull request, #41190: [SPARK-43528][SQL][PYTHON] Support duplicated field names in createDataFrame with pandas DataFrame

via GitHub Tue, 16 May 2023 15:06:32 -0700


ueshin opened a new pull request, #41190:
URL: https://github.com/apache/spark/pull/41190


   ### What changes were proposed in this pull request?
   
   Support duplicated field names in `createDataFrame` with pandas DataFrame.
   
   This works with Arrow, without Arrow, and in Spark Connect:
   
   ```py
   >>> spark.createDataFrame(pdf, schema).show()
   +--------+---------------+
   |struct_0|       struct_1|
   +--------+---------------+
   |  {a, 1}|{2, 3, b, 4, c}|
   |  {x, 6}|{7, 8, y, 9, z}|
   +--------+---------------+
   ```
   
   ### Why are the changes needed?
   
   If there are duplicated field names, `createDataFrame` with a pandas DataFrame falls back to the non-Arrow path, or fails in Spark Connect.
   
   ```py
   >>> import pandas as pd
   >>> from pyspark.sql.types import *
   >>>
   >>> schema = (
   ...     StructType()
   ...     .add("struct_0", StructType().add("x", StringType()).add("x", IntegerType()))
   ...     .add(
   ...         "struct_1",
   ...         StructType()
   ...         .add("a", IntegerType())
   ...         .add("x", IntegerType())
   ...         .add("x", StringType())
   ...         .add("y", IntegerType())
   ...         .add("y", StringType()),
   ...     )
   ... )
   >>>
   >>> data = [Row(Row("a", 1), Row(2, 3, "b", 4, "c")), Row(Row("x", 6), Row(7, 8, "y", 9, "z"))]
   >>> pdf = pd.DataFrame.from_records(data, columns=schema.names)
   ```
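
To see why duplicated names are awkward for any name-keyed conversion, here is a small plain-Python illustration. It is not code from this PR, and `field_names` and `values` are made-up stand-ins for the fields of `struct_0` and its first row. Keying values by field name silently drops one of them, while pairing by position keeps both.

```py
# Illustration only (not from the PR): stand-ins for struct_0's fields and first row.
field_names = ["x", "x"]
values = ("a", 1)

# Keying by name: the second "x" overwrites the first, so "a" is lost.
by_name = dict(zip(field_names, values))

# Pairing by position keeps every value, even when names repeat.
by_position = list(zip(field_names, values))

print(by_name)      # {'x': 1}
print(by_position)  # [('x', 'a'), ('x', 1)]
```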
   
   - Without Arrow:
   
   Works fine.
   
   ```py
   >>> spark.createDataFrame(pdf, schema).show()
   +--------+---------------+
   |struct_0|       struct_1|
   +--------+---------------+
   |  {a, 1}|{2, 3, b, 4, c}|
   |  {x, 6}|{7, 8, y, 9, z}|
   +--------+---------------+
   ```
   
   - With Arrow:
   
   Works, but only by falling back to the non-Arrow path.
   
   ```py
   >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
   >>> spark.createDataFrame(pdf, schema).show()
   /.../pyspark/sql/pandas/conversion.py:347: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
     [DUPLICATED_FIELD_NAME_IN_ARROW_STRUCT] Duplicated field names in Arrow Struct are not allowed, got [x, x].
   Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
     warn(msg)
   +--------+---------------+
   |struct_0|       struct_1|
   +--------+---------------+
   |  {a, 1}|{2, 3, b, 4, c}|
   |  {x, 6}|{7, 8, y, 9, z}|
   +--------+---------------+
   ```
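
The warning above names the offending fields (`[x, x]`). As a rough sketch of how such a check could look, the following plain-Python helper walks a nested schema and reports every struct with repeated field names. The list-of-pairs schema encoding and the `find_duplicated_fields` helper are hypothetical and chosen for illustration; this is not PySpark's actual implementation.

```py
from collections import Counter

# Hypothetical encoding: a struct is a list of (field_name, field_type) pairs,
# where field_type is either a type-name string or a nested struct (a list).
schema = [
    ("struct_0", [("x", "string"), ("x", "int")]),
    ("struct_1", [("a", "int"), ("x", "int"), ("x", "string"),
                  ("y", "int"), ("y", "string")]),
]


def find_duplicated_fields(struct, path=""):
    """Return {dotted path: sorted duplicated names} for structs with repeats.

    Duplicates among the top-level fields are reported under the empty path.
    """
    found = {}
    counts = Counter(name for name, _ in struct)
    dups = sorted(name for name, n in counts.items() if n > 1)
    if dups:
        found[path] = dups
    for name, field_type in struct:
        if isinstance(field_type, list):  # recurse into nested structs
            child_path = f"{path}.{name}" if path else name
            found.update(find_duplicated_fields(field_type, child_path))
    return found


print(find_duplicated_fields(schema))
# {'struct_0': ['x'], 'struct_1': ['x', 'y']}
```

With the schema from this description, both `struct_0` and `struct_1` are flagged, although the warning above only mentions the `[x, x]` case.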
   
   - Spark Connect:
   
   Fails.
   
   ```py
   >>> spark.createDataFrame(pdf, schema).show()
   ...
   Traceback (most recent call last):
   ...
   pyspark.errors.exceptions.connect.IllegalArgumentException: not all nodes and buffers were consumed. ...
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. `createDataFrame` with a pandas DataFrame now accepts duplicated field names with Arrow, without Arrow, and in Spark Connect.
   
   ### How was this patch tested?
   
   Added the related test.

