ueshin opened a new pull request, #40988:
URL: https://github.com/apache/spark/pull/40988
### What changes were proposed in this pull request?
Adds a config that controls how struct types are handled during pandas conversion.
- `spark.sql.execution.pandas.structHandlingMode` (default: `"legacy"`)
The conversion mode of struct types when creating a pandas DataFrame.
#### When `"legacy"`, the behavior is the same as before, except that PySpark with Arrow optimization and Spark Connect will raise a more readable exception when there are duplicated nested field names.
```py
>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
Traceback (most recent call last):
...
pyspark.errors.exceptions.connect.UnsupportedOperationException:
[DUPLICATED_FIELD_NAME_IN_ARROW_STRUCT] Duplicated field names in Arrow Struct
are not allowed, got [a, a].
```
#### When `"row"`, struct values are converted to `Row` objects, regardless of Arrow optimization.
```py
>>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row')
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
x y
0 1 (1, 2)
>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
x y
0 1 (1, 2)
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
x y
0 1 (1, 2)
>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
x y
0 1 (1, 2)
```
#### When `"dict"`, struct values are converted to `dict`, using suffixed key names (e.g., `a_0`, `a_1`) if there are duplicated nested field names, regardless of Arrow optimization.
```py
>>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'dict')
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
x y
0 1 {'a': 1, 'b': 2}
>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
x y
0 1 {'a_0': 1, 'a_1': 2}
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
x y
0 1 {'a': 1, 'b': 2}
>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
x y
0 1 {'a_0': 1, 'a_1': 2}
```
### Why are the changes needed?
Currently there are three different behaviors when calling `df.toPandas()` on nested struct types:
- vanilla PySpark with Arrow optimization disabled
```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
x y
0 1 (1, 2)
```
`Row` objects are used for struct types. Duplicated field names are allowed:
```py
>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
x y
0 1 (1, 2)
```
- vanilla PySpark with Arrow optimization enabled
```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
x y
0 1 {'a': 1, 'b': 2}
```
`dict` is used for struct types. An exception is raised when there are duplicated nested field names:
```py
>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Ran out of field metadata, likely malformed
```
- Spark Connect
```py
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
x y
0 1 {'a': 1, 'b': 2}
```
`dict` is used for struct types. If there are duplicated nested field names, the duplicated keys are suffixed:
```py
>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
x y
0 1 {'a_0': 1, 'a_1': 2}
```
### Does this PR introduce _any_ user-facing change?
Yes, users will be able to configure the behavior via the new config.
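As a usage sketch (assuming the config name introduced by this PR), the mode could also be set once when building the session instead of via `spark.conf.set`:

```py
from pyspark.sql import SparkSession

# Hypothetical session setup; the config key is the one proposed in this PR.
spark = (
    SparkSession.builder
    .config("spark.sql.execution.pandas.structHandlingMode", "dict")
    .getOrCreate()
)
```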
### How was this patch tested?
Modified the related tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]