[GitHub] [spark] itholic opened a new pull request #33214: [SPARK-35929][PYTHON] Schema inference of nested structs defaults to map

GitBox Mon, 05 Jul 2021 01:15:15 -0700


itholic opened a new pull request #33214:
URL: https://github.com/apache/spark/pull/33214



   ### What changes were proposed in this pull request?
   
   Currently, PySpark always infers a nested struct as `MapType`.
   
   This causes a problem because the map's value type is inferred from the 
first field of the struct only, as shown below:
   
   ```python
   # `spark` is an active SparkSession (for example, in the PySpark shell).
   data = [{"inside_struct": {"payment": 100.5, "name": "Lee"}}]
   df = spark.createDataFrame(data)
   df.show()
   # +--------------------+
   # |       inside_struct|
   # +--------------------+
   # |{name -> null, pa...|
   # +--------------------+
   ```
   
   The value of `name` became `null`, but it should have been `"Lee"`.
   
   In this case, we need to be able to infer the schema with a `StructType` 
instead of a `MapType`.
   
   Therefore, this PR proposes adding a new configuration, 
`spark.sql.pyspark.inferNestedStructByMap`, to control which type is used when 
inferring nested structs.
   - When `spark.sql.pyspark.inferNestedStructByMap` is `true` (the default), 
nested structs are inferred as `MapType`.
   - When `spark.sql.pyspark.inferNestedStructByMap` is `false`, nested 
structs are inferred as `StructType`.
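
The difference between the two modes can be sketched in plain Python. This is a simplified, hypothetical model and not PySpark's actual inference code: the function `infer_type` and its `infer_by_map` parameter are names invented here for illustration, with `infer_by_map` standing in for the proposed configuration.

```python
# Hypothetical, simplified model of the two inference modes.
# This is NOT PySpark's implementation; names are illustrative only.

def infer_type(value, infer_by_map=True):
    """Return a readable type string for a Python value."""
    if isinstance(value, dict):
        if infer_by_map:
            if not value:
                raise ValueError("cannot infer a map type from an empty dict")
            # Map mode: key and value types come from the first entry only.
            key, val = next(iter(value.items()))
            key_type = infer_type(key, infer_by_map)
            val_type = infer_type(val, infer_by_map)
            return f"map<{key_type},{val_type}>"
        # Struct mode: every field keeps its own type.
        fields = ",".join(
            f"{name}:{infer_type(val, infer_by_map)}"
            for name, val in value.items()
        )
        return f"struct<{fields}>"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported type: {type(value).__name__}")


nested = {"payment": 100.5, "name": "Lee"}
as_map = infer_type(nested, infer_by_map=True)
as_struct = infer_type(nested, infer_by_map=False)
print(as_map)     # map<string,double>
print(as_struct)  # struct<payment:double,name:string>
```

In map mode the single value type `double` has no room for the string `"Lee"`, which matches the `null` seen in the example above. In struct mode each field carries its own type, so both values can be kept.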
   
   
   ### Why are the changes needed?
   
   Because always inferring nested structs as `MapType` does not work 
properly in some cases, such as a struct whose fields have different value 
types.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. A new configuration, `spark.sql.pyspark.inferNestedStructByMap`, is added.
   
   ### How was this patch tested?
   
   Added a unit test.
   
   

