itholic opened a new pull request #33214:
URL: https://github.com/apache/spark/pull/33214
### What changes were proposed in this pull request?
Currently, nested structs are always inferred as `MapType`.
This causes a problem because the map's value type is inferred from
the first field of the struct only, as shown below:
```python
data = [{"inside_struct": {"payment": 100.5, "name": "Lee"}}]
df = spark.createDataFrame(data)
df.show()
+--------------------+
|       inside_struct|
+--------------------+
|{name -> null, pa...|
+--------------------+
```
The value of `"name"` became `null`, but it should have been `"Lee"`.
In this case, we need to be able to infer the schema as a `StructType`
instead of a `MapType`.
Therefore, this PR proposes adding a new configuration,
`spark.sql.pyspark.inferNestedStructByMap`, to control which type is used
when inferring nested structs.
- When `spark.sql.pyspark.inferNestedStructByMap` is `true` (the default),
nested structs are inferred as `MapType`.
- When `spark.sql.pyspark.inferNestedStructByMap` is `false`,
nested structs are inferred as `StructType`.
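As a rough illustration of how the proposed configuration might be used, here is a minimal sketch. It assumes a Spark build that includes this PR and that the setting can be changed at runtime through `spark.conf.set`. The helper name `infer_nested` and its `by_map` parameter are hypothetical and exist only for this example.

```python
# Config name as proposed in this PR.
CONF_KEY = "spark.sql.pyspark.inferNestedStructByMap"


def infer_nested(spark, data, by_map):
    """Create a DataFrame with the nested-struct inference mode set explicitly.

    `spark` is an existing SparkSession (as in the example above).
    `by_map` is a hypothetical flag that mirrors the config value:
    True keeps the MapType behavior, False requests StructType inference.
    """
    # Session-level SQL configuration values are passed as strings.
    spark.conf.set(CONF_KEY, "true" if by_map else "false")
    return spark.createDataFrame(data)


# Usage, given a running session named `spark`:
#   data = [{"inside_struct": {"payment": 100.5, "name": "Lee"}}]
#   infer_nested(spark, data, by_map=False).printSchema()
```

Keeping the toggle in one helper makes it easy to compare both modes on the same input.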
### Why are the changes needed?
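Because always inferring nested structs as `MapType` doesn't work
properly when the fields of a struct have different value types: values
that don't match the type of the first field become `null`.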
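The difference between the two strategies can be modeled without Spark. The sketch below is a simplified pure-Python approximation, not PySpark's actual inference code. All function names are hypothetical, and the type names cover only a small illustrative subset.

```python
def type_name(value):
    """Map a Python value to a Spark-like type name (illustrative subset)."""
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def infer_as_map(d):
    """Map-style inference: one key type and one value type, from the first entry."""
    first_key, first_value = next(iter(d.items()))
    return f"map<{type_name(first_key)},{type_name(first_value)}>"


def infer_as_struct(d):
    """Struct-style inference: every field keeps its own type."""
    fields = ",".join(f"{k}:{type_name(v)}" for k, v in d.items())
    return f"struct<{fields}>"


def coerce_to_map(d):
    """Model the data loss: values that don't match the map's value type become None."""
    value_type = type(next(iter(d.values())))
    return {k: (v if isinstance(v, value_type) else None) for k, v in d.items()}


row = {"payment": 100.5, "name": "Lee"}
print(infer_as_map(row))     # map<string,double>
print(infer_as_struct(row))  # struct<payment:double,name:string>
print(coerce_to_map(row))    # {'payment': 100.5, 'name': None}
```

In this model, `"Lee"` is lost because a map has a single value type, while the struct form keeps a separate type per field.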
### Does this PR introduce _any_ user-facing change?
Yes. A new configuration, `spark.sql.pyspark.inferNestedStructByMap`, is added.
### How was this patch tested?
Added a unit test.