[
https://issues.apache.org/jira/browse/SPARK-42879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jiri Humpolicek updated SPARK-42879:
------------------------------------
Affects Version/s: 3.5.4
> Spark SQL reads unnecessary nested fields
> -----------------------------------------
>
> Key: SPARK-42879
> URL: https://issues.apache.org/jira/browse/SPARK-42879
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.2, 4.0.0, 3.5.2, 3.5.4
> Reporter: Jiri Humpolicek
> Priority: Major
>
> When a query uses more than one field of a struct after explode, every field
> of that struct is read, even with nested schema pruning enabled.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
> "items": [
> {"itemId": 1, "itemData1": "a", "itemData2": 11},
> {"itemId": 2, "itemData1": "b", "itemData2": 22}
> ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) Read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read
>   .select(explode($"items").as("item"))
>   .select($"item.itemId", $"item.itemData1")
>   .explain
> // ReadSchema:
> // struct<items:array<struct<itemData1:string,itemData2:bigint,itemId:bigint>>>
> {code}
> We use only the *itemId* and *itemData1* fields of the struct in the array,
> but the read schema contains the *itemData2* field as well.
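>
> A minimal sketch for checking the read schema programmatically instead of
> reading it off the explain output. It assumes a spark-shell session (so
> {{spark}} and {{spark.implicits._}} are in scope), that the table from step 1
> exists, and that the table is read through the V1 file source, so the scan
> shows up as {{FileSourceScanExec}} in the physical plan. The helper name
> {{readSchemas}} is hypothetical.
> {code:scala}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.execution.FileSourceScanExec
> import org.apache.spark.sql.functions.explode
>
> // Hypothetical helper: the schema each file scan requests from the reader.
> def readSchemas(df: DataFrame): Seq[String] =
>   df.queryExecution.sparkPlan.collect {
>     case scan: FileSourceScanExec => scan.requiredSchema.catalogString
>   }
>
> val exploded = spark.table("persisted").select(explode($"items").as("item"))
>
> // One nested field selected, for comparison.
> val oneField = readSchemas(exploded.select($"item.itemId"))
> // Two nested fields selected, as in the report above.
> val twoFields = readSchemas(exploded.select($"item.itemId", $"item.itemData1"))
>
> oneField.foreach(println)
> twoFields.foreach(println)
> {code}
> If the second schema lists *itemData2* while the first does not, that would
> suggest the extra read is tied to selecting more than one field.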
--
This message was sent by Atlassian Jira
(v8.20.10#820010)