[jira] [Updated] (SPARK-42879) Spark SQL reads unnecessary nested fields

Jiri Humpolicek (Jira) Wed, 05 Feb 2025 22:17:06 -0800


     [ https://issues.apache.org/jira/browse/SPARK-42879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jiri Humpolicek updated SPARK-42879:
------------------------------------
    Affects Version/s: 3.5.4

> Spark SQL reads unnecessary nested fields
> -----------------------------------------
>
>                 Key: SPARK-42879
>                 URL: https://issues.apache.org/jira/browse/SPARK-42879
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.2, 4.0.0, 3.5.2, 3.5.4
>            Reporter: Jiri Humpolicek
>            Priority: Major
>
> When a query selects more than one field from a struct after an explode, Spark
> reads every field of that struct instead of only the selected ones.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>  "items": [
>    {"itemId": 1, "itemData1": "a", "itemData2": 11},
>    {"itemId": 2, "itemData1": "b", "itemData2": 22}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) Read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read
>     .select(explode('items).as('item))
>     .select($"item.itemId", $"item.itemData1")
>     .explain
> // ReadSchema: struct<items:array<struct<itemData1:string,itemData2:bigint,itemId:bigint>>>
> {code}
> The query uses only the *itemId* and *itemData1* fields of the struct inside
> the array, but the read schema contains the *itemData2* field as well.
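
To make the expected result concrete, here is a minimal, Spark-free sketch of
what pruning the example schema down to the two selected paths would look like.
It is a toy model only: the data types below and the `prune` and `render`
helpers are hypothetical names invented for this illustration, and they are not
Spark's API or its actual pruning rule.

```scala
// Toy schema model (hypothetical, deliberately not Spark's classes).
sealed trait DataType
case object StringType extends DataType
case object LongType extends DataType
final case class ArrayType(element: DataType) extends DataType
final case class StructType(fields: List[(String, DataType)]) extends DataType

object PruneSketch {
  // Keep only the parts of `dt` reachable through the requested field paths.
  // An empty path (Nil) means the whole value at this point is needed.
  def prune(dt: DataType, paths: Set[List[String]]): DataType = dt match {
    case _ if paths.contains(Nil) => dt
    // Arrays are transparent: paths address the fields of the element type.
    case ArrayType(elem) => ArrayType(prune(elem, paths))
    case StructType(fields) =>
      StructType(fields.collect {
        case (name, child) if paths.exists(_.headOption.contains(name)) =>
          val rest = paths.collect { case `name` :: tail => tail }
          name -> prune(child, rest)
      })
    case leaf => leaf
  }

  // Render a schema in the same style as the ReadSchema line of explain.
  def render(dt: DataType): String = dt match {
    case StringType => "string"
    case LongType => "bigint"
    case ArrayType(e) => s"array<${render(e)}>"
    case StructType(fs) =>
      fs.map { case (n, t) => s"$n:${render(t)}" }.mkString("struct<", ",", ">")
  }

  // Schema of the "persisted" table from the example above.
  val item: StructType = StructType(List(
    "itemData1" -> StringType,
    "itemData2" -> LongType,
    "itemId" -> LongType))
  val table: StructType = StructType(List("items" -> ArrayType(item)))

  // The two nested fields the query actually selects.
  val used: Set[List[String]] =
    Set(List("items", "itemId"), List("items", "itemData1"))

  def main(args: Array[String]): Unit =
    // prints struct<items:array<struct<itemData1:string,itemId:bigint>>>
    println(render(prune(table, used)))
}
```

Under these assumptions, the desired ReadSchema for the query would omit
*itemData2*, which is the difference from the schema that explain reports.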



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
