[
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294747#comment-17294747
]
Dongjoon Hyun commented on SPARK-29721:
---------------------------------------
Thanks, [~yuryn]. That sounds like another type of issue. Could you file a new
issue for that? You can use the example and result on 3.1.1.
> Spark SQL reads unnecessary nested fields after using explode
> -------------------------------------------------------------
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Kai Kang
> Assignee: L. C. Hsieh
> Priority: Major
> Fix For: 3.1.0
>
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column
> pruning for nested structures. However, when explode() is called on a nested
> field, all columns of that nested structure are still fetched from the data
> source.
> We are working on a project to create a Parquet store for a big pre-joined
> table built from two tables that have a one-to-many relationship, and this is
> a blocking issue for us.
>
> The following code illustrates the issue.
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
> "items": [
> {"itemId": 1, "itemData": "a"},
> {"itemId": 2, "itemData": "b"}
> ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true)
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>