jbewing opened a new issue, #14680:
URL: https://github.com/apache/iceberg/issues/14680
### Apache Iceberg version
1.10.0 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
#### What
Currently, when creating a view via Apache Spark (I'm running a fork of
3.5), I've observed that view creation via
```java
SparkSession spark = ...;
spark.sql("CREATE VIEW ... ")
```
is slow, and the slowness generally scales with the size of the underlying
table. When creating views over a larger table (hundreds of TBs), creating
even a modest number of views (say a couple thousand) takes roughly 12 hours
and requires a moderately sized Spark cluster (~100 CPUs).
This seems excessive. Based on logs and flame graphs, the overhead comes from
Spark performing unnecessary optimization of the view body. For the
following unit test:
```java
@TestTemplate
public void createViewWithGroupByOrdinal() throws NoSuchTableException {
insertRows(3);
insertRows(2);
String viewName = viewName("createViewWithGroupByOrdinal");
sql("CREATE VIEW %s AS SELECT id, count(1) FROM %s GROUP BY 1",
viewName, tableName);
assertThat(sql("SELECT * FROM %s", viewName))
.hasSize(3)
.containsExactlyInAnyOrder(row(1, 2L), row(2, 2L), row(3, 1L));
}
```
You can observe logs such as:
```
[Test worker] INFO
org.apache.spark.sql.execution.datasources.v2.AppendDataExec - Data source
write support IcebergBatchWrite(table=spark_with_views.default.table,
format=PARQUET) committed.
[Test worker] INFO org.apache.iceberg.BaseMetastoreTableOperations -
Refreshing table metadata from new version:
/default/table/metadata/00002-b53fe6fa-4245-4fb3-abea-81a98c8cc849.metadata.json
[Test worker] INFO org.apache.iceberg.BaseMetastoreCatalog - Table loaded by
catalog: spark_with_views.default.table
[Test worker] INFO org.apache.iceberg.spark.source.SparkScanBuilder -
Skipping aggregate pushdown: group by aggregation push down is not supported
[Test worker] INFO
org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown -
Output: id#13
[Test worker] INFO org.apache.iceberg.SnapshotScan - Scanning table
spark_with_views.default.table snapshot 2224231829492223177 created at
2025-11-25T03:47:31.504+00:00 with filter true
[Test worker] INFO org.apache.iceberg.BaseDistributedDataScan - Planning
file tasks locally for table spark_with_views.default.table
[Test worker] INFO
org.apache.iceberg.spark.source.SparkPartitioningAwareScan - Reporting
UnknownPartitioning with 1 partition(s) for table spark_with_views.default.table
[Test worker] INFO org.apache.iceberg.view.BaseMetastoreViewCatalog - View
properties set at catalog level through catalog properties: {}
[Test worker] INFO org.apache.iceberg.view.BaseMetastoreViewCatalog - View
properties enforced at catalog level through catalog properties: {}
[Test worker] INFO org.apache.iceberg.view.BaseViewOperations - Successfully
committed to view spark_with_views.default.createViewWithGroupByOrdinal989199
in 2 ms
[Test worker] INFO org.apache.iceberg.view.BaseViewOperations - Refreshing
view metadata from new version:
/default/createViewWithGroupByOrdinal989199/metadata/00000-66728e6c-ae56-400f-96dd-dbd36dba8ed9.gz.metadata.json
[Test worker] INFO org.apache.iceberg.view.BaseViewOperations - Refreshing
view metadata from new version:
/default/createViewWithGroupByOrdinal989199/metadata/00000-66728e6c-ae56-400f-96dd-dbd36dba8ed9.gz.metadata.json
[Test worker] INFO org.apache.iceberg.BaseMetastoreTableOperations -
Refreshing table metadata from new version:
/default/table/metadata/00002-b53fe6fa-4245-4fb3-abea-81a98c8cc849.metadata.json
[Test worker] INFO org.apache.iceberg.BaseMetastoreCatalog - Table loaded by
catalog: spark_with_views.default.table
[Test worker] INFO org.apache.iceberg.spark.source.SparkScanBuilder -
Skipping aggregate pushdown: group by aggregation push down is not supported
[Test worker] INFO
org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown -
```
which shows the optimization step running against the underlying table before
the view is created; this triggers a heavier-weight scan planning operation
that may require distributed planning.
#### Why
Looking at upstream Spark, the analogous view creation logic has explicit
code that ensures the view body is analyzed, but not optimized:
- [CreateViewCommand](https://github.com/apache/spark/blob/3f5a2b9fc3344b397320c8f507dd1ab9068e99d5/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala#L44-L95)
- [AnalysisOnlyCommand](https://github.com/apache/spark/blob/3f5a2b9fc3344b397320c8f507dd1ab9068e99d5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Command.scala#L45-L65)
In Iceberg, we bypass the regular upstream Spark code path and run our own
variants of view creation, which do not carry this analysis-only behavior.
Given that table scan planning is both redundant and slow in this case, we
should update the internal Iceberg view creation code so the view body
participates only in Spark's analysis phase, not its optimization phase.
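To make the proposed fix concrete, here is a toy Java model of the analysis-only pattern (this is not actual Spark or Iceberg code; all class and method names are hypothetical illustrations of `AnalysisOnlyCommand`'s idea): a plan node flags itself as analysis-only, and the "optimizer" skips expensive work such as scan planning for flagged nodes.

```java
import java.util.ArrayList;
import java.util.List;

public class AnalysisOnlySketch {
  // Hypothetical stand-in for a logical plan node.
  interface Plan {
    String name();
    boolean analysisOnly(); // if true, the optimizer must not touch this plan
  }

  // A view body: analyzed for correctness, but never optimized.
  record ViewBody(String name) implements Plan {
    public boolean analysisOnly() { return true; }
  }

  // A regular query: optimization (including scan planning) is expected.
  record TableScan(String name) implements Plan {
    public boolean analysisOnly() { return false; }
  }

  // Toy "optimizer": returns the names of plans it actually optimized,
  // skipping any plan flagged as analysis-only.
  static List<String> optimize(List<Plan> plans) {
    List<String> optimized = new ArrayList<>();
    for (Plan p : plans) {
      if (!p.analysisOnly()) {
        optimized.add(p.name()); // expensive scan planning would happen here
      }
    }
    return optimized;
  }

  public static void main(String[] args) {
    List<Plan> plans =
        List.of(new ViewBody("CREATE VIEW v"), new TableScan("SELECT * FROM t"));
    System.out.println(optimize(plans)); // prints [SELECT * FROM t]
  }
}
```

In this model, the fix amounts to having Iceberg's view creation plans report `analysisOnly() == true`, so `CREATE VIEW` pays only for analysis while ordinary queries still go through full optimization.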
### Willingness to contribute
- [x] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]