jbewing opened a new issue, #14680:
URL: https://github.com/apache/iceberg/issues/14680

   ### Apache Iceberg version
   
   1.10.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   #### What
   Currently, when creating a view via Apache Spark (I'm running a fork of Spark 3.5), I've observed that view creation via
   ```java
   SparkSession spark = ...;
   spark.sql("CREATE VIEW ... ")
   ```
   
   is slow, and the slowness generally scales with the size of the underlying table. For a large table (hundreds of TBs), creating even a modest number of views (a couple thousand) takes roughly 12 hours and requires a moderately sized Spark cluster (~100 CPUs).
   
   This seems excessive. Based on logs and flamegraphs, the overhead comes from Spark performing unnecessary optimization of the view body. For the following unit test:
   ```java
     @TestTemplate
     public void createViewWithGroupByOrdinal() throws NoSuchTableException {
       insertRows(3);
       insertRows(2);
       String viewName = viewName("createViewWithGroupByOrdinal");
       sql("CREATE VIEW %s AS SELECT id, count(1) FROM %s GROUP BY 1", 
viewName, tableName);
   
       assertThat(sql("SELECT * FROM %s", viewName))
           .hasSize(3)
           .containsExactlyInAnyOrder(row(1, 2L), row(2, 2L), row(3, 1L));
     }
   ```
   
   You can observe logs such as:
   ```
   [Test worker] INFO 
org.apache.spark.sql.execution.datasources.v2.AppendDataExec - Data source 
write support IcebergBatchWrite(table=spark_with_views.default.table, 
format=PARQUET) committed.
   [Test worker] INFO org.apache.iceberg.BaseMetastoreTableOperations - 
Refreshing table metadata from new version: 
/default/table/metadata/00002-b53fe6fa-4245-4fb3-abea-81a98c8cc849.metadata.json
   [Test worker] INFO org.apache.iceberg.BaseMetastoreCatalog - Table loaded by 
catalog: spark_with_views.default.table
   [Test worker] INFO org.apache.iceberg.spark.source.SparkScanBuilder - 
Skipping aggregate pushdown: group by aggregation push down is not supported
   [Test worker] INFO 
org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown - 
   Output: id#13
              
   [Test worker] INFO org.apache.iceberg.SnapshotScan - Scanning table 
spark_with_views.default.table snapshot 2224231829492223177 created at 
2025-11-25T03:47:31.504+00:00 with filter true
   [Test worker] INFO org.apache.iceberg.BaseDistributedDataScan - Planning 
file tasks locally for table spark_with_views.default.table
   [Test worker] INFO 
org.apache.iceberg.spark.source.SparkPartitioningAwareScan - Reporting 
UnknownPartitioning with 1 partition(s) for table spark_with_views.default.table
   [Test worker] INFO org.apache.iceberg.view.BaseMetastoreViewCatalog - View 
properties set at catalog level through catalog properties: {}
   [Test worker] INFO org.apache.iceberg.view.BaseMetastoreViewCatalog - View 
properties enforced at catalog level through catalog properties: {}
   [Test worker] INFO org.apache.iceberg.view.BaseViewOperations - Successfully 
committed to view spark_with_views.default.createViewWithGroupByOrdinal989199 
in 2 ms
   [Test worker] INFO org.apache.iceberg.view.BaseViewOperations - Refreshing 
view metadata from new version: 
/default/createViewWithGroupByOrdinal989199/metadata/00000-66728e6c-ae56-400f-96dd-dbd36dba8ed9.gz.metadata.json
   [Test worker] INFO org.apache.iceberg.view.BaseViewOperations - Refreshing 
view metadata from new version: 
/default/createViewWithGroupByOrdinal989199/metadata/00000-66728e6c-ae56-400f-96dd-dbd36dba8ed9.gz.metadata.json
   [Test worker] INFO org.apache.iceberg.BaseMetastoreTableOperations - 
Refreshing table metadata from new version: 
/default/table/metadata/00002-b53fe6fa-4245-4fb3-abea-81a98c8cc849.metadata.json
   [Test worker] INFO org.apache.iceberg.BaseMetastoreCatalog - Table loaded by 
catalog: spark_with_views.default.table
   [Test worker] INFO org.apache.iceberg.spark.source.SparkScanBuilder - 
Skipping aggregate pushdown: group by aggregation push down is not supported
   [Test worker] INFO 
org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown - 
   ```
   
   These logs show the optimization step, including table scan planning, running before the view is created. On large tables that scan planning is a heavyweight operation that may require distributed planning.
   
   #### Why
   
   Looking at upstream Spark, the corresponding view creation logic has explicit handling that allows the view body to be analyzed, but not optimized:
   - 
[CreateViewCommand](https://github.com/apache/spark/blob/3f5a2b9fc3344b397320c8f507dd1ab9068e99d5/sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala#L44-L95)
   - 
[AnalysisOnlyCommand](https://github.com/apache/spark/blob/3f5a2b9fc3344b397320c8f507dd1ab9068e99d5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Command.scala#L45-L65)
   
   In Iceberg, we bypass the regular upstream Spark code and run our own variants of view creation, which don't pull in this analysis-only behavior. Given that table scan planning is both redundant and slow in this case, we should update the internal Iceberg view creation code so that the view body participates only in Spark's analysis phase, not the optimization phase.
   
   ### Willingness to contribute
   
   - [x] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
