[I] [Spark][Feature] Create or replace with ordering [iceberg]

via GitHub Tue, 18 Nov 2025 03:37:32 -0800


hide42 opened a new issue, #14612:
URL: https://github.com/apache/iceberg/issues/14612


   ### Feature Request / Improvement
   
   **Problem Statement**
   
   Currently, Apache Iceberg does not support specifying sort order directly in 
CREATE TABLE ... AS SELECT (CTAS) and REPLACE TABLE ... AS SELECT (RTAS) 
operations in Spark SQL. The only way to set a sort order is through a separate 
ALTER TABLE ... WRITE ORDERED BY command after table creation/replacement.
   
   **This limitation creates several operational challenges:**
   Current workarounds for full table refresh with sort order:
   Option A:
   ```
   -- Step 1: Create/replace empty table
   REPLACE TABLE db.sample AS SELECT * FROM source_table where 1=0;  
   
   -- Step 2: Set sort order 
   ALTER TABLE db.sample WRITE ORDERED BY category, id;  
     
   -- Step 3: Overwrite all data to apply sorting  
   INSERT TABLE db.sample SELECT * FROM source_table;
   ```
   Main problem with this approach:
   Table is temporarily empty between steps 1 and 3, making it unavailable to 
users
   
   Option B:
   ```
   -- Step 1: Create/replace table without sort order  
   REPLACE TABLE prod.db.sample AS SELECT * FROM source_table;  
   
   -- Step 2: rewrite with sort
   CALL catalog_name.system.rewrite_data_files(  
     table => 'db.sample',   
     strategy => 'sort',   
     sort_order => 'category ASC,id ASC',  
     options => map('rewrite-all', 'true')  
   );
   ```
   Main problem with this approach:
   Data is written twice: once unsorted, then again sorted
   
   Opion C: 
   Manual range distribution with bucket hints: Extremely complex.Not 
guaranteed to work. 
   
   This workarounds demonstrates the complexity users face without native 
SQL/(+Pyspark DF API) syntax support for sort order in CTAS/RTAS.
   
   Proposed Solution:
   Like Hive-table syntax without CLUSTERED BY:
   ```
   REPLACE TABLE db.sample  
   USING iceberg  
   WRITE ORDERED BY category ASC,id ASC
   AS SELECT * FROM source_table;
   ```
   From tblproperties:
   ```
   REPLACE TABLE prod.db.sample  
   USING iceberg  
   TBLPROPERTIES ('sort-order'='category ASC,id ASC')  
   AS SELECT ...
   ```
   
   Before proceeding with this feature request, I'd like to confirm:
   Is there truly no workaround?
   Why isn't this supported? (This seems like a fundamental feature for data 
lake operations. I'm curious if there are)
   Community need? (Are other users experiencing this limitation?)
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [x] I can contribute this improvement/feature independently
   - [x] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [x] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] [Spark][Feature] Create or replace with ordering [iceberg]

Reply via email to