hide42 opened a new issue, #14612:
URL: https://github.com/apache/iceberg/issues/14612
### Feature Request / Improvement
**Problem Statement**
Currently, Apache Iceberg does not support specifying sort order directly in
CREATE TABLE ... AS SELECT (CTAS) and REPLACE TABLE ... AS SELECT (RTAS)
operations in Spark SQL. The only way to set a sort order is through a separate
ALTER TABLE ... WRITE ORDERED BY command after table creation/replacement.
**This limitation creates several operational challenges:**
Current workarounds for full table refresh with sort order:
Option A:
```
-- Step 1: Create/replace empty table
REPLACE TABLE db.sample AS SELECT * FROM source_table where 1=0;
-- Step 2: Set sort order
ALTER TABLE db.sample WRITE ORDERED BY category, id;
-- Step 3: Overwrite all data to apply sorting
INSERT TABLE db.sample SELECT * FROM source_table;
```
Main problem with this approach:
Table is temporarily empty between steps 1 and 3, making it unavailable to
users
Option B:
```
-- Step 1: Create/replace table without sort order
REPLACE TABLE prod.db.sample AS SELECT * FROM source_table;
-- Step 2: rewrite with sort
CALL catalog_name.system.rewrite_data_files(
table => 'db.sample',
strategy => 'sort',
sort_order => 'category ASC,id ASC',
options => map('rewrite-all', 'true')
);
```
Main problem with this approach:
Data is written twice: once unsorted, then again sorted
Opion C:
Manual range distribution with bucket hints: Extremely complex.Not
guaranteed to work.
This workarounds demonstrates the complexity users face without native
SQL/(+Pyspark DF API) syntax support for sort order in CTAS/RTAS.
Proposed Solution:
Like Hive-table syntax without CLUSTERED BY:
```
REPLACE TABLE db.sample
USING iceberg
WRITE ORDERED BY category ASC,id ASC
AS SELECT * FROM source_table;
```
From tblproperties:
```
REPLACE TABLE prod.db.sample
USING iceberg
TBLPROPERTIES ('sort-order'='category ASC,id ASC')
AS SELECT ...
```
Before proceeding with this feature request, I'd like to confirm:
Is there truly no workaround?
Why isn't this supported? (This seems like a fundamental feature for data
lake operations. I'm curious if there are)
Community need? (Are other users experiencing this limitation?)
### Query engine
Spark
### Willingness to contribute
- [x] I can contribute this improvement/feature independently
- [x] I would be willing to contribute this improvement/feature with
guidance from the Iceberg community
- [x] I cannot contribute this improvement/feature at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]