Hi everyone,

It seems there is no rule that enforces sort order and distribution when using "CREATE TABLE table PARTITIONED BY (...) AS (SELECT ... FROM );" statements. With Iceberg, partitioned tables have a distribution requirement[1], and it would be nice to have it applied automatically, just like when inserting into an existing table.
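For context, this is the path that works today: the connector's Write reports its requirements through RequiresDistributionAndOrdering, and V2Writes rewrites the query to satisfy them. A minimal sketch of the table side (not Iceberg's actual code, just the public connector API, with a made-up class name and a "ts" column as an example):

import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortOrder}
import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write}

// Toy Write for a table partitioned by days(ts): it asks Spark to cluster
// incoming rows by the partition expression. V2Writes can honour this
// because the table, and therefore its Write, already exists.
class PartitionedTableWrite extends Write with RequiresDistributionAndOrdering {

  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.days("ts")))

  // No sort requirement in this toy example.
  override def requiredOrdering(): Array[SortOrder] = Array.empty

  // toBatch()/toStreaming() omitted; the actual writer plumbing is not
  // relevant to the distribution question.
}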
Looking at "org.apache.spark.sql.execution.datasources.v2.V2Writes", there are several rules that enforce these requirements, but there is no rule for the "CreateTableAsSelect" logical plan. I've started writing a patch for this, but I've hit a chicken-and-egg problem with the table: the other V2Writes rules all have an existing relation to work with, which is not the case for CreateTableAsSelect (which makes sense, since the table does not exist yet). This is tricky because the distribution requirement is derived from the table. I probably missed something, but I couldn't find a clean way to implement this.

I have a few ideas (probably bad ones), like adding a new catalog interface that reports the requirements based on the partition spec and table properties, both of which are available to "CreateTableAsSelect". A rough sketch of that idea is at the end of this mail. I'd love to hear some opinions on this issue.

[1] https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables
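Rough sketch of the catalog-interface idea. The trait name and method names are made up; only the connector types it references exist in Spark today:

import org.apache.spark.sql.connector.catalog.TableCatalog
import org.apache.spark.sql.connector.distributions.Distribution
import org.apache.spark.sql.connector.expressions.{SortOrder, Transform}

// Hypothetical mix-in that a catalog (e.g. Iceberg's SparkCatalog) could
// implement so that a planner rule can ask for write requirements before
// the table exists, using only what CreateTableAsSelect already carries.
trait SupportsCreateTableWriteRequirements extends TableCatalog {

  // Distribution the catalog would require for a table created with this
  // partition spec and these properties.
  def requiredDistribution(
      partitions: Array[Transform],
      properties: java.util.Map[String, String]): Distribution

  // Sort order the catalog would require for such a table.
  def requiredOrdering(
      partitions: Array[Transform],
      properties: java.util.Map[String, String]): Array[SortOrder]
}

A V2Writes-style rule matching CreateTableAsSelect could then check whether the target catalog implements this trait and, if so, wrap the query in the same repartition/sort it applies for writes to existing tables.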