robert8138 commented on PR #3461:
URL: https://github.com/apache/iceberg/pull/3461#issuecomment-1241406799

   Seeing Ryan's comment, it seems that the `ALTER TABLE ... WRITE ORDERED BY ...` 
SQL extension does not actually apply to `INSERT` before Spark 3.2. I've tried 
it myself (we are on Spark 3.1) and it didn't work. Has there been any 
development on this extension recently?
   
   Additionally, we've tried a few other suggestions Ryan gave in the other 
threads:
   
   * `fan out writer` - This didn't work for us, likely because our table is 
partitioned by `ds` with data going back as far as 2008, so the writer opens 
thousands of files at once and the overhead is too high.
   * `sorting` - we tried a global sort (`ORDER BY`), a local sort (`SORT BY`), 
and also `CLUSTER BY`. All three worked, but with very different cost and 
performance characteristics.
   
   I heard that this will be handled transparently starting in Spark 3.3, so 
that users no longer have to explicitly sort their data before an `INSERT` into 
a partitioned table. Is that correct? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

