rshanmugam1 commented on PR #4692:
URL: https://github.com/apache/iceberg/pull/4692#issuecomment-1604626671
I'm facing a similar use case. The input data is range-sorted using a custom
partitioner. When another writer reads the data, performs a simple
transformation, and writes it back, the sort order is not preserved: the number
of splits does not match the number of input files, which breaks the range
sort. Since the data size is substantial, a shuffle would be costly. I tried
the following, but it did not help:
spark.read()
    .option("file-open-cost", Long.MAX_VALUE) // creates 1 split per row group; need 1 split per file
@aokolnychyi, regarding "read.split.open-file-cost to Long.MaxValue to force
one file per split": this produces one split per row group, not one per file.
Am I missing something here, or is there another way to achieve this?
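To make the observed behavior concrete, here is a toy model of bin-packing split planning — my own sketch, not Iceberg's actual planner code. It treats each Parquet row group as a splittable unit whose weight is max(unitSize, openFileCost) and packs units into splits up to a target size. It illustrates why setting the open-file cost to Long.MAX_VALUE yields one split per row group rather than one split per file: every unit's weight overflows the target, so no two units ever share a split, but units within one file are not glued together either.

```java
public class SplitPlanSketch {
    // Toy model (an assumption for illustration, not Iceberg's real code):
    // pack splittable units (e.g. row groups) into splits of at most
    // targetSplitSize, weighting each unit at max(size, openFileCost).
    static int planSplits(long[] unitSizes, long targetSplitSize, long openFileCost) {
        int splits = 0;
        long current = 0;
        for (long size : unitSizes) {
            long weight = Math.max(size, openFileCost);
            // close the current split if adding this unit would exceed the
            // target; written as a subtraction to avoid long overflow
            if (current > 0 && targetSplitSize - current < weight) {
                splits++;
                current = 0;
            }
            current += weight;
        }
        if (current > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        // three hypothetical files, each with two 64 MB row groups
        long[] rowGroups = {64 * MB, 64 * MB, 64 * MB, 64 * MB, 64 * MB, 64 * MB};
        // small open-file cost: row groups get packed together, 3 splits
        System.out.println(planSplits(rowGroups, 128 * MB, 4 * MB));
        // Long.MAX_VALUE cost: every row group becomes its own split, 6 splits
        System.out.println(planSplits(rowGroups, 128 * MB, Long.MAX_VALUE));
    }
}
```

Under this model, getting exactly one split per *file* would require the planner to treat the whole file as the splittable unit, which the open-file-cost knob alone does not do.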
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]