rshanmugam1 commented on PR #4692:
URL: https://github.com/apache/iceberg/pull/4692#issuecomment-1604626671
I'm facing a similar use case. The input data is range-sorted using a custom
partitioner. When another writer reads the data, performs a simple
transformation, and writes it back, the sort order is not preserved: the number
of splits does not match the number of input files, which breaks the range
sort. Since the data size is substantial, a shuffle would be costly. I tried
the following, but it did not help:
spark.read()
    .option("file-open-cost", Long.MAX_VALUE) // creates 1 split per row group; need 1 split per file
@aokolnychyi, regarding "read.split.open-file-cost to Long.MaxValue to force
one file per split": this produces one split per row group, not one per file.
Am I missing something here, or is there another way to achieve this?
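To make the observed behavior concrete, here is a toy model of bin-packing split planning — my own sketch, not Iceberg's actual planner code. It treats each Parquet row group as a splittable unit whose weight is max(unitSize, openFileCost) and packs units into splits up to a target size. It illustrates why setting the open-file cost to Long.MAX_VALUE yields one split per row group rather than one split per file: every unit's weight overflows the target, so no two units ever share a split, but units within one file are not glued together either.

```java
public class SplitPlanSketch {
    // Toy model (an assumption for illustration, not Iceberg's real code):
    // pack splittable units (e.g. row groups) into splits of at most
    // targetSplitSize, weighting each unit at max(size, openFileCost).
    static int planSplits(long[] unitSizes, long targetSplitSize, long openFileCost) {
        int splits = 0;
        long current = 0;
        for (long size : unitSizes) {
            long weight = Math.max(size, openFileCost);
            // close the current split if adding this unit would exceed the
            // target; written as a subtraction to avoid long overflow
            if (current > 0 && targetSplitSize - current < weight) {
                splits++;
                current = 0;
            }
            current += weight;
        }
        if (current > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        // three hypothetical files, each with two 64 MB row groups
        long[] rowGroups = {64 * MB, 64 * MB, 64 * MB, 64 * MB, 64 * MB, 64 * MB};
        // small open-file cost: row groups get packed together, 3 splits
        System.out.println(planSplits(rowGroups, 128 * MB, 4 * MB));
        // Long.MAX_VALUE cost: every row group becomes its own split, 6 splits
        System.out.println(planSplits(rowGroups, 128 * MB, Long.MAX_VALUE));
    }
}
```

Under this model, getting exactly one split per *file* would require the planner to treat the whole file as the splittable unit, which the open-file-cost knob alone does not do.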
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]