openinx commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r560641150
##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
+ public static final String WRITE_SHUFFLE_BY_PARTITION =
"write.shuffle-by.partition";
Review comment:
> Flink may eventually provide a way to order within data files, but I
> think that is less important than clustering data across files so that data
> files can be skipped in queries.
Agreed. Sorting within data files would be really helpful for page
skipping, but it would add cost to streaming jobs.
Range distribution by sort keys is coarser-grained, but it is good enough
for a streaming job to cluster keys so that queries can filter out whole
data files. I think it is the better-balanced choice when trading off write
efficiency against read performance.
It makes sense to me to rewrite those range-distributed data files into
row-ordered files if there are heavy reads that depend on them.
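
To make the trade-off concrete, here is a minimal sketch of range distribution by a sort key. It is not Iceberg or Flink code: the class `RangeDistributionSketch`, its methods, the integer keys, and the fixed boundaries are all hypothetical and chosen only for illustration. Rows are routed to a writer by key range, so each file covers a narrow key interval even though the rows inside a file stay in arrival order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Iceberg or Flink.
public class RangeDistributionSketch {

  // Returns the index of the range (one range per writer/file) that holds the key.
  // boundaries must be sorted ascending; range i holds keys in
  // [boundaries[i - 1], boundaries[i]), with open ends for the first and last range.
  public static int rangeFor(int key, int[] boundaries) {
    int pos = Arrays.binarySearch(boundaries, key);
    // binarySearch returns -(insertionPoint) - 1 when the key is absent.
    return pos >= 0 ? pos + 1 : -(pos + 1);
  }

  // Groups keys by range while preserving arrival order inside each group,
  // so each "file" is clustered by key range but not sorted internally.
  public static List<List<Integer>> distribute(int[] keys, int[] boundaries) {
    List<List<Integer>> files = new ArrayList<>();
    for (int i = 0; i <= boundaries.length; i++) {
      files.add(new ArrayList<>());
    }
    for (int key : keys) {
      files.get(rangeFor(key, boundaries)).add(key);
    }
    return files;
  }

  // Counts the files whose [min, max] key interval could contain the lookup key.
  // Every other file can be skipped from min/max statistics alone.
  public static int filesToScan(List<List<Integer>> files, int lookup) {
    int count = 0;
    for (List<Integer> file : files) {
      if (file.isEmpty()) {
        continue;
      }
      int min = Collections.min(file);
      int max = Collections.max(file);
      if (lookup >= min && lookup <= max) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int[] keys = {250, 10, 130, 90, 210, 170, 40, 299}; // arrival order
    int[] boundaries = {100, 200};                      // assumed split points
    List<List<Integer>> files = distribute(keys, boundaries);
    System.out.println(files);                   // [[10, 90, 40], [130, 170], [250, 210, 299]]
    System.out.println(filesToScan(files, 130)); // 1
  }
}
```

A lookup for key 130 touches one of the three files here, even though no file is sorted internally. Sorting inside each file would additionally allow page skipping, which is the part that could be left to a later rewrite of the files.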