aokolnychyi commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r555685966



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
   public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
   public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
 
+  public static final String WRITE_SHUFFLE_BY_PARTITION = "write.shuffle-by.partition";

Review comment:
       I share @stevenzwu's concern: a hash distribution by partition spec 
co-locates all rows for a given partition in a single task, which can leave 
that task with too much data. For batch jobs, a global sort in Spark would be 
a better option here, since it estimates skew and can use the sort order to 
split the data for a single partition across multiple tasks.
   
   To sum up, I think we should be flexible and support three modes to cover 
different use cases.
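
   To make the skew concern concrete, here is a minimal, self-contained sketch. 
It is not Iceberg or Spark code: `DistributionSketch`, `Row`, `hashDistribute`, 
`rangeDistribute`, and `skewedRows` are hypothetical names, and the range split 
uses exact quantiles of the sorted rows as a stand-in for the sampled range 
bounds a real engine would compute. It only counts how many rows each task 
would receive under the two strategies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Iceberg or Spark.
class DistributionSketch {

  // A row reduced to its partition value and a sort key within that partition.
  static final class Row {
    final String partition;
    final int key;

    Row(String partition, int key) {
      this.partition = partition;
      this.key = key;
    }
  }

  // Hash distribution: the task depends on the partition value alone,
  // so every row of a given partition lands in the same task.
  static int[] hashDistribute(List<Row> rows, int numTasks) {
    int[] counts = new int[numTasks];
    for (Row row : rows) {
      counts[Math.floorMod(row.partition.hashCode(), numTasks)]++;
    }
    return counts;
  }

  // Range distribution (idealized): order rows by (partition, key) and cut the
  // ordered run into contiguous, equally sized ranges, one per task.
  // A large partition can therefore span several tasks.
  static int[] rangeDistribute(List<Row> rows, int numTasks) {
    List<Row> sorted = new ArrayList<>(rows);
    sorted.sort(Comparator.comparing((Row r) -> r.partition).thenComparingInt(r -> r.key));
    int[] counts = new int[numTasks];
    for (int i = 0; i < sorted.size(); i++) {
      counts[(int) ((long) i * numTasks / sorted.size())]++;
    }
    return counts;
  }

  // Skewed input: partition "a" holds 90 rows, partition "b" holds 10.
  static List<Row> skewedRows() {
    List<Row> rows = new ArrayList<>();
    for (int i = 0; i < 90; i++) {
      rows.add(new Row("a", i));
    }
    for (int i = 0; i < 10; i++) {
      rows.add(new Row("b", i));
    }
    return rows;
  }

  public static void main(String[] args) {
    List<Row> rows = skewedRows();
    System.out.println("hash:  " + Arrays.toString(hashDistribute(rows, 4)));
    System.out.println("range: " + Arrays.toString(rangeDistribute(rows, 4)));
  }
}
```

   With four tasks, the hash variant puts all 90 rows of partition "a" into one 
task, while the range variant gives each task 25 rows by splitting "a" across 
tasks. The trade-off in this sketch is that a range split can write files for 
the same partition from more than one task.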






