[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2064: Flink: Add option to shuffle by partition key in iceberg sink.

GitBox Tue, 12 Jan 2021 02:59:39 -0800


aokolnychyi commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r555685966




##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
   public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
   public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
 
+  public static final String WRITE_SHUFFLE_BY_PARTITION = "write.shuffle-by.partition";

Review comment:
       I share @stevenzwu's concern: a hash distribution by partition spec 
would co-locate all entries for the same partition in the same task, which can 
leave a single task with too much data. For batch jobs, the global sort in 
Spark would be a better option, as it does skew estimation and the sort order 
can be used to split data for the same partition across multiple tasks.
   
   To sum up, I think we should be flexible and support 3 modes to cover 
different use cases.
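
To make the skew concern concrete, here is a minimal, self-contained sketch. It is not Iceberg, Flink, or Spark code: the `Row` class, the helper names, and the sample data are all hypothetical, and the "range" mode is a simplified stand-in for a sort-based distribution (it sorts by partition plus one extra column and cuts the sorted rows into equal ranges, with no real skew estimation).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: hypothetical names, not an Iceberg/Flink/Spark API.
class ShuffleSkewSketch {

  // Hypothetical row: a partition key plus one extra sort column.
  static final class Row {
    final String partition;
    final int id;

    Row(String partition, int id) {
      this.partition = partition;
      this.id = id;
    }
  }

  static List<List<Row>> emptyTasks(int numTasks) {
    List<List<Row>> tasks = new ArrayList<>();
    for (int i = 0; i < numTasks; i++) {
      tasks.add(new ArrayList<>());
    }
    return tasks;
  }

  // Hash distribution: every row of a given partition lands in the same task.
  static List<List<Row>> hashDistribute(List<Row> rows, int numTasks) {
    List<List<Row>> tasks = emptyTasks(numTasks);
    for (Row row : rows) {
      tasks.get(Math.floorMod(row.partition.hashCode(), numTasks)).add(row);
    }
    return tasks;
  }

  // Simplified range distribution: sort by (partition, id), then cut the
  // sorted rows into contiguous ranges of equal size, so that a hot
  // partition can span several tasks.
  static List<List<Row>> rangeDistribute(List<Row> rows, int numTasks) {
    List<Row> sorted = new ArrayList<>(rows);
    sorted.sort(Comparator.comparing((Row r) -> r.partition).thenComparingInt(r -> r.id));
    List<List<Row>> tasks = emptyTasks(numTasks);
    for (int i = 0; i < sorted.size(); i++) {
      tasks.get((int) ((long) i * numTasks / sorted.size())).add(sorted.get(i));
    }
    return tasks;
  }

  static int maxTaskSize(List<List<Row>> tasks) {
    return tasks.stream().mapToInt(List::size).max().orElse(0);
  }

  // Number of tasks that hold at least one row of the given partition.
  static long tasksHolding(List<List<Row>> tasks, String partition) {
    return tasks.stream()
        .filter(task -> task.stream().anyMatch(r -> r.partition.equals(partition)))
        .count();
  }

  // Made-up sample: one hot partition with 900 rows, ten small ones with 10 rows each.
  static List<Row> skewedRows() {
    List<Row> rows = new ArrayList<>();
    for (int i = 0; i < 900; i++) {
      rows.add(new Row("2021-01-12", i));
    }
    for (int i = 0; i < 100; i++) {
      rows.add(new Row(String.format("2021-01-%02d", 1 + i % 10), i));
    }
    return rows;
  }

  public static void main(String[] args) {
    List<Row> rows = skewedRows();
    List<List<Row>> hashed = hashDistribute(rows, 4);
    List<List<Row>> ranged = rangeDistribute(rows, 4);
    System.out.println("hash:  largest task = " + maxTaskSize(hashed)
        + ", tasks holding hot partition = " + tasksHolding(hashed, "2021-01-12"));
    System.out.println("range: largest task = " + maxTaskSize(ranged)
        + ", tasks holding hot partition = " + tasksHolding(ranged, "2021-01-12"));
  }
}
```

With this sample, the hash mode puts all 900 rows of the hot partition into a single task, while the sorted ranges spread them across tasks. A real range shuffle would pick its boundaries from sampled data rather than exact positions, so the split would be less even than in this sketch.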




