[GitHub] [iceberg] openinx commented on a change in pull request #2064: Flink: Add option to shuffle by partition key in iceberg sink.

GitBox Wed, 13 Jan 2021 23:36:49 -0800


openinx commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r557109188




##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
   public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
   public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
 
+  public static final String WRITE_SHUFFLE_BY_PARTITION = "write.shuffle-by.partition";

Review comment:
       > Otherwise, heavy data skew can be problematic for writer.
   
   @stevenzwu  We've considered this data skew issue. In my view, the 
recommended approach is to define a bucket in the table's PartitionSpec, for example: 
   
   ```java
       PartitionSpec spec = PartitionSpec.builderFor(table.schema())
           .day("ts")
           .bucket("id", 16)
           .build();
   ```
   
   Then the current key-by method (the 
[partitionKey#toPath](https://github.com/apache/iceberg/pull/2064/files#diff-0fa7d66fbfe363dd2992c26a69e3f29b631533fe1c7ab549e83c1f3f0d49153dR62)
 is actually ".../ts_day=2020-01-01/bucket-14/...") will spread the 
input records across the buckets, effectively at random (by hash of `id`). 
I mean, if people define their buckets under the partition path, then we don't 
have to introduce such a complex dispatch policy (collecting per-partition 
statistics and dispatching records based on a weight value). Does that make 
sense to you?
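
To illustrate the point, here is a rough, standalone sketch (no Iceberg or Flink dependency) of how adding a bucket turns one hot day into several distinct keys. Everything in it is an assumption for illustration: the class `BucketedKeyBySketch`, the helper names, and the use of `Long.hashCode` in place of Iceberg's real bucket transform (which is Murmur3-based). The path format is simply the one quoted above.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: mimics keying records by a "day + bucket" partition path. */
class BucketedKeyBySketch {

    // Hypothetical stand-in for Iceberg's bucket transform; the real one uses a Murmur3 hash.
    static int bucket(long id, int numBuckets) {
        return Math.floorMod(Long.hashCode(id), numBuckets);
    }

    // Builds a key in the path format quoted in the comment above (illustrative only).
    static String partitionPath(String day, long id, int numBuckets) {
        return "ts_day=" + day + "/bucket-" + bucket(id, numBuckets);
    }

    // Counts how many of the ids 0..records-1 fall under each key for a single day.
    static Map<String, Integer> countPerKey(String day, int records, int numBuckets) {
        Map<String, Integer> counts = new HashMap<>();
        for (long id = 0; id < records; id++) {
            counts.merge(partitionPath(day, id, numBuckets), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // With one bucket, all records of the hot day share a single key;
        // with 16 buckets, the same records are split over 16 keys.
        System.out.println("keys with 1 bucket:   " + countPerKey("2020-01-01", 1600, 1).size());
        System.out.println("keys with 16 buckets: " + countPerKey("2020-01-01", 1600, 16).size());
    }
}
```

Each distinct key can then be routed to a different writer by the key-by, so a skewed day no longer lands on one writer. How evenly the keys map onto writer subtasks still depends on the engine's own key hashing, which this sketch does not model.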




