openinx commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r557109188



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
   public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
   public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
 
+  public static final String WRITE_SHUFFLE_BY_PARTITION = "write.shuffle-by.partition";

Review comment:
       > Otherwise, heavy data skew can be problematic for writer.
   
   @stevenzwu We've considered this data skew issue. In my view, the
recommended approach is to define a bucket in the table's PartitionSpec, for example:
   
   ```java
       PartitionSpec spec = PartitionSpec.builderFor(table.schema())
           .day("ts")
           .bucket("id", 16)
           .build();
   ```
   
   Then the current key-by method (the result of
[partitionKey#toPath](https://github.com/apache/iceberg/pull/2064/files#diff-0fa7d66fbfe363dd2992c26a69e3f29b631533fe1c7ab549e83c1f3f0d49153dR62)
 is actually something like ".../ts_day=2020-01-01/bucket-14/...") will spread the
input records across the buckets pseudo-randomly, by the hash of `id`. I mean that if
people define buckets under the partition path, then we don't have to introduce a
complex dispatch policy (collecting per-partition statistics and dispatching records
based on a weight value). Does that make sense to you?
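
To illustrate the idea, here is a minimal standalone sketch (no Iceberg or Flink dependency). The class `BucketKeySketch` and its methods are hypothetical names for illustration only. The `bucket` function is a simple stand-in hash, not Iceberg's actual bucket transform, and the `bucket-N` path segment just mirrors the example path above rather than the exact output of `partitionKey#toPath`. It shows that when every record falls on the same day, adding a bucket to the key still yields 16 distinct shuffle keys instead of one hot key.

```java
import java.util.HashMap;
import java.util.Map;

public class BucketKeySketch {

  // Stand-in for a bucket transform (assumption: NOT Iceberg's real hash function).
  static int bucket(long id, int numBuckets) {
    return Math.floorMod(Long.hashCode(id), numBuckets);
  }

  // Hypothetical shuffle key: day partition plus bucket, shaped like the path in the comment.
  static String partitionPath(String day, long id, int numBuckets) {
    return "ts_day=" + day + "/bucket-" + bucket(id, numBuckets);
  }

  // Counts how many records map to each shuffle key for a single (skewed) day.
  static Map<String, Integer> countByKey(String day, int records, int numBuckets) {
    Map<String, Integer> counts = new HashMap<>();
    for (long id = 0; id < records; id++) {
      counts.merge(partitionPath(day, id, numBuckets), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // All 100,000 records belong to one day, yet they are spread over 16 keys.
    countByKey("2020-01-01", 100_000, 16)
        .forEach((key, count) -> System.out.println(key + " -> " + count));
  }
}
```

With sequential ids and this stand-in hash the split is perfectly even; with a real hash and real ids it would only be approximately even, and a single very hot `id` would still land in one bucket.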




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


