openinx commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r557109188
##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
+ public static final String WRITE_SHUFFLE_BY_PARTITION =
"write.shuffle-by.partition";
Review comment:
> Otherwise, heavy data skew can be problematic for writer.
@stevenzwu We've considered this data skew issue. In my opinion, the
recommended approach is to define a bucket in the table's PartitionSpec,
for example:
```java
PartitionSpec spec = PartitionSpec.builderFor(table.schema())
    .day("ts")
    .bucket("id", 16)
    .build();
```
Then the current key-by method (the key,
[partitionKey#toPath](https://github.com/apache/iceberg/pull/2064/files#diff-0fa7d66fbfe363dd2992c26a69e3f29b631533fe1c7ab549e83c1f3f0d49153dR62),
is actually ".../ts_day=2020-01-01/bucket-14/...") will dispatch the
input records into different buckets, effectively at random, based on the
hash of `id`. In other words, if people define buckets under the partition
path, then we don't have to introduce such a complex dispatch policy
(collecting per-partition statistics and dispatching records based on a
weight value). Does that make sense to you?
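
To make the argument concrete, here is a rough, self-contained sketch of the effect, not the code in this PR. It assumes a hypothetical `bucket` helper that is just a plain hash modulo the bucket count (Iceberg's real bucket transform uses its own hash function), and the key string only mimics the example path above rather than the exact output of `partitionKey#toPath`. All class and method names are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class BucketKeySketch {

  // Hypothetical stand-in for the bucket transform: a plain hash modulo the bucket count.
  // Iceberg's real transform uses its own hash function; only the spreading effect matters here.
  public static int bucket(long id, int numBuckets) {
    return Math.floorMod(Long.hashCode(id), numBuckets);
  }

  // Shuffle key for one record. The string layout mimics the example path in this comment,
  // it is not meant to reproduce Iceberg's exact path format.
  public static String shuffleKey(String day, long id, int numBuckets, boolean withBucket) {
    String key = "ts_day=" + day;
    return withBucket ? key + "/bucket-" + bucket(id, numBuckets) : key;
  }

  // Distinct shuffle keys produced by `records` rows that all fall on the same (hot) day.
  public static Set<String> distinctKeys(int records, int numBuckets, boolean withBucket) {
    Set<String> keys = new HashSet<>();
    for (long id = 0; id < records; id++) {
      keys.add(shuffleKey("2020-01-01", id, numBuckets, withBucket));
    }
    return keys;
  }

  public static void main(String[] args) {
    // Day-only spec: every record of the hot day maps to a single key, hence a single writer.
    System.out.println("day only:     " + distinctKeys(1000, 16, false).size() + " key(s)");
    // Day + bucket spec: the same records are spread over up to 16 keys.
    System.out.println("day + bucket: " + distinctKeys(1000, 16, true).size() + " key(s)");
  }
}
```

With a day-only spec, all records of a hot day share one shuffle key and land on one writer. Adding the bucket to the spec splits that day into up to 16 keys without any statistics collection, which is the reason a weight-based dispatch policy may be unnecessary.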