rdblue commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r560371817
##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
+  public static final String WRITE_SHUFFLE_BY_PARTITION =
+      "write.shuffle-by.partition";
Review comment:
@stevenzwu, I used `partition` to describe distributing by partition
key. I'm trying to reserve "partition" for table partitions, not
partitioning in an engine, which I would call "distribution" in table
metadata. That distinction avoids giving users multiple definitions of
"partition" to think about: "partition" refers to table partitioning, and
"distribution" refers to how data is assigned to tasks in an engine.
You also raise a good point about partitioning. If the number of partitions
is the same order of magnitude as the number of writers, then a hash
assignment strategy could be a problem: some tasks could get two partitions
and some could get none. But I'm not sure how we would detect that case, or
how we would know the output size needed for your bin-packing suggestion. We
could add a factor that distributes rows to N writers per partition to help
balance the hashing, and when the number of partitions is high, hash
assignment works well. In a situation like the one you're describing, I
think the `sort` distribution mode is going to be the best option.
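To make the skew concrete, here is a small standalone sketch (not from the PR; `busyWriters`, the murmur-style `mix` finalizer, and the salt factor are all hypothetical illustrations, not Iceberg's actual hashing) of how hashing P partitions onto roughly P writer tasks leaves some writers idle, and how fanning each partition out across N salted sub-keys fills in the gaps:

```java
public class HashSkewDemo {

  // Murmur3-style finalizer to scramble integer keys (illustrative only).
  static int mix(int h) {
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }

  // Counts how many of `writers` tasks receive at least one partition when
  // each partition is split into `salt` sub-keys before hashing.
  static int busyWriters(int partitions, int writers, int salt) {
    boolean[] busy = new boolean[writers];
    for (int p = 0; p < partitions; p++) {
      for (int s = 0; s < salt; s++) {
        busy[Math.floorMod(mix(p * 31 + s), writers)] = true;
      }
    }
    int count = 0;
    for (boolean b : busy) {
      if (b) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // With partitions == writers and no salting, hash collisions leave some
    // writers with no data while others get two or more partitions.
    System.out.println("salt=1: " + busyWriters(16, 16, 1) + " of 16 writers busy");
    // Salting each partition across 4 sub-keys spreads the load further.
    System.out.println("salt=4: " + busyWriters(16, 16, 4) + " of 16 writers busy");
  }
}
```

Since the salt=1 keys are a subset of the salt=4 keys, salting can only increase the number of busy writers; the remaining question is choosing N without knowing the output size up front.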
For now, I think the current `keyBy` implementation is a step in the right
direction. We can iterate to improve on it.