stevenzwu commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r559264472
##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
+ public static final String WRITE_SHUFFLE_BY_PARTITION =
"write.shuffle-by.partition";
Review comment:
I like Ryan's proposal. Maybe change the "partition" mode to
"hash-partition" just to be more accurate. Technically, "range-partition" in
the "sort" mode is also a "partition".
For the sort with unordered, "bin-packing" may be more optimal.
E.g., this is the traffic distribution for a partition column.
| Bucket | B0 | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Weight | 1 | 10 | 1 | 10 | 1 | 10 | 1 | 10 | 1 | 10 |
Assuming writer parallelism is 11, the total weight is 55, so each writer task
should get data with a weight of 5 in a perfect world. Each bucket with a
weight of 10 is split across 2 writer tasks, and the five buckets with a weight
of 1 are bundled into a single writer task. Since this is a partition column,
each partition key is written to a separate file anyway, so bin-packing doesn't
hurt data locality and can improve the balance among writer tasks.
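A minimal sketch of that assignment, just to make the arithmetic concrete. The
class and method names here are hypothetical and not part of Iceberg or this
PR, and it assumes positive weights known up front. It uses a simple greedy
pass, so on other inputs it may use more tasks than the requested parallelism:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not an Iceberg API.
class BinPackingSketch {

  // Returns, for each bucket index, the writer task ids that receive its data.
  static Map<Integer, List<Integer>> assign(int[] weights, int parallelism) {
    long total = 0;
    for (int w : weights) {
      total += w;
    }
    // Ideal weight per writer task (55 / 11 = 5 for the table above).
    double target = (double) total / parallelism;

    Map<Integer, List<Integer>> assignment = new LinkedHashMap<>();
    int nextTask = 0;

    // Pass 1: a bucket at or above the target is split across dedicated tasks.
    for (int b = 0; b < weights.length; b++) {
      if (weights[b] >= target) {
        int tasks = (int) Math.ceil(weights[b] / target);
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
          ids.add(nextTask++);
        }
        assignment.put(b, ids);
      }
    }

    // Pass 2: light buckets are bundled into a shared task until it is full.
    int sharedTask = -1;
    double sharedLoad = 0;
    for (int b = 0; b < weights.length; b++) {
      if (weights[b] < target) {
        if (sharedTask < 0 || sharedLoad + weights[b] > target) {
          sharedTask = nextTask++;
          sharedLoad = 0;
        }
        sharedLoad += weights[b];
        List<Integer> ids = new ArrayList<>();
        ids.add(sharedTask);
        assignment.put(b, ids);
      }
    }
    return assignment;
  }
}
```

With the weights from the table and a parallelism of 11, each weight-10 bucket
gets two dedicated tasks and the five weight-1 buckets share one task.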
If this is a non-partition column in the ordered scenario, we may not want to
apply bin-packing, and range partitioning might be better. Otherwise, one
writer task will write a file with buckets 0, 2, 4, 6, and 8. Since Iceberg
only has min-max column stats, the file will have (min=0, max=8), which is not
great for filtering. If Iceberg supported `list-of-values` column stats, it
might be useful for some scenarios.
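To illustrate the filtering concern, here is a small sketch of a min-max
check. The class and method are hypothetical and only model the idea; this is
not how Iceberg evaluates its metrics. A file can be skipped only when the
lookup value falls outside its [min, max] range:

```java
import java.util.Arrays;

// Hypothetical illustration only; not an Iceberg API.
class MinMaxPruningSketch {

  // Returns true when min-max stats cannot rule the file out for this value.
  // Assumes fileValues is non-empty.
  static boolean mightContain(int[] fileValues, int lookup) {
    int min = Arrays.stream(fileValues).min().getAsInt();
    int max = Arrays.stream(fileValues).max().getAsInt();
    return lookup >= min && lookup <= max;
  }
}
```

A bin-packed file holding buckets 0, 2, 4, 6, 8 cannot be skipped for a filter
on bucket 3 even though it has no matching rows, while a range-partitioned file
holding buckets 0, 1, 2 can be.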
Not sure whether this is a case of over-engineering, though.