[GitHub] [iceberg] rdblue commented on a change in pull request #2064: Flink: Support write.distribution-mode.

GitBox Thu, 21 Jan 2021 21:59:02 -0800


rdblue commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r562160730




##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
   public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
   public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
 
+  public static final String WRITE_SHUFFLE_BY_PARTITION = "write.shuffle-by.partition";

Review comment:
       @electrum, as for what a "local sort" means, option 2 sounds good to me 
for a task-level sort. If that sort is needlessly expensive, then it is okay 
for Trino to skip it. But if a table has a defined sort order, I think the 
right thing would be for Trino to apply it.
   
   For data distribution, it sounds like Trino will only support `none` and 
`hash` modes in the short term. That's reasonable given that you can't stage 
data and use it twice. Even with shuffle data reuse, global sort in Spark is 
quite expensive in some cases (doing a large join twice, for example). 
Eventually, we want to get to where the table metadata has a sketch of the data 
distribution so you can use that to get ranges for a global ordering.
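
   To make the `none` and `hash` modes concrete, here is a minimal sketch of how a writer might route rows under each one. This is only an illustration: `DistributionModeSketch`, `Mode`, and `targetTask` are hypothetical names and are not part of Iceberg, Trino, or Flink, and the partition key is assumed to be a plain string.

```java
import java.util.Objects;

// Hypothetical illustration only; none of these names exist in Iceberg.
final class DistributionModeSketch {

  // The two modes discussed above. A range mode would additionally need
  // range boundaries (for example, from a sketch of the data distribution).
  enum Mode { NONE, HASH }

  // Returns the index of the writer task that should receive a row.
  // NONE: no shuffle, the row stays on the task that produced it.
  // HASH: route by the hash of the partition key, so all rows for one
  //       partition arrive at the same writer task.
  static int targetTask(Mode mode, String partitionKey, int sourceTask, int numTasks) {
    switch (mode) {
      case HASH:
        // floorMod keeps the result in [0, numTasks) for negative hashes.
        return Math.floorMod(Objects.hashCode(partitionKey), numTasks);
      case NONE:
      default:
        return sourceTask;
    }
  }

  public static void main(String[] args) {
    int numTasks = 8;
    // The same partition key produced on two different source tasks.
    int fromTask0 = targetTask(Mode.HASH, "2021-01-21", 0, numTasks);
    int fromTask5 = targetTask(Mode.HASH, "2021-01-21", 5, numTasks);
    System.out.println("hash mode targets agree: " + (fromTask0 == fromTask5));
    System.out.println("none mode target: " + targetTask(Mode.NONE, "2021-01-21", 5, numTasks));
  }
}
```

   With `hash`, each partition's rows converge on one writer, which tends to reduce the number of files written per partition. With `none`, any task may write to any partition. Neither mode needs the data to be staged or read twice, which is what a range-based global ordering would require without precomputed ranges.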




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



