aokolnychyi commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r559076627



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
   public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
   public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
 
+  public static final String WRITE_SHUFFLE_BY_PARTITION = 
"write.shuffle-by.partition";

Review comment:
       We have touched this topic a few times during the community syncs and 
the table property approach combined with the partition spec and sort order 
seems reasonable to me. I think we would want to send a mail to the dev list 
once we have a local consensus here to make sure other query engines are ok 
with our idea.
   
   I like the table property approach versus a job/session config because it is 
enough to change it in one place and the change will be automatically 
propagated to all jobs without any changes in the job/session config. That 
being said, I think there should be a way to override the distribution mode in 
a job. In Spark, we can do that with write options. I think the same applies to 
cases when different query engines need different behavior. For example, Flink 
may want to use `none` and Spark batch jobs may want `sort`. Then having a 
default value in table props and the ability to override in a job should be 
sufficient.
   
   I support the idea of having one table property for all query engines as it 
seems flexible enough to cover various cases (at least, for Flink and Spark). 
If others have concerns, I'll be okay with engine-specific properties as well. 
I also understand that not all query engines will support this property right 
away. So having separate props may actually give us some benefits.
   
   @rdblue, thanks for the table! I think for Spark, if the table is unordered, 
we would want to add a local sort based on partition columns in all modes with 
a write option to disable inference. What do you think?
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to