aokolnychyi commented on a change in pull request #2064:
URL: https://github.com/apache/iceberg/pull/2064#discussion_r559076627
##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -138,6 +138,9 @@ private TableProperties() {
public static final String ENGINE_HIVE_ENABLED = "engine.hive.enabled";
public static final boolean ENGINE_HIVE_ENABLED_DEFAULT = false;
+ public static final String WRITE_SHUFFLE_BY_PARTITION =
"write.shuffle-by.partition";
Review comment:
We have touched this topic a few times during the community syncs and
the table property approach combined with the partition spec and sort order
seems reasonable to me. I think we would want to send a mail to the dev list
once we have a local consensus here to make sure other query engines are ok
with our idea.
I like the table property approach versus a job/session config because it is
enough to change it in one place and the change will be automatically
propagated to all jobs without any changes in the job/session config. That
being said, I think there should be a way to override the distribution mode in
a job. In Spark, we can do that with write options. I think the same applies to
cases when different query engines need different behavior. For example, Flink
may want to use `none` and Spark batch jobs may want `sort`. Then having a
default value in table props and the ability to override in a job should be
sufficient.
I support the idea of having one table property for all query engines as it
seems flexible enough to cover various cases (at least, for Flink and Spark).
If others have concerns, I'll be okay with engine-specific properties as well.
I also understand that not all query engines will support this property right
away.
@rdblue, thanks for the table! I think for Spark, if the table is unordered,
we would want to add a local sort based on partition columns in all modes with
a write option to disable inference. What do you think?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]