Reo-LEI commented on a change in pull request #2898:
URL: https://github.com/apache/iceberg/pull/2898#discussion_r697882281
##########
File path: flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java
##########
@@ -369,6 +373,27 @@ private String operatorName(String suffix) {
default:
throw new RuntimeException("Unrecognized write.distribution-mode: "
+ writeMode);
}
+
+ if (keySelector != null) {
+ return input.keyBy(keySelector);
+ }
+ return input;
+ }
+
+ private KeySelector<RowData, String> getKeySelector(List<Integer>
equalityFieldIds, PartitionSpec partitionSpec,
+ Schema schema, RowType rowType) {
+ boolean hasPrimaryKey = equalityFieldIds != null &&
!equalityFieldIds.isEmpty();
+ boolean hasPartitionKey = partitionSpec != null &&
!partitionSpec.isUnpartitioned();
+
+ if (hasPrimaryKey && hasPartitionKey) {
+ return new CombinedKeySelector(partitionSpec, equalityFieldIds,
schema, rowType);
Review comment:
As far as I know, between equality key and partition key are two
independent fields, there is no restriction between them. So I add
`CombinedKeySelector` to maintain the independence between them. But that will
defeats the purpose of hash distribution as you say.
restrict enable hash distribution mode when `equalityFieldIds` is not null
is one way to reslove this problem, but I think should we restrict partition
key must be a subset of equality key to reslove this? Actualy, I have little
confuse between equality key and partition key.
I think partition key should be a subset of equality key, because when user
define the equality key in partitioned table that is mean one row should be
unique in each partition. Such as we got a table partitioned by <day, hour>, we
should ensure equality key contain partition key like <day, hour, id>, but not
like <id>. I think the later is confusing and make no sense.
Just like a MySQL table which primary key is <userId>, and sharding data by
<day, hour> to different table, and still require keep <userId> is unique in
global.
Once we can ensure partition key is a subset of equality key, we can remove
the `CombinedKeySelector` and use equality key sheffle data.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]