Reo-LEI commented on a change in pull request #2898:
URL: https://github.com/apache/iceberg/pull/2898#discussion_r697882281



##########
File path: flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java
##########
@@ -369,6 +373,27 @@ private String operatorName(String suffix) {
         default:
           throw new RuntimeException("Unrecognized write.distribution-mode: " 
+ writeMode);
       }
+
+      if (keySelector != null) {
+        return input.keyBy(keySelector);
+      }
+      return input;
+    }
+
+    private KeySelector<RowData, String> getKeySelector(List<Integer> 
equalityFieldIds, PartitionSpec partitionSpec,
+        Schema schema, RowType rowType) {
+      boolean hasPrimaryKey = equalityFieldIds != null && 
!equalityFieldIds.isEmpty();
+      boolean hasPartitionKey = partitionSpec != null && 
!partitionSpec.isUnpartitioned();
+
+      if (hasPrimaryKey && hasPartitionKey) {
+        return new CombinedKeySelector(partitionSpec, equalityFieldIds, 
schema, rowType);

Review comment:
        As far as I know, between equality key and partition key are two 
independent fields, there is no restriction between them. So I add 
`CombinedKeySelector` to maintain the independence between them. But that will 
defeats the purpose of hash distribution as you say. 
   
   Restrict enable hash distribution mode when `equalityFieldIds` is not null 
is one way to reslove this problem, but should we restrict partition key must 
be a subset of equality key to reslove this? 
   
   Actualy, I have little confuse between equality key and partition key. I 
think partition key should be a subset of equality key, because when user 
define the equality key in partitioned table that is mean one row should be 
unique in each partition. Such as we got a table partitioned by <day, hour>, we 
should ensure equality key contain partition key like <day, hour, id>, but not 
like <id>. I think the later is confusing and make no sense.
   Just like a MySQL table which primary key is <userId>, and sharding data by 
<day, hour> to different table, and still require keep <userId> is unique in 
global.
   
   Once we can ensure partition key is a subset of equality key, we can remove 
the `CombinedKeySelector` and use equality key sheffle data.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to