Re: [PR] [flink] Introduce Range Partition And Sort in Append Scalable Table Batch Writing for Flink [paimon]

via GitHub Mon, 27 May 2024 23:16:45 -0700


WencongLiu commented on code in PR #3384:
URL: https://github.com/apache/paimon/pull/3384#discussion_r1616650922



##########
paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/sink/FlinkSinkBuilder.java:
##########
@@ -119,8 +133,86 @@ public FlinkSinkBuilder inputBounded(boolean bounded) {
         return this;
     }
 
+    /** Set the table sort info. */
+    public FlinkSinkBuilder setTableSortInfo(
+            String sortColumnsString,
+            String sortStrategy,
+            boolean sortInCluster,
+            int sampleFactor) {
+        // 1. The table sort will be ignored if the sort columns are not specified.
+        if (sortColumnsString == null || sortColumnsString.isEmpty()) {
+            return this;
+        }
+        // 2. Check the table type.
+        checkState(
+                table.bucketMode().equals(BUCKET_UNAWARE),
+                "Clustering only supports bucket unaware table without primary keys.");
+        // 3. Check the sort columns.
+        List<String> sortColumns = Arrays.asList(sortColumnsString.split(","));
+        List<String> fieldNames = table.schema().fieldNames();
+        checkState(
+                new HashSet<>(fieldNames).containsAll(new HashSet<>(sortColumns)),
+                String.format(
+                        "Field names %s should contains all clustering column names %s.",
+                        fieldNames, sortColumns));
+        // 4. Check the execution mode.
+        checkState(input != null, "The input stream should be specified earlier.");
+        if (boundedInput == null) {
+            boundedInput = !FlinkSink.isStreaming(input);
+        }
+        checkState(boundedInput, "The clustering should be executed under batch mode.");

Review Comment:
   @JingsongLi 
   Good point, ignoring clustering in stream mode is a sensible design. This 
avoids the need for users to manually adjust the table configuration under 
streaming mode.
   
   @xintongsong  
   1. I've removed the check and added a related warning log.
   2. I've added the table-type and batch-mode limitations to the configuration description. I've also added a commit that documents the clustering feature in the Paimon docs.
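
For illustration only, the behavior described in this reply (skip clustering with a warning in streaming mode instead of failing a `checkState`) might look roughly like the sketch below. It is not the code from the PR: the class `ClusteringCheckSketch`, the method `shouldCluster`, the warning text, and the use of `java.util.logging` are all hypothetical, and the bucket-mode check from the diff is left out because it needs Paimon's own types.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical stand-in for the builder logic; not Paimon's actual implementation. */
class ClusteringCheckSketch {

    private static final Logger LOG =
            Logger.getLogger(ClusteringCheckSketch.class.getName());

    /**
     * Decides whether clustering should be applied.
     *
     * @param sortColumnsString comma-separated sort columns, may be null or empty
     * @param fieldNames field names of the table schema
     * @param boundedInput true if the job runs in batch mode
     * @return true if clustering applies, false if it is skipped
     */
    static boolean shouldCluster(
            String sortColumnsString, List<String> fieldNames, boolean boundedInput) {
        // 1. No sort columns configured: nothing to do.
        if (sortColumnsString == null || sortColumnsString.isEmpty()) {
            return false;
        }
        // 2. A misconfigured column name is still treated as an error.
        List<String> sortColumns = Arrays.asList(sortColumnsString.split(","));
        if (!new HashSet<>(fieldNames).containsAll(sortColumns)) {
            throw new IllegalStateException(
                    String.format(
                            "Field names %s should contain all clustering column names %s.",
                            fieldNames, sortColumns));
        }
        // 3. Streaming mode: warn and skip instead of failing the job
        //    (assumed ordering and wording; the PR may differ).
        if (!boundedInput) {
            LOG.warning("Clustering is only supported in batch mode; ignoring the sort columns.");
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("a", "b", "c");
        System.out.println(shouldCluster("a,b", fields, true));
        System.out.println(shouldCluster("a,b", fields, false));
    }
}
```

With this shape, a table that has sort columns configured can be written by both batch and streaming jobs without changing the table options; only the batch job is clustered.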
   
   


