SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101042060


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   @hbgstc123, the reason for the above suggestion is that `execSelectSql` 
starts a Flink job to collect the data of table t1, which is useful for 
streaming reads, while batch reading only needs `streamTableEnv.sqlQuery` to 
fetch the data of table t1. Otherwise the IT case would fail. You could run 
this IT case locally to verify.
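   A minimal sketch of the batch-read pattern being suggested, for illustration only. It assumes the surrounding test context: `streamTableEnv`, `query`, and `expected` as defined in the snippet above, and the `assertRowsEquals` helper used elsewhere in `ITTestHoodieDataSource`. This is not the exact code of the change.

```java
// Assumed imports for this fragment (not shown in the diff):
//   org.apache.flink.types.Row
//   org.apache.flink.util.CollectionUtil
//   java.util.List

// Batch read: sqlQuery(...) plans the query, and execute().collect()
// pulls the result rows as one bounded job, instead of execSelectSql,
// which starts a streaming collect job.
List<Row> rows = CollectionUtil.iterableToList(
    () -> streamTableEnv.sqlQuery(query).execute().collect());
assertRowsEquals(rows, expected);
```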



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
