xiearthur opened a new issue, #12660:
URL: https://github.com/apache/hudi/issues/12660
**Describe the problem you faced**
When using Flink to read a Hudi COW table in streaming mode, the value of
READ_START_COMMIT leads to different behaviors:
- With "earliest": the job continuously reads both historical and new data
- With a specific timestamp: the job only reads data committed before the
Flink job started, and misses new data written after that
**To Reproduce**
```java
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.hudi.common.model.HoodieTableType;
import org.apache.hudi.configuration.FlinkOptions;
import org.apache.hudi.util.HoodiePipeline;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Map<String, String> options = new HashMap<>();
options.put(FlinkOptions.PATH.key(), basePath + tableName);
options.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.COPY_ON_WRITE.name());
options.put(FlinkOptions.READ_AS_STREAMING.key(), "true");

// Case 1: works for continuous streaming but reads all history
options.put(FlinkOptions.READ_START_COMMIT.key(), "earliest");
// Case 2: only reads data up to job start time
// options.put(FlinkOptions.READ_START_COMMIT.key(), "20240116000000");

HoodiePipeline.Builder builder = HoodiePipeline.builder(tableName)
    .options(options);
DataStream<RowData> rowDataDS = builder.source(env);
```
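For comparison, the same streaming reader can also be expressed in Flink SQL. This is only a sketch: the table name, schema, and path below are placeholders, not values from the actual job.

```sql
-- Sketch of an equivalent Flink SQL reader; table name, columns,
-- and path are placeholders for illustration.
CREATE TABLE hudi_source (
  id STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///path/to/table',
  'table.type' = 'COPY_ON_WRITE',
  'read.streaming.enabled' = 'true',
  -- same two cases as the Java snippet above:
  -- 'read.start-commit' = 'earliest'
  'read.start-commit' = '20240116000000'
);
```

Reproducing the issue through the SQL connector would help narrow down whether the behavior is specific to the HoodiePipeline API or to the streaming source itself.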
**Expected behavior**
With a specific READ_START_COMMIT timestamp, the streaming job should:
1. Start reading from the specified commit timestamp
2. Continue receiving new data written after the job starts
**Environment Description**
* Hudi version: 0.14.0
* Flink version: 1.16.0
* Hadoop version: 3.1.0
* Storage: HDFS