shengchiqu commented on issue #7229:
URL: https://github.com/apache/hudi/issues/7229#issuecomment-1384871989
@yihua Thanks. My problem is that when `changelog.enabled` is true, Flink incremental streaming reads work fine, but batch (offline) reads of the Hudi table from Spark/Hive/Flink return duplicates: every update version is present and nothing is de-duplicated. Are incremental streaming reads and batch reads mutually exclusive, or can I use both?
`changelog.enabled=true` => Flink incremental streaming shows the correct CDC; the batch read contains duplicates:
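For context, the table was declared roughly as follows. This is a reconstructed sketch, not the exact DDL: the table name, path, and primary key are assumptions inferred from the query output below; only `changelog.enabled` is the option under discussion.

```sql
-- Hypothetical Flink SQL DDL reconstructing the setup (names/path assumed)
CREATE TABLE customer (
  C_CUSTKEY    BIGINT,
  C_NAME       STRING,
  C_ADDRESS    STRING,
  C_NATIONKEY  BIGINT,
  C_PHONE      STRING,
  C_ACCTBAL    DOUBLE,
  C_MKTSEGMENT STRING,
  C_COMMENT    STRING,
  ts           TIMESTAMP(3),
  PRIMARY KEY (C_CUSTKEY) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/customer',   -- assumed path
  'table.type' = 'MERGE_ON_READ',    -- changelog mode applies to MOR tables
  'changelog.enabled' = 'true'       -- the option this issue is about
);
```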
```shell
+-----------+--------------------+-----------+-------------+-----------------+-----------+--------------+--------------------------------+-------------------------+
| C_CUSTKEY | C_NAME             | C_ADDRESS | C_NATIONKEY | C_PHONE         | C_ACCTBAL | C_MKTSEGMENT | C_COMMENT                      | ts                      |
+-----------+--------------------+-----------+-------------+-----------------+-----------+--------------+--------------------------------+-------------------------+
| 1         | Customer#000000001 | a         | 1           | 25-989-741-2988 | 711.56    | BUILDING     | to the even, regular platel... | 2023-01-17 13:43:25.380 |
| 1         | Customer#000000001 | a         | 12          | 25-989-741-2988 | 711.56    | BUILDING     | to the even, regular platel... | 2023-01-17 13:43:31.383 |
| 1         | Customer#000000001 | a         | 2           | 25-989-741-2988 | 711.56    | BUILDING     | to the even, regular platel... | 2023-01-17 13:43:28.381 |
| 1         | Customer#000000001 | a         | 3           | 25-989-741-2988 | 711.56    | BUILDING     | to the even, regular platel... | 2023-01-17 13:43:31.383 |
+-----------+--------------------+-----------+-------------+-----------------+-----------+--------------+--------------------------------+-------------------------+
```
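As a stopgap while keeping `changelog.enabled=true`, a batch reader can de-duplicate manually by keeping the latest version per key ordered by `ts`. This is a generic SQL window-function sketch, not a Hudi feature; the table and column names are taken from the output above. Note that rows tied on `ts` (as in the output above) would still need a secondary ordering column to pick a deterministic winner.

```sql
-- Keep only the most recent version of each key (generic SQL sketch)
SELECT C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE,
       C_ACCTBAL, C_MKTSEGMENT, C_COMMENT, ts
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY C_CUSTKEY ORDER BY ts DESC) AS rn
  FROM customer t
) deduped
WHERE rn = 1;
```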
`changelog.enabled=false` => Flink incremental streaming fails with the error below; the batch read is de-duplicated and the data is accurate:
```shell
Caused by: java.lang.IllegalStateException: Not expected to see delete records in this log-scan mode. Check Job Config
    at org.apache.hudi.common.table.log.HoodieUnMergedLogRecordScanner.processNextDeletedRecord(HoodieUnMergedLogRecordScanner.java:60)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:473)
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:343)
    ... 10 more
```