shengchiqu commented on issue #7229:
URL: https://github.com/apache/hudi/issues/7229#issuecomment-1384871989

   @yihua thanks. My problem is that when changelog.enabled is true, Flink 
incremental streaming reads work fine, but when Spark/Hive/Flink reads the 
Hudi table offline (batch read), the result contains duplicates: every update 
to a record is present and nothing is de-duplicated. Are incremental streaming 
reads and batch reads mutually exclusive, or can both be used on the same table?
   
   changelog.enabled=true => Flink incremental streaming read shows the correct 
CDC records; batch read returns duplicates
   ```shell
   +-------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+-------------------+--------------------------------+--------------------------------+-------------------------+
   |   C_CUSTKEY |                         C_NAME |                      C_ADDRESS |                    C_NATIONKEY |                        C_PHONE |         C_ACCTBAL |                   C_MKTSEGMENT |                      C_COMMENT |                      ts |
   +-------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+-------------------+--------------------------------+--------------------------------+-------------------------+
   |           1 |             Customer#000000001 |                              a |                              1 |                25-989-741-2988 |            711.56 |                       BUILDING | to the even, regular platel... | 2023-01-17 13:43:25.380 |
   |           1 |             Customer#000000001 |                              a |                             12 |                25-989-741-2988 |            711.56 |                       BUILDING | to the even, regular platel... | 2023-01-17 13:43:31.383 |
   |           1 |             Customer#000000001 |                              a |                              2 |                25-989-741-2988 |            711.56 |                       BUILDING | to the even, regular platel... | 2023-01-17 13:43:28.381 |
   |           1 |             Customer#000000001 |                              a |                              3 |                25-989-741-2988 |            711.56 |                       BUILDING | to the even, regular platel... | 2023-01-17 13:43:31.383 |
   +-------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+-------------------+--------------------------------+--------------------------------+-------------------------+
   ```
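
   To make clear what I expect from a batch read, here is a minimal Python 
sketch of the de-duplication I mean: one row per record key, keeping the row 
with the latest ts. This is not Hudi code. The function name `dedupe_latest`, 
the simplified rows, and the tie-breaking rule (last row seen wins) are all 
assumptions made only for illustration.

   ```python
   from datetime import datetime


   def dedupe_latest(rows, key="C_CUSTKEY", ordering="ts"):
       """Keep one row per key: the row with the greatest ordering value.

       Ties keep the row seen last; that is an arbitrary choice for this sketch.
       """
       latest = {}
       for row in rows:
           current = latest.get(row[key])
           if current is None or row[ordering] >= current[ordering]:
               latest[row[key]] = row
       return list(latest.values())


   # Hypothetical, simplified rows: three versions of the same customer key.
   rows = [
       {"C_CUSTKEY": 1, "C_NATIONKEY": 1, "ts": datetime(2023, 1, 17, 13, 43, 25)},
       {"C_CUSTKEY": 1, "C_NATIONKEY": 2, "ts": datetime(2023, 1, 17, 13, 43, 28)},
       {"C_CUSTKEY": 1, "C_NATIONKEY": 3, "ts": datetime(2023, 1, 17, 13, 43, 31)},
   ]

   result = dedupe_latest(rows)
   print(result)
   ```

   With these sample rows a single row for C_CUSTKEY 1 remains, the one with 
C_NATIONKEY 3, which is the kind of result I would expect the batch read to return.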
   
   changelog.enabled=false => Flink incremental streaming read fails with the 
error below; batch read returns no duplicates and the data is accurate
   ```shell
   Caused by: java.lang.IllegalStateException: Not expected to see delete records in this log-scan mode. Check Job Config
        at org.apache.hudi.common.table.log.HoodieUnMergedLogRecordScanner.processNextDeletedRecord(HoodieUnMergedLogRecordScanner.java:60)
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
        at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:473)
        at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:343)
        ... 10 more
   ```
   

