[
https://issues.apache.org/jira/browse/HUDI-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444218#comment-17444218
]
Ethan Guo commented on HUDI-2745:
---------------------------------
I checked the {{MergeOnReadSnapshotRelation}} and the file index built for the
kafka-connect case. It looks like the file slices containing the log files
written after the pending compaction are missing. An issue has already been
filed for exactly this, and it is not limited to kafka-connect: HUDI-2480. The
clustering count mismatch is likely due to this as well.
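
To illustrate why dropping those file slices would show up as a count
mismatch, here is a toy sketch. It is not Hudi code: {{FileSlice}} below is a
hypothetical stand-in class, the instants and record counts are made up, and
it assumes insert-only log files so that record counts simply add up. It also
assumes, as I understand the layout, that deltacommits written after a
compaction is scheduled land in a new log-only file slice whose base instant
is the compaction instant.

```scala
// Toy model only: hypothetical stand-in, not Hudi's FileSlice class.
final case class FileSlice(baseInstant: String, baseRecords: Int, logRecords: Seq[Int])

object FileSliceSketch {
  // One file group. Slice "001" has a base file plus two log files.
  // Slice "004" is log-only: it was opened when compaction "004" was
  // scheduled, and the compaction has not been executed yet.
  val slices: Seq[FileSlice] = Seq(
    FileSlice("001", baseRecords = 100, logRecords = Seq(10, 5)),
    FileSlice("004", baseRecords = 0, logRecords = Seq(7, 3))
  )

  // Count over every slice (insert-only assumption: counts are additive).
  def countAll(s: Seq[FileSlice]): Int =
    s.map(x => x.baseRecords + x.logRecords.sum).sum

  // Count when slices at or after the pending compaction instant are dropped,
  // which models the missing file slices described above.
  def countIgnoringSlicesAfter(s: Seq[FileSlice], pendingCompactionInstant: String): Int =
    countAll(s.filter(_.baseInstant < pendingCompactionInstant))
}
```

In this sketch {{countAll(slices)}} covers all the input records, while
{{countIgnoringSlicesAfter(slices, "004")}} is short by the records in the
log-only slice. Once the compaction is executed, slice "004" gets a base file
and would be picked up again, which matches the count becoming correct after
compaction.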
> Record count does not match input after compaction is scheduled when running
> Hudi Kafka Connect sink
> ----------------------------------------------------------------------------------------------------
>
> Key: HUDI-2745
> URL: https://issues.apache.org/jira/browse/HUDI-2745
> Project: Apache Hudi
> Issue Type: Bug
> Components: Compaction
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.10.0
>
>
> Spark Shell command to do snapshot query:
> {code:java}
> val basePath = "/tmp/hoodie/hudi-test-topic"
> val df = spark.read.format("hudi").load(basePath)
> df.createOrReplaceTempView("hudi_test_table")
> spark.sql("select count(*) from hudi_test_table").show()
> {code}
> Two cases of count mismatch:
> (1) Compaction scheduled, more deltacommits later on: the count does not
> match the input size. After compaction is executed, the count becomes correct.
> (2) Clustering scheduled, more deltacommits later on: the count is correct,
> equal to the input size. After clustering is executed, the count drops and
> becomes incorrect.