RajasekarSribalan opened a new issue #1766: URL: https://github.com/apache/hudi/issues/1766
Hi, I have a Spark batch job which pulls data from Hive and does a bulk insert into a Hudi table; this is for the initial backfill. I then have a Spark streaming job which reads data from Kafka and does an upsert on the same Hudi table.

The issue is that when I query via Hive after every streaming commit (Spark micro-batch), the number of rows keeps increasing as if the data were being duplicated. For example:

- Number of rows after the bulk insert into Hudi = 50,000
- Number of rows after streaming commit #1 = 100,000
- Number of rows after streaming commit #2 = 150,000

and so on.

When I query via Hive, it should always read only the latest commit and give me those results, but it always reads the entire set of parquet files. Please help me understand why the Hive query is behaving so strangely.
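For reference, a minimal sketch of the pipeline being described, assuming Spark 2.4+ with the hudi-spark bundle on the classpath. The table path, record key / precombine / partition field names, and the Kafka source details are placeholders, not values from this issue:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object HudiBackfillAndStreamSketch {
  val basePath = "/data/hudi/my_table"                             // hypothetical table path
  val hudiOptions = Map(
    "hoodie.table.name"                           -> "my_table",
    "hoodie.datasource.write.recordkey.field"     -> "id",         // placeholder key field
    "hoodie.datasource.write.precombine.field"    -> "ts",         // placeholder precombine field
    "hoodie.datasource.write.partitionpath.field" -> "dt"          // placeholder partition field
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-backfill-and-stream")
      .enableHiveSupport()
      .getOrCreate()

    // 1) Initial backfill: read the source table from Hive and write it with bulk_insert.
    val backfillDf = spark.table("source_db.source_table")         // hypothetical Hive source
    backfillDf.write.format("hudi")
      .options(hudiOptions)
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .mode(SaveMode.Overwrite)
      .save(basePath)

    // 2) Streaming job: upsert each Kafka micro-batch into the same table,
    //    producing one Hudi commit per micro-batch.
    val kafkaDf = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")            // placeholder brokers
      .option("subscribe", "events")                               // placeholder topic
      .load()

    kafkaDf.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        val upsertDf = decodeKafkaValue(batch)                     // stand-in for the real deserialization
        upsertDf.write.format("hudi")
          .options(hudiOptions)
          .option("hoodie.datasource.write.operation", "upsert")
          .mode(SaveMode.Append)
          .save(basePath)
      }
      .start()
      .awaitTermination()
  }

  // Placeholder: the real job would parse the Kafka value payload into table columns.
  def decodeKafkaValue(df: DataFrame): DataFrame = df
}
```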