RajasekarSribalan opened a new issue #1766:
URL: https://github.com/apache/hudi/issues/1766
Hi,
I have a Spark batch job which pulls data from Hive and does a bulk insert into
a Hudi table. This is for the initial backfill.
Then I have a Spark Streaming job which reads data from Kafka and does an
upsert on the same Hudi table.
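For context, the two writes look roughly like this (a minimal sketch of the Hudi write options only; the table name, record key, and precombine field below are illustrative assumptions, not from my actual job):

```python
# Hypothetical Hudi writer options for the two jobs described above.
# Table/key/field names are assumptions for illustration.

# Backfill job: one-time bulk insert of the Hive snapshot.
bulk_insert_opts = {
    "hoodie.table.name": "events",                          # assumed table name
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.recordkey.field": "event_id",  # assumed record key
    "hoodie.datasource.write.precombine.field": "ts",       # assumed precombine field
}

# Streaming job: upserts the same table per micro-batch, so rows with
# matching record keys should be updated in place, not duplicated.
upsert_opts = dict(bulk_insert_opts,
                   **{"hoodie.datasource.write.operation": "upsert"})
```

In the actual jobs these options would be passed to the DataFrame writer, e.g. `df.write.format("hudi").options(**upsert_opts).mode("append").save(base_path)`.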
The issue is that when I query via Hive after every streaming commit (Spark
micro-batch), the row count grows by the size of the initial dataset each time.
For example,
Number of rows after bulk insert in Hudi = 50,000
Number of rows after streaming commit #1 = 100,000
Number of rows after streaming commit #2 = 150,000
and so on...
When I query via Hive, it should always read the latest commit and give me
the results. But instead it reads all the parquet files.
Please help me understand why the Hive query is behaving this way.
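To make the numbers above concrete, here is a small self-contained illustration (plain Python, not Hudi code, and the file/commit layout is an assumed simplification): each upsert commit rewrites a file group into a new parquet file version, so a commit-aware reader keeps only the latest file slice per file group, while a plain listing of parquet files sees every version.

```python
# Simplified model: (file_group, commit, rows) for each parquet file on disk.
# The bulk insert writes fg1 at commit c1; each streaming upsert rewrites it.
files = [
    ("fg1", "c1", 50_000),  # bulk insert
    ("fg1", "c2", 50_000),  # rewritten by streaming commit 1
    ("fg1", "c3", 50_000),  # rewritten by streaming commit 2
]

def snapshot_rows(files):
    """Rows a commit-aware reader returns: latest commit per file group."""
    latest = {}
    for group, commit, rows in files:
        if group not in latest or commit > latest[group][0]:
            latest[group] = (commit, rows)
    return sum(rows for _, rows in latest.values())

# Reading every parquet file counts all versions -- the growth I am seeing.
plain_parquet_rows = sum(rows for _, _, rows in files)
# A commit-aware read returns only the latest slice -- what I expected.
latest_commit_rows = snapshot_rows(files)
```

Here `plain_parquet_rows` is 150,000 (matching the count after streaming commit #2), while `latest_commit_rows` stays at 50,000, which is what I expect Hive to return.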
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]