RajasekarSribalan opened a new issue #1766:
URL: https://github.com/apache/hudi/issues/1766


   Hi,
   
   I have a Spark batch job which pulls data from Hive and does a bulk insert into a 
Hudi table. This is for the initial back fill.
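   
   For reference, a minimal sketch of what such a back-fill job can look like (not the exact code from this report; the source table `source_db.events`, record key `uuid`, precombine field `ts`, and base path `/data/hudi/events_table` are all assumed names):
   
   ```scala
   // Hedged sketch of the initial back-fill: read from Hive, bulk_insert into Hudi.
   // All table names, fields and paths below are assumptions, not from this report.
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   val spark = SparkSession.builder()
     .appName("hudi-backfill")
     .enableHiveSupport()
     .getOrCreate()
   
   // Pull the full history from the Hive source table for the one-time load.
   val sourceDf = spark.sql("SELECT * FROM source_db.events")
   
   // Write it once with the bulk_insert operation into the Hudi base path.
   sourceDf.write.format("org.apache.hudi")
     .option("hoodie.table.name", "events_table")
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.datasource.write.recordkey.field", "uuid")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .mode(SaveMode.Overwrite)
     .save("/data/hudi/events_table")
   ```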
   
   Then I have a Spark streaming job which reads data from Kafka and does an 
upsert on the same Hudi table.
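   
   A corresponding sketch of the streaming side (again with assumed names: Kafka topic `events`, JSON values carrying the same `uuid`/`ts` fields, checkpoint path `/checkpoints/events`):
   
   ```scala
   // Hedged sketch of the streaming upsert: read from Kafka, upsert each
   // micro-batch into the same Hudi base path. Names and fields are assumptions.
   import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
   import org.apache.spark.sql.functions.{col, from_json}
   import org.apache.spark.sql.types.{LongType, StringType, StructType}
   
   val spark = SparkSession.builder().appName("hudi-streaming-upsert").getOrCreate()
   
   // Hypothetical record schema matching the back-filled data.
   val schema = new StructType().add("uuid", StringType).add("ts", LongType)
   
   val kafkaDf = spark.readStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "broker:9092")
     .option("subscribe", "events")
     .load()
   
   val parsed = kafkaDf
     .selectExpr("CAST(value AS STRING) AS json")
     .select(from_json(col("json"), schema).as("data"))
     .select("data.*")
   
   // Each micro-batch is written with operation=upsert, so records whose
   // record key already exists should be updated in place, not appended.
   parsed.writeStream
     .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
       batchDf.write.format("org.apache.hudi")
         .option("hoodie.table.name", "events_table")
         .option("hoodie.datasource.write.operation", "upsert")
         .option("hoodie.datasource.write.recordkey.field", "uuid")
         .option("hoodie.datasource.write.precombine.field", "ts")
         .mode(SaveMode.Append)
         .save("/data/hudi/events_table")
     }
     .option("checkpointLocation", "/checkpoints/events")
     .start()
     .awaitTermination()
   ```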
   
   The issue is that when I query via Hive after every streaming commit (Spark 
micro-batch), the number of rows keeps increasing.
   
   For example,
   
   Number of rows after bulk insert in Hudi = 50,000
   Number of rows after streaming commit @1 = 100,000
   Number of rows after streaming commit @2 = 150,000
   
   and so on...
    
   When I query via Hive, it should read only the latest commit and give me 
the results, but it appears to read all of the parquet files instead.
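   
   For context, the table has to be registered in Hive for these queries; one common way is Hudi's Hive sync options on the writer. A hedged sketch (database, table and HiveServer2 URL are hypothetical) extending the writer call from the streaming sketch above:
   
   ```scala
   // With Hive sync enabled, the Hive table is registered with Hudi's
   // HoodieParquetInputFormat, which is what restricts a Hive query to the
   // latest committed file slices. All names below are assumptions.
   batchDf.write.format("org.apache.hudi")
     .option("hoodie.table.name", "events_table")
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.datasource.write.recordkey.field", "uuid")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.datasource.hive_sync.enable", "true")
     .option("hoodie.datasource.hive_sync.database", "default")
     .option("hoodie.datasource.hive_sync.table", "events_table")
     .option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://hiveserver:10000")
     .mode(SaveMode.Append)
     .save("/data/hudi/events_table")
   ```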
   
   Please help me understand why the Hive query is behaving this way.
   

