ranjani1993 opened a new issue, #6776:
URL: https://github.com/apache/hudi/issues/6776

   Hi Team,
   
   We are trying to implement HUDI for one of the workflows in our project.
   
   The problem we are facing is that the source does not give us only the 
updated/changed records. We receive the entire dataset (unchanged + updated + 
new records) from the source.
   
   Example:
   
   Source table has 1 billion records per partition
   Our target HUDI table has 1 billion records per partition
   
   Out of those 1 billion records in the source, only a few records were 
updated, and we don't know which ones.
   
   So when we perform a HUDI upsert of the 1 billion source records against 
the 1 billion records in the target, HUDI takes longer than a regular 
overwrite (where we overwrite the entire partition in the target table).
   
   We tried to optimise this by changing the index type to SIMPLE and by 
tuning other parallelism and Spark configs, but we could not achieve the 
expected result.
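   
   For illustration, a minimal sketch of the kind of writer options this 
refers to. The table name, field names and parallelism value are placeholders 
(assumptions, not our actual settings), and the helper function is 
hypothetical. Only the `hoodie.*` keys are Hudi config names.

```python
def build_upsert_options(table_name, record_key, partition_field,
                         precombine_field, parallelism=200):
    """Build a dict of Hudi writer options for an upsert using the SIMPLE index.

    All argument values are illustrative placeholders.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Index type switched to SIMPLE, as described above.
        "hoodie.index.type": "SIMPLE",
        # Parallelism knobs; the value 200 is an arbitrary example.
        "hoodie.simple.index.parallelism": str(parallelism),
        "hoodie.upsert.shuffle.parallelism": str(parallelism),
    }


# Placeholder names, chosen only for this sketch.
opts = build_upsert_options(
    table_name="target_table",
    record_key="id",
    partition_field="partition_col",
    precombine_field="updated_at",
)

# With a Spark DataFrame `df` and a table path `base_path`, the options
# would typically be passed to the writer like this:
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
for key in sorted(opts):
    print(f"{key}={opts[key]}")
```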
   
   Just wanted to check whether HUDI would be suitable for our use case.

