ranjani1993 opened a new issue, #6776: URL: https://github.com/apache/hudi/issues/6776
Hi Team,

We are trying to implement HUDI for one of the workflows in our project. The problem we are facing is that the source does not give us only the updated/changed records. We get the entire dataset (unchanged + updated + new records) from the source.

Example:
Source table has 1 billion records per partition.
Our target HUDI table has 1 billion records per partition.
Out of those 1 billion records in the source, a few records got updated. We don't know which records were updated.

So when we perform a HUDI upsert of the 1 billion source records against the 1 billion records in the target, HUDI takes longer than a regular overwrite (in which we overwrite the entire partition in the target table).

We tried to optimise by changing the index type to SIMPLE and by tuning other parallelism configs and Spark configs, but we could not achieve the expected result.

We just wanted to check whether HUDI would be suitable for our use case.
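For reference, the kind of upsert configuration described above might look roughly like the sketch below. It only assembles the write options as a plain dictionary. The table name, key fields, and parallelism value are hypothetical placeholders, since the issue does not state the actual settings used, and the helper `build_hudi_upsert_options` is an illustrative name, not part of Hudi.

```python
from typing import Dict


def build_hudi_upsert_options(
    table_name: str,
    record_key: str,
    partition_field: str,
    precombine_field: str,
    parallelism: int = 200,
) -> Dict[str, str]:
    """Return Hudi write options for an upsert using the SIMPLE index.

    All argument values are placeholders; the issue does not state them.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Index type mentioned in the issue.
        "hoodie.index.type": "SIMPLE",
        # Parallelism knobs; 200 is an arbitrary illustrative value.
        "hoodie.upsert.shuffle.parallelism": str(parallelism),
        "hoodie.simple.index.parallelism": str(parallelism),
    }


# Hypothetical table and column names, for illustration only.
options = build_hudi_upsert_options(
    table_name="target_table",
    record_key="id",
    partition_field="partition_date",
    precombine_field="updated_at",
)

# With a SparkSession and the Hudi bundle available, these options would
# typically be passed to the writer like this (not executed here):
#   source_df.write.format("hudi").options(**options).mode("append").save(target_path)
```

With these settings every incoming record is still looked up against the existing table, so a full 1-billion-row snapshot per partition is processed on each run regardless of how few rows actually changed.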
