n3nash commented on issue #2806: URL: https://github.com/apache/hudi/issues/2806#issuecomment-826576866
@tmac2100 Yes, that is expected: the more fileIds your updates touch, the higher the runtime. That said, a few other factors can govern this:

1. Are you setting your bloom filter configs correctly? Depending on the number of entries per file, you need to size the bloom filter large enough to avoid false positives. Check here -> https://github.com/apache/hudi/blob/3e4fa170cfd2c198599c3bed3982f2f643c7fbe8/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java#L47
2. You can increase the number of executors to run the bloom filter stages faster, or turn on dynamicAllocation in Spark with a min and max executor count, to let the bloom_index stage run faster.
3. Experiment with other index types, such as the SimpleIndex.

There is also a record level index implementation underway, which will ensure that job runtime does not grow with the number of file ids.
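For reference, the suggestions above might look roughly like the sketch below in a Spark datasource write. The config keys are standard Hudi/Spark options, but the values are illustrative placeholders to tune for your data, and `df` / `basePath` are assumed to be your upsert DataFrame and table path:

```scala
// Sketch only -- values are illustrative, not recommendations.
// Assumes `df` is the DataFrame of upserts and `basePath` is the Hudi table path.
df.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert").
  // 1. Size the bloom filter to the number of entries per file to limit
  //    false positives (see the linked HoodieIndexConfig for defaults).
  option("hoodie.index.bloom.num_entries", "600000").
  option("hoodie.bloom.index.filter.type", "DYNAMIC_V0").
  // 3. Or switch the index type entirely, e.g. "SIMPLE" instead of "BLOOM".
  option("hoodie.index.type", "BLOOM").
  // 2. Dynamic allocation is set on the Spark side, e.g.:
  //    spark-submit --conf spark.dynamicAllocation.enabled=true \
  //                 --conf spark.dynamicAllocation.minExecutors=2 \
  //                 --conf spark.dynamicAllocation.maxExecutors=50 ...
  mode("append").
  save(basePath)
```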
