n3nash commented on issue #2806: URL: https://github.com/apache/hudi/issues/2806#issuecomment-826576866
@tmac2100 Yes, that is expected: the more fileIds your updates touch, the higher the runtime. That said, a few other factors can govern this:

1. Are you setting your bloom filter configs correctly? Depending on the number of entries per file, you need to size the bloom filter large enough to avoid false positives. Check here -> https://github.com/apache/hudi/blob/3e4fa170cfd2c198599c3bed3982f2f643c7fbe8/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java#L47
2. You can increase the number of executors to run the bloom filter stages faster, or turn on dynamicAllocation in Spark with a min and max executor count, to let the bloom_index stage run faster.
3. Experiment with other index types, such as the SimpleIndex.

There is also a record level index implementation underway, which will ensure that job runtime does not grow with the number of file ids.
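For reference, the suggestions above might look roughly like the sketch below in a Spark datasource write. The config keys are standard Hudi/Spark options, but the values are illustrative placeholders to tune for your data, and `df` / `basePath` are assumed to be your upsert DataFrame and table path:

```scala
// Sketch only -- values are illustrative, not recommendations.
// Assumes `df` is the DataFrame of upserts and `basePath` is the Hudi table path.
df.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert").
  // 1. Size the bloom filter to the number of entries per file to limit
  //    false positives (see the linked HoodieIndexConfig for defaults).
  option("hoodie.index.bloom.num_entries", "600000").
  option("hoodie.bloom.index.filter.type", "DYNAMIC_V0").
  // 3. Or switch the index type entirely, e.g. "SIMPLE" instead of "BLOOM".
  option("hoodie.index.type", "BLOOM").
  // 2. Dynamic allocation is set on the Spark side, e.g.:
  //    spark-submit --conf spark.dynamicAllocation.enabled=true \
  //                 --conf spark.dynamicAllocation.minExecutors=2 \
  //                 --conf spark.dynamicAllocation.maxExecutors=50 ...
  mode("append").
  save(basePath)
```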
