codejoyan edited a comment on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-791961553
Thanks @bvaradar and @nsivabalan. Please let me know how to improve the performance, or if you need any further details to investigate.

I used the configurations below (SIMPLE index and turned off compaction) to speed up the inserts and see much improvement:

```
hoodie.parquet.small.file.limit=0
hoodie.index.type=SIMPLE
```

But what are the downsides of not using the default Bloom index? In my use case I will have late-arriving data, so will performance suffer because of this choice?

I would also like to understand why these specific steps are taking time. From the Spark web UI, the execution of the methods below seems to take too long. Any insights into what is happening in the background, please?

```
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.findMatchingFilesForRecordKeys(SparkHoodieBloomIndex.java:266)
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocationBacktoRecords(SparkHoodieBloomIndex.java:287)
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:433)
```
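For reference, here is a minimal sketch of how I apply those two options in the Spark datasource write path. The table name, base path, and record-key/partition fields are placeholders, not the values from my actual job:

```scala
// Sketch only: `df` is the input DataFrame; table name, key fields,
// and base path below are placeholders.
df.write.format("hudi").
  option("hoodie.table.name", "my_table").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  // The two settings under discussion:
  option("hoodie.parquet.small.file.limit", "0").   // disable small-file handling
  option("hoodie.index.type", "SIMPLE").            // SIMPLE instead of the default BLOOM
  mode("append").
  save("/path/to/my_table")
```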
