codejoyan edited a comment on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-791961553
Thanks @bvaradar and @nsivabalan. Please let me know how to improve the performance, or if you need any further details to investigate.

I used the configurations below (SIMPLE index and turned off compaction) to speed up the inserts and see much improvement:

```
hoodie.parquet.small.file.limit=0
hoodie.index.type=SIMPLE
```

But what are the downsides of not using the default Bloom index? In my use case I will have late-arriving data, so will performance suffer because of this choice?

I would also like to understand why these specific steps are taking time. From the Spark web UI, the execution of the methods below seems to take too long. Any insights into what is happening in the background, please?

```
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.findMatchingFilesForRecordKeys(SparkHoodieBloomIndex.java:266)
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocationBacktoRecords(SparkHoodieBloomIndex.java:287)
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:433)
```
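For reference, here is a minimal sketch of how I apply those two options in the Spark datasource write path. The table name, base path, and record-key/partition fields are placeholders, not the values from my actual job:

```scala
// Sketch only: `df` is the input DataFrame; table name, key fields,
// and base path below are placeholders.
df.write.format("hudi").
  option("hoodie.table.name", "my_table").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  // The two settings under discussion:
  option("hoodie.parquet.small.file.limit", "0").   // disable small-file handling
  option("hoodie.index.type", "SIMPLE").            // SIMPLE instead of the default BLOOM
  mode("append").
  save("/path/to/my_table")
```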
