[GitHub] [hudi] kimberlyamandalu commented on issue #2620: [SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes

GitBox Thu, 08 Apr 2021 10:02:47 -0700


kimberlyamandalu commented on issue #2620:
URL: https://github.com/apache/hudi/issues/2620#issuecomment-815987341



   I have a similar issue: bloom index performance is very slow for upserts 
into a Hudi MOR table.
   Does anyone know whether, when Hudi performs an upsert, it looks up the 
index only for the affected partitions or against the entire data set? My 
table is partitioned by year and month, from 1998 to 2020, and about 95% of my 
upserts go to recent partitions. I also notice many calls to build the file 
system view for older partitions that I know should not receive any upserts:
   
   `AbstractTableFileSystemView: Building file system view for partition (message_year=2002/message_month=9)`
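
   One way to check the assumption that older partitions receive no upserts 
is to count records per partition path in the incoming batch before writing. 
The following is a minimal sketch in plain Python, assuming each record is a 
dict with `message_year` and `message_month` keys; the helper names are 
hypothetical and not part of Hudi.

```python
from collections import Counter


def partition_path(record):
    # Assumed layout, matching the partition string in the log line above.
    return f"message_year={record['message_year']}/message_month={record['message_month']}"


def partition_histogram(records):
    # Count how many incoming records fall into each partition path.
    return Counter(partition_path(r) for r in records)


def share_in_recent(hist, min_year):
    # Fraction of records whose partition year is >= min_year.
    total = sum(hist.values())
    if total == 0:
        return 0.0
    recent = sum(
        n for path, n in hist.items()
        if int(path.split("/")[0].split("=")[1]) >= min_year
    )
    return recent / total


# Illustrative batch only; not data from the table in question.
batch = [
    {"message_year": 2020, "message_month": 11},
    {"message_year": 2020, "message_month": 12},
    {"message_year": 2020, "message_month": 12},
    {"message_year": 2002, "message_month": 9},
]
hist = partition_histogram(batch)
for path, count in sorted(hist.items()):
    print(path, count)
```

   If an old partition such as `message_year=2002/message_month=9` shows up in 
the histogram, at least some incoming records carry that partition path, which 
would be worth ruling out before looking at the index itself.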
   
   
![image](https://user-images.githubusercontent.com/25435575/114066282-90027400-9869-11eb-8828-f9615f828d7e.png)
   
   
   Obtain key ranges for file slices (range pruning=on)
   collect at HoodieSparkEngineContext.java:73
   org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
   org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:73)
   org.apache.hudi.index.bloom.SparkHoodieBloomIndex.loadInvolvedFiles(SparkHoodieBloomIndex.java:176)
   org.apache.hudi.index.bloom.SparkHoodieBloomIndex.lookupIndex(SparkHoodieBloomIndex.java:119)
   org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocation(SparkHoodieBloomIndex.java:84)
   org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocation(SparkHoodieBloomIndex.java:60)
   org.apache.hudi.table.action.commit.AbstractWriteHelper.tag(AbstractWriteHelper.java:69)
   org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:51)
   org.apache.hudi.table.action.deltacommit.SparkUpsertDeltaCommitActionExecutor.execute(SparkUpsertDeltaCommitActionExecutor.java:46)
   org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:82)
   org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:74)
   org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:146)
   org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)
   org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:181)
   org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
   org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
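
   The stage name in the trace above mentions range pruning, which corresponds 
to a bloom index write option. Below is a sketch of the index-related options 
that may be worth experimenting with, assuming the Spark datasource writer. 
The keys `hoodie.index.type` and `hoodie.bloom.index.prune.by.ranges` are Hudi 
write configs; the helper function and the values chosen are illustrative 
only, and whether they help in this case is untested.

```python
def bloom_index_options(index_type="BLOOM", prune_by_ranges=True):
    # Hypothetical helper: builds a dict of Hudi write options as strings.
    # "BLOOM" is the partition-scoped bloom index; "GLOBAL_BLOOM" is the
    # table-wide variant. Values here are for experimentation, not advice.
    return {
        "hoodie.index.type": index_type,
        "hoodie.bloom.index.prune.by.ranges": str(prune_by_ranges).lower(),
    }


opts = bloom_index_options()

# With PySpark available, the dict could be passed to the writer, e.g.:
#   df.write.format("hudi").options(**opts) ...
```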


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
