kimberlyamandalu commented on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-815987341
I have a similar issue: bloom index performance is very slow for upserts into a Hudi MOR table. Does anyone know whether, when Hudi performs an upsert, it looks up the index only for the partitions being written to, or against the entire data set?

My table is partitioned by year and month, from 1998 to 2020. About 95% of my upserts go to recent partitions. I also notice many calls to build the file system view for older partitions that I know should not receive any upserts:

`AbstractTableFileSystemView: Building file system view for partition (message_year=2002/message_month=9)`

Obtain key ranges for file slices (range pruning=on)
collect at HoodieSparkEngineContext.java:73

org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:73)
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.loadInvolvedFiles(SparkHoodieBloomIndex.java:176)
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.lookupIndex(SparkHoodieBloomIndex.java:119)
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocation(SparkHoodieBloomIndex.java:84)
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocation(SparkHoodieBloomIndex.java:60)
org.apache.hudi.table.action.commit.AbstractWriteHelper.tag(AbstractWriteHelper.java:69)
org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:51)
org.apache.hudi.table.action.deltacommit.SparkUpsertDeltaCommitActionExecutor.execute(SparkUpsertDeltaCommitActionExecutor.java:46)
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:82)
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:74)
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:146)
org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:181)
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
