nsivabalan commented on code in PR #9188:
URL: https://github.com/apache/hudi/pull/9188#discussion_r1264147849
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java:
##########
@@ -103,11 +102,6 @@ record -> new ImmutablePair<>(record.getPartitionPath(),
record.getRecordKey()))
// Step 3: Tag the incoming records, as inserts or updates, by joining
with existing record keys
HoodieData<HoodieRecord<R>> taggedRecords =
tagLocationBacktoRecords(keyFilenamePairs, records, hoodieTable);
- if (config.getBloomIndexUseCaching()) {
Review Comment:
I guess this was intentional. After this, taggedRecords is what is getting
used. and we do cache that in BaseSparkCommitActionExecutor.execute
```
@Override
public HoodieWriteMetadata<HoodieData<WriteStatus>>
execute(HoodieData<HoodieRecord<T>> inputRecords) {
// Cache the tagged records, so we don't end up computing both
JavaRDD<HoodieRecord<T>> inputRDD =
HoodieJavaRDD.getJavaRDD(inputRecords);
if (inputRDD.getStorageLevel() == StorageLevel.NONE()) {
HoodieJavaRDD.of(inputRDD).persist(config.getTaggedRecordStorageLevel(),
context, HoodieDataCacheKey.of(config.getBasePath(), instantTime));
} else {
LOG.info("RDD PreppedRecords was persisted at: " +
inputRDD.getStorageLevel());
}
.
.
```
So, not sure if we want to keep the persistance until the very end for these
rdds which may not be used only.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##########
@@ -80,7 +81,7 @@ public O updateLocation(O writeStatuses, HoodieEngineContext
context,
@PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
public abstract <R> HoodieData<HoodieRecord<R>> tagLocation(
HoodieData<HoodieRecord<R>> records, HoodieEngineContext context,
- HoodieTable hoodieTable) throws HoodieIndexException;
+ HoodieTable hoodieTable, Option<String> instantTime) throws
HoodieIndexException;
Review Comment:
this is a public api. we might have to deprecate and add a new one if we
wish to change the signature
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java:
##########
@@ -75,11 +74,11 @@ public HoodieBloomIndex(HoodieWriteConfig config,
BaseHoodieBloomIndexHelper blo
@Override
public <R> HoodieData<HoodieRecord<R>> tagLocation(
HoodieData<HoodieRecord<R>> records, HoodieEngineContext context,
- HoodieTable hoodieTable) {
+ HoodieTable hoodieTable, Option<String> instantTime) {
// Step 0: cache the input records if needed
- if (config.getBloomIndexUseCaching()) {
- records.persist(new HoodieConfig(config.getProps())
- .getString(HoodieIndexConfig.BLOOM_INDEX_INPUT_STORAGE_LEVEL_VALUE));
+ if (config.getBloomIndexUseCaching() && instantTime.isPresent()) {
+ String storageLevel =
config.getString(HoodieIndexConfig.BLOOM_INDEX_INPUT_STORAGE_LEVEL_VALUE);
Review Comment:
can we move this to constructor and use it everywhere instead of parsing
multiple times?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]