nsivabalan commented on code in PR #13563:
URL: https://github.com/apache/hudi/pull/13563#discussion_r2217009545
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -455,6 +456,45 @@ public final class HoodieMetadataConfig extends HoodieConfig {
+ "The index name either starts with or matches exactly can be one of the following: "
+ StringUtils.join(Arrays.stream(MetadataPartitionType.values()).map(MetadataPartitionType::getPartitionPath).collect(Collectors.toList()), ", "));
+ // Configs that control the bloom filter that is written to the file footer
+ public static final ConfigProperty<String> BLOOM_FILTER_TYPE = ConfigProperty
+ .key(String.format("%s.%s", METADATA_PREFIX, "bloom.index.filter.type"))
+ .defaultValue(BloomFilterTypeCode.DYNAMIC_V0.name())
+ .withValidValues(BloomFilterTypeCode.SIMPLE.name(), BloomFilterTypeCode.DYNAMIC_V0.name())
+ .markAdvanced()
+ .withDocumentation(BloomFilterTypeCode.class);
+
+ public static final ConfigProperty<String> BLOOM_FILTER_NUM_ENTRIES_VALUE = ConfigProperty
+ .key(String.format("%s.%s", METADATA_PREFIX, "index.bloom.num_entries"))
+ .defaultValue("60000")
Review Comment:
Should we reduce this default value? It was chosen based on data files. In the MDT, each partition could have a different average number of entries. For example: for FILES, somewhere around 1,000; for col stats, expression index, and bloom index, maybe 30,000; and for RLI, 100k.
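
A hedged sketch of the idea: pick the entry-count default per MDT partition instead of one global value. The partition path strings and the numbers below are illustrative, taken from this comment; the helper is hypothetical, not existing Hudi API.

```java
// Hypothetical helper: choose a bloom filter entry-count default per MDT partition.
// Names and counts are illustrative, mirroring the estimates in this review comment.
class MdtBloomDefaults {
  static int defaultNumEntries(String partitionPath) {
    switch (partitionPath) {
      case "files":
        return 1_000;    // FILES partition: roughly 1k entries
      case "column_stats":
      case "expr_index":
      case "bloom_filters":
        return 30_000;   // col stats / expression index / bloom index
      case "record_index":
        return 100_000;  // RLI: the largest partition
      default:
        return 60_000;   // fall back to the current global default
    }
  }

  public static void main(String[] args) {
    System.out.println(defaultNumEntries("files"));        // 1000
    System.out.println(defaultNumEntries("record_index")); // 100000
  }
}
```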
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -455,6 +456,45 @@ public final class HoodieMetadataConfig extends HoodieConfig {
+ "The index name either starts with or matches exactly can be one of the following: "
+ StringUtils.join(Arrays.stream(MetadataPartitionType.values()).map(MetadataPartitionType::getPartitionPath).collect(Collectors.toList()), ", "));
+ // Configs that control the bloom filter that is written to the file footer
+ public static final ConfigProperty<String> BLOOM_FILTER_TYPE = ConfigProperty
+ .key(String.format("%s.%s", METADATA_PREFIX, "bloom.index.filter.type"))
+ .defaultValue(BloomFilterTypeCode.DYNAMIC_V0.name())
+ .withValidValues(BloomFilterTypeCode.SIMPLE.name(), BloomFilterTypeCode.DYNAMIC_V0.name())
+ .markAdvanced()
+ .withDocumentation(BloomFilterTypeCode.class);
+
+ public static final ConfigProperty<String> BLOOM_FILTER_NUM_ENTRIES_VALUE = ConfigProperty
+ .key(String.format("%s.%s", METADATA_PREFIX, "index.bloom.num_entries"))
+ .defaultValue("60000")
+ .markAdvanced()
+ .withDocumentation("Only applies if index type is BLOOM. "
+ + "This is the number of entries to be stored in the bloom filter. "
+ + "The rationale for the default: Assume the maxParquetFileSize is 128MB and averageRecordSize is 1kb and "
+ + "hence we approx a total of 130K records in a file. The default (60000) is roughly half of this approximation. "
Review Comment:
We need to fix this documentation to account for the MDT.
##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java:
##########
@@ -421,17 +436,19 @@ private static class RecordByKeyIterator implements ClosableIterator<IndexedReco
private final Schema readerSchema;
private final Schema writerSchema;
+ private final Option<BloomFilter> bloomFilterOption;
private IndexedRecord next = null;
RecordByKeyIterator(HFileReader reader, List<String> sortedKeys, Schema writerSchema,
- Schema readerSchema) throws IOException {
+ Schema readerSchema, Option<BloomFilter> bloomFilterOption) throws IOException {
Review Comment:
Do you think we can change this to a generic `Predicate` argument?
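
A minimal sketch of this suggestion: the constructor takes a generic `Predicate<String>` key filter instead of `Option<BloomFilter>`, so the iterator is not coupled to bloom filters specifically. The types below are simplified stand-ins, not Hudi's actual classes.

```java
import java.util.Set;
import java.util.function.Predicate;

// Stand-in for the iterator: accepts any key predicate; a bloom filter would be
// adapted via a method reference (e.g. bloomFilter::mightContain), and callers
// without a filter pass a pass-through predicate (key -> true).
class KeyFilteredIterator {
  private final Predicate<String> keyFilter;

  KeyFilteredIterator(Predicate<String> keyFilter) {
    this.keyFilter = keyFilter;
  }

  // A "false" from the filter means the key is definitely absent from the file,
  // so the expensive seek can be skipped.
  boolean shouldSeek(String key) {
    return keyFilter.test(key);
  }

  public static void main(String[] args) {
    // Here a set stands in for the bloom filter's membership check.
    Set<String> present = Set.of("key1", "key2");
    KeyFilteredIterator iter = new KeyFilteredIterator(present::contains);
    System.out.println(iter.shouldSeek("key1")); // true
    System.out.println(iter.shouldSeek("key9")); // false
  }
}
```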
##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieNativeAvroHFileReader.java:
##########
@@ -338,10 +339,24 @@ private HFileReader newHFileReader() throws IOException {
return new HFileReaderImpl(inputStream, fileSize);
}
+ private Option<BloomFilter> getBloomFilter() {
+ Option<BloomFilter> bloomFilter = Option.empty();
+ if (config.enabled()) {
Review Comment:
For an existing table, there is a chance that older files may not have the bloom filter; only the new ones will. So we can't rely on write properties.
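
A sketch of this point: whether a bloom filter exists must be decided per file, from the file's own footer, because files written before the feature was enabled will not carry one even when the current write config says it is on. The footer key name below is illustrative, not Hudi's actual constant.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Decide bloom filter presence from the file footer itself, not the table config.
class FooterBloomCheck {
  static final String BLOOM_FILTER_FOOTER_KEY = "hudi.bloom.filter"; // hypothetical key

  // Returns the serialized filter if this particular file carries one; empty otherwise.
  static Optional<String> bloomFilterFromFooter(Map<String, String> footer) {
    return Optional.ofNullable(footer.get(BLOOM_FILTER_FOOTER_KEY));
  }

  public static void main(String[] args) {
    Map<String, String> newFile = new HashMap<>();
    newFile.put(BLOOM_FILTER_FOOTER_KEY, "serialized-bits");
    Map<String, String> oldFile = new HashMap<>(); // written before the feature existed

    System.out.println(bloomFilterFromFooter(newFile).isPresent()); // true
    System.out.println(bloomFilterFromFooter(oldFile).isPresent()); // false
  }
}
```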
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -455,6 +456,45 @@ public final class HoodieMetadataConfig extends HoodieConfig {
+ "The index name either starts with or matches exactly can be one of the following: "
+ StringUtils.join(Arrays.stream(MetadataPartitionType.values()).map(MetadataPartitionType::getPartitionPath).collect(Collectors.toList()), ", "));
+ // Configs that control the bloom filter that is written to the file footer
+ public static final ConfigProperty<String> BLOOM_FILTER_TYPE = ConfigProperty
+ .key(String.format("%s.%s", METADATA_PREFIX, "bloom.index.filter.type"))
+ .defaultValue(BloomFilterTypeCode.DYNAMIC_V0.name())
+ .withValidValues(BloomFilterTypeCode.SIMPLE.name(), BloomFilterTypeCode.DYNAMIC_V0.name())
+ .markAdvanced()
+ .withDocumentation(BloomFilterTypeCode.class);
+
+ public static final ConfigProperty<String> BLOOM_FILTER_NUM_ENTRIES_VALUE = ConfigProperty
+ .key(String.format("%s.%s", METADATA_PREFIX, "index.bloom.num_entries"))
+ .defaultValue("60000")
+ .markAdvanced()
+ .withDocumentation("Only applies if index type is BLOOM. "
+ + "This is the number of entries to be stored in the bloom filter. "
+ + "The rationale for the default: Assume the maxParquetFileSize is 128MB and averageRecordSize is 1kb and "
+ + "hence we approx a total of 130K records in a file. The default (60000) is roughly half of this approximation. "
+ + "Warning: Setting this very low, will generate a lot of false positives and index lookup "
+ + "will have to scan a lot more files than it has to and setting this to a very high number will "
+ + "increase the size every base file linearly (roughly 4KB for every 50000 entries). "
+ + "This config is also used with DYNAMIC bloom filter which determines the initial size for the bloom.");
+
+ public static final ConfigProperty<String> BLOOM_FILTER_FPP_VALUE = ConfigProperty
+ .key(String.format("%s.%s", METADATA_PREFIX, "index.bloom.fpp"))
+ .defaultValue("0.000000001")
+ .markAdvanced()
+ .withDocumentation("Only applies if index type is BLOOM. "
+ + "Error rate allowed given the number of entries. This is used to calculate how many bits should be "
+ + "assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), "
+ + "we like to tradeoff disk space for lower false positives. "
+ + "If the number of entries added to bloom filter exceeds the configured value (hoodie.index.bloom.num_entries), "
+ + "then this fpp may not be honored.");
+
+ public static final ConfigProperty<String> BLOOM_FILTER_DYNAMIC_MAX_ENTRIES = ConfigProperty
+ .key(String.format("%s.%s", METADATA_PREFIX, "bloom.index.filter.dynamic.max.entries"))
+ .defaultValue("100000")
Review Comment:
Instead, we can leverage the dynamic bloom filter: start with 10k entries and grow up to 100k, so that we do not need to configure different values for different MDT partitions.
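
A simplified model of that growth behavior: start the filter at a 10k-entry capacity and let it grow in steps up to a 100k cap, so small partitions (like FILES) stay small while large ones (like RLI) still fit. This is a sketch of the sizing idea only, not Hudi's DYNAMIC_V0 implementation.

```java
// Model of dynamic bloom filter sizing: capacity grows in fixed steps up to a cap.
class DynamicBloomSizing {
  static final int INITIAL_ENTRIES = 10_000;  // starting capacity (10k)
  static final int MAX_ENTRIES = 100_000;     // hard cap (100k)

  // Effective capacity after inserting n keys: ceil(n / step) steps, at least one,
  // never exceeding the cap.
  static int capacityFor(int n) {
    int steps = Math.max(1, (n + INITIAL_ENTRIES - 1) / INITIAL_ENTRIES);
    return Math.min(steps * INITIAL_ENTRIES, MAX_ENTRIES);
  }

  public static void main(String[] args) {
    System.out.println(capacityFor(500));     // 10000 -> small partitions stay small
    System.out.println(capacityFor(35_000));  // 40000 -> grows as entries are added
    System.out.println(capacityFor(250_000)); // 100000 -> capped at the max
  }
}
```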
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]