[
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720922#comment-17720922
]
ASF GitHub Bot commented on PARQUET-2254:
-----------------------------------------
yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1188564979
##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##########
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bloomfilter;
+
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.LOWER_BOUND_BYTES;
+import static org.apache.parquet.column.values.bloomfilter.BlockSplitBloomFilter.UPPER_BOUND_BYTES;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.io.api.Binary;
+
+/**
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` candidates and inserts values
+ * into all of the candidates at the same time.
+ * The purpose of this is to finally generate a bloom filter with the optimal bit size according to the
+ * number of distinct values in the real data. It uses the largest bloom filter as an approximate
+ * deduplication counter, and removes incapable bloom filter candidates during data insertion.
+ */
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
+
+ private static final Logger LOG = LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
+
+ // multiple candidates, inserting data at the same time. If the number of distinct values exceeds a
+ // candidate's expected NDV, that candidate is removed. Finally, we choose the smallest remaining
+ // candidate to write out.
+ private final List<BloomFilterCandidate> candidates = new ArrayList<>();
+
+ // the largest among candidates, used as an approximate deduplication counter
+ private BloomFilterCandidate largestCandidate;
+
+ // the accumulator of the number of distinct values that have been inserted so far
+ private long distinctValueCounter = 0;
+
+ // indicates that the bloom filter candidate has been written out and no new data should be inserted
+ private boolean finalized = false;
+
+ // the step size used when searching for the NDV corresponding to a given numBytes
+ private static final int NDV_STEP = 500;
+ private int maximumBytes = UPPER_BOUND_BYTES;
+ private int minimumBytes = LOWER_BOUND_BYTES;
+ // the hash strategy used in this bloom filter.
+ private final HashStrategy hashStrategy;
+ // the column to build bloom filter
+ private ColumnDescriptor column;
+
+ /**
+ * Given the maximum acceptable byte size of a bloom filter, generate candidates according to it.
+ *
+ * @param maximumBytes the maximum byte size of a candidate
+ * @param numCandidates the number of candidates
+ * @param fpp the false positive probability
+ */
+ public AdaptiveBlockSplitBloomFilter(int maximumBytes, int numCandidates, double fpp, ColumnDescriptor column) {
+ this(maximumBytes, HashStrategy.XXH64, fpp, numCandidates, column);
+ }
+
+ public AdaptiveBlockSplitBloomFilter(int maximumBytes, HashStrategy hashStrategy, double fpp,
+     int numCandidates, ColumnDescriptor column) {
+ this.column = column;
+ switch (hashStrategy) {
+ case XXH64:
+ this.hashStrategy = hashStrategy;
+ break;
+ default:
+ throw new RuntimeException("Unsupported hash strategy");
+ }
+ initCandidates(maximumBytes, numCandidates, fpp);
+ }
+
+ /**
+ * Given the maximum acceptable byte size of a bloom filter, generate candidates according
+ * to that size. Because the byte size of each candidate needs to be a power of 2, the
+ * candidate sizes are set to power-of-2 proportions of `maxBytes`: `1/2`, `1/4`, `1/8`, etc.
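+ * <p>For example (an illustrative note, assuming `calculateBoundedPowerOfTwo` keeps a
+ * power-of-2 input unchanged): with `maxBytes` = 1MB and 4 candidates, the loop below
+ * would produce candidates of 1MB, 512KB, 256KB and 128KB.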
+ *
+ * @param maxBytes the maximum byte size of a candidate
+ * @param numCandidates the number of candidates
+ * @param fpp the false positive probability
+ */
+ private void initCandidates(int maxBytes, int numCandidates, double fpp) {
+ int candidateByteSize = calculateBoundedPowerOfTwo(maxBytes);
+ for (int i = 1; i <= numCandidates; i++) {
+ int candidateExpectedNDV = expectedNDV(candidateByteSize, fpp);
+ // `candidateByteSize` is too small, just drop it
+ if (candidateExpectedNDV <= 0) {
+ break;
+ }
+ BloomFilterCandidate candidate =
+ new BloomFilterCandidate(candidateExpectedNDV, candidateByteSize, minimumBytes, maximumBytes, hashStrategy);
+ candidates.add(candidate);
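+ // halve the byte size for the next, smaller candidate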
+ candidateByteSize = calculateBoundedPowerOfTwo(candidateByteSize / 2);
+ }
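+ // choose the largest candidate; it doubles as the approximate deduplication counter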
+ Optional<BloomFilterCandidate> maxBloomFilter = candidates.stream().max(BloomFilterCandidate::compareTo);
+ if (maxBloomFilter.isPresent()) {
+ largestCandidate = maxBloomFilter.get();
+ } else {
+ throw new IllegalArgumentException("`maximumBytes` is too small to create one valid bloom filter");
Review Comment:
I agree with you, this shouldn't be a fatal error.
However, if we follow the ordinary bloom filter behavior, no matter how small the byte
size setting is, a lower-bound bloom filter (32 bytes) will still be generated. So I
modified this to always generate at least one minimum BloomFilter (32 bytes) to align
with the original implementation. What do you think?
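Roughly, this is the fallback I have in mind (a minimal sketch reusing the helpers in
this patch; the exact shape may differ):

    // instead of throwing, fall back to a single minimum-size candidate,
    // mirroring BlockSplitBloomFilter's 32-byte lower bound (LOWER_BOUND_BYTES)
    if (candidates.isEmpty()) {
      int minNDV = expectedNDV(minimumBytes, fpp);
      candidates.add(new BloomFilterCandidate(minNDV, minimumBytes, minimumBytes, maximumBytes, hashStrategy));
    }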
> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
>
> h3. Why are the changes needed?
> Currently, using a bloom filter requires specifying the NDV (number of
> distinct values) up front and then building the BloomFilter. In general
> scenarios, the number of distinct values is not known in advance.
> If the BloomFilter can be generated automatically according to the data, the
> file size can be reduced and reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} contains multiple {{BlockSplitBloomFilter}}s as
> candidates and inserts values into the candidates at the same time. It uses
> the largest bloom filter as an approximate deduplication counter, and then
> removes incapable bloom filter candidates during data insertion.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)