Jackie-Jiang commented on code in PR #8766:
URL: https://github.com/apache/pinot/pull/8766#discussion_r881989071
##########
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPruner.java:
##########
@@ -190,6 +201,33 @@ private void extractPredicateColumns(FilterContext filter, Set<String> eqInColum
String column = lhs.getIdentifier();
Predicate.Type predicateType = predicate.getType();
+ switch (predicateType) {
+ case EQ: {
+ eqInColumns.add(column);
+ cachedValues.put(predicate, new CachedValue(((EqPredicate) predicate).getValue()));
+ break;
+ }
+ case IN: {
+ InPredicate inPredicate = (InPredicate) predicate;
+ if (inPredicate.getValues().size() > _inPredicateThreshold) {
+ break;
+ }
+ eqInColumns.add(column);
+ List<CachedValue> list = new ArrayList<>(inPredicate.getValues().size());
+ for (String value : inPredicate.getValues()) {
+ list.add(new CachedValue(value));
+ }
+ cachedValues.put(predicate, list);
+ break;
+ }
+ case RANGE: {
+ rangeColumns.add(column);
+ break;
+ }
+ default: {
+ break;
+ }
+ }
if (predicateType == Predicate.Type.EQ || (predicateType == Predicate.Type.IN
Review Comment:
(MAJOR) We are adding the columns twice
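A minimal sketch of the fix (simplified and outside Pinot; `PredicateType` and the method name are stand-ins): record each column exactly once inside the switch, and drop the trailing `if (predicateType == EQ || ...)` block that added the EQ/IN columns a second time.

```java
import java.util.HashSet;
import java.util.Set;

public class ExtractSketch {
    enum PredicateType { EQ, IN, RANGE, OTHER }

    // Record each column exactly once inside the switch; the old trailing
    // "if (predicateType == EQ || predicateType == IN)" block must be
    // removed, otherwise the same column is processed twice.
    static void extract(String column, PredicateType type,
                        Set<String> eqInColumns, Set<String> rangeColumns) {
        switch (type) {
            case EQ:
            case IN:
                eqInColumns.add(column);
                break;
            case RANGE:
                rangeColumns.add(column);
                break;
            default:
                break;
        }
    }

    public static void main(String[] args) {
        Set<String> eqIn = new HashSet<>();
        Set<String> range = new HashSet<>();
        extract("col", PredicateType.EQ, eqIn, range);
        System.out.println(eqIn); // [col]
    }
}
```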
##########
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPruner.java:
##########
@@ -190,6 +201,33 @@ private void extractPredicateColumns(FilterContext filter, Set<String> eqInColum
String column = lhs.getIdentifier();
Predicate.Type predicateType = predicate.getType();
+ switch (predicateType) {
+ case EQ: {
+ eqInColumns.add(column);
+ cachedValues.put(predicate, new CachedValue(((EqPredicate) predicate).getValue()));
Review Comment:
(MAJOR) There is also a correctness issue: we need to calculate the hash on the
converted value instead of the raw value in the predicate. For example, when the
column is of data type `DOUBLE` and the predicate is `col = 1`, we need to
calculate the hash on `"1.0"` instead of `"1"`.
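A self-contained sketch of the normalization step the comment asks for (`DataType` and `convertForHash` are hypothetical stand-ins; the real pruner hashes with Guava's murmur3, which is omitted here): the literal must first be converted to the column type's canonical string form, because that is the form the bloom filter was built with.

```java
public class ValueNormalization {
    enum DataType { INT, LONG, FLOAT, DOUBLE, STRING }

    // Normalize the predicate literal to the column's stored string form
    // before hashing: for a DOUBLE column the literal "1" must hash as "1.0".
    static String convertForHash(String literal, DataType columnType) {
        switch (columnType) {
            case INT:    return Integer.toString(Integer.parseInt(literal));
            case LONG:   return Long.toString(Long.parseLong(literal));
            case FLOAT:  return Float.toString(Float.parseFloat(literal));
            case DOUBLE: return Double.toString(Double.parseDouble(literal));
            default:     return literal;
        }
    }

    public static void main(String[] args) {
        System.out.println(convertForHash("1", DataType.DOUBLE)); // 1.0
    }
}
```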
##########
pinot-core/src/test/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPrunerTest.java:
##########
@@ -148,17 +158,18 @@ public void testBloomFilterInPredicatePruning() {
when(indexSegment.getDataSource("column")).thenReturn(dataSource);
// Add support for bloom filter
DataSourceMetadata dataSourceMetadata = mock(DataSourceMetadata.class);
- BloomFilterReader bloomFilterReader = mock(BloomFilterReader.class);
+ BloomFilterReader bloomFilterReader = new BloomFilterReaderBuilder()
+ .put("1")
Review Comment:
(nit) Make it one line
##########
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPruner.java:
##########
@@ -190,6 +201,33 @@ private void extractPredicateColumns(FilterContext filter, Set<String> eqInColum
String column = lhs.getIdentifier();
Predicate.Type predicateType = predicate.getType();
+ switch (predicateType) {
+ case EQ: {
+ eqInColumns.add(column);
+ cachedValues.put(predicate, new CachedValue(((EqPredicate) predicate).getValue()));
Review Comment:
This can cause unnecessary overhead when no bloom filter is available for the
column. We should instead populate the map lazily, when the bloom filter is
actually applied.
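One way to sketch the lazy population the reviewer suggests (hypothetical, simplified: the real map is keyed by `Predicate` and caches murmur hashes; here a plain `hashCode` stands in): use `Map.computeIfAbsent` so the hash is only ever computed for values that a bloom filter actually probes, and is reused across segments.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class LazyHashCache {
    // Stand-in for the predicate -> CachedValue map; keyed by identity,
    // like the real pruner's IdentityHashMap.
    private final Map<Object, Long> cache = new IdentityHashMap<>();
    int computations = 0;

    // Compute the (stand-in) hash only when a segment actually has a bloom
    // filter for the column; later segments reuse the cached value.
    long hashFor(Object predicateValue) {
        return cache.computeIfAbsent(predicateValue, v -> {
            computations++;
            return (long) v.toString().hashCode(); // stand-in for the murmur hash
        });
    }
}
```

If no segment has a bloom filter for the column, `hashFor` is never called and no hashing work is done at all.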
##########
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPruner.java:
##########
@@ -91,23 +95,28 @@ public List<IndexSegment> prune(List<IndexSegment> segments, QueryContext query)
// Extract EQ/IN/RANGE predicate columns
Set<String> eqInColumns = new HashSet<>();
Set<String> rangeColumns = new HashSet<>();
- extractPredicateColumns(filter, eqInColumns, rangeColumns);
+ // As Predicates are recursive structures, their hashCode is quite expensive.
+ // By using an IdentityHashMap here we don't need to iterate over the recursive
+ // structure. This is especially useful in the IN expression.
+ Map<Predicate, Object> cachedValues = new IdentityHashMap<>();
+ extractPredicateColumns(filter, eqInColumns, rangeColumns, cachedValues);
if (eqInColumns.isEmpty() && rangeColumns.isEmpty()) {
return segments;
}
int numSegments = segments.size();
List<IndexSegment> selectedSegments = new ArrayList<>(numSegments);
+
if (!eqInColumns.isEmpty() && query.isEnablePrefetch()) {
Map[] dataSourceCaches = new Map[numSegments];
FetchContext[] fetchContexts = new FetchContext[numSegments];
try {
// Prefetch bloom filter for columns within the EQ/IN predicate if exists
for (int i = 0; i < numSegments; i++) {
IndexSegment segment = segments.get(i);
- Map<String, DataSource> dataSourceCache = new HashMap<>();
- Map<String, List<ColumnIndexType>> columnToIndexList = new HashMap<>();
+ Map<String, DataSource> dataSourceCache = new HashMap<>(eqInColumns.size());
Review Comment:
The parameter here is the initial capacity, without regard to the load factor.
So if we insert this number of entries into the map, it very likely needs to do
a resize, which is undesirable. Also, if there are very few columns,
initializing a very small map can increase the chance of hash collisions and
cause worse performance.
I'd suggest just using the empty constructor; alternatively, we have the util
methods `HashUtil.getHashMapCapacity()` and `HashUtil.getMinHashSetSize()` to
compute the capacity with regard to the load factor (IMO that is not necessary;
the default initial capacity of 16 should be good for most scenarios).
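A sketch of the load-factor-aware sizing the util methods are for (the formula below is an assumption mirroring common helpers such as Guava's `Maps.newHashMapWithExpectedSize`, not the actual `HashUtil` implementation): pick a capacity whose resize threshold exceeds the expected entry count.

```java
import java.util.HashMap;
import java.util.Map;

public final class HashMapSizing {
    private static final float DEFAULT_LOAD_FACTOR = 0.75f;

    // A capacity large enough that inserting expectedSize entries never
    // crosses HashMap's resize threshold (tableSize * loadFactor).
    static int capacityFor(int expectedSize) {
        return (int) (expectedSize / DEFAULT_LOAD_FACTOR) + 1;
    }

    public static void main(String[] args) {
        // new HashMap<>(13) rounds the table up to 16, giving a resize
        // threshold of 16 * 0.75 = 12, so the 13th insert triggers a resize.
        // capacityFor(13) = 18 rounds up to a table of 32 and avoids it.
        Map<String, String> map = new HashMap<>(capacityFor(13));
        System.out.println(capacityFor(13)); // 18
    }
}
```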
##########
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPruner.java:
##########
@@ -435,4 +482,27 @@ private static Comparable convertValue(String stringValue, DataType dataType) {
throw new BadQueryRequestException(e);
}
}
+
+ private static class CachedValue {
+ private final String _strValue;
+ private final long _hash1;
+ private final long _hash2;
+ private DataType _dt;
+ private Comparable _comparableValue;
+
+ private CachedValue(Object value) {
+ _strValue = value.toString();
+ byte[] hash = GuavaBloomFilterReaderUtils.hash(_strValue);
+ _hash1 = Longs.fromBytes(hash[7], hash[6], hash[5], hash[4], hash[3], hash[2], hash[1], hash[0]);
+ _hash2 = Longs.fromBytes(hash[15], hash[14], hash[13], hash[12], hash[11], hash[10], hash[9], hash[8]);
+ }
+
+ private Comparable getComparableValue(DataType dt) {
Review Comment:
Here we make the assumption that all segments follow the same schema (the same
column always has the same data type across all segments), which might not
always be the case. In some corner cases, different segments might be generated
with different schemas.
We may cache a data type here and check whether the data type matches before
pruning the segment. If the data type does not match, skip pruning that
segment. This could cause a false negative for the pruning, but it should be
very rare.
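The suggested guard could look roughly like this (a hypothetical, self-contained simplification of `CachedValue`; `DataType` and the method names are stand-ins): remember which data type the value was converted for, and signal the caller to skip pruning when a segment reports a different type.

```java
public class TypedCachedValue {
    enum DataType { INT, LONG, DOUBLE, STRING }

    private final String strValue;
    private DataType cachedType;            // type the value was converted for
    private Comparable<?> comparableValue;  // lazily converted once

    TypedCachedValue(String strValue) {
        this.strValue = strValue;
    }

    // Convert on first use and remember the data type. If a later segment
    // reports a different type for the same column, return null so the caller
    // skips pruning that segment: a rare false negative, never a wrong prune.
    Comparable<?> comparableValueFor(DataType segmentType) {
        if (comparableValue == null) {
            cachedType = segmentType;
            comparableValue = convert(strValue, segmentType);
        }
        return cachedType == segmentType ? comparableValue : null;
    }

    private static Comparable<?> convert(String value, DataType type) {
        switch (type) {
            case INT:    return Integer.valueOf(value);
            case LONG:   return Long.valueOf(value);
            case DOUBLE: return Double.valueOf(value);
            default:     return value;
        }
    }
}
```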
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]