Jackie-Jiang commented on code in PR #8766:
URL: https://github.com/apache/pinot/pull/8766#discussion_r881989071
##########
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPruner.java:
##########
@@ -190,6 +201,33 @@ private void extractPredicateColumns(FilterContext filter, Set<String> eqInColum
String column = lhs.getIdentifier();
Predicate.Type predicateType = predicate.getType();
+ switch (predicateType) {
+ case EQ: {
+ eqInColumns.add(column);
+ cachedValues.put(predicate, new CachedValue(((EqPredicate) predicate).getValue()));
+ break;
+ }
+ case IN: {
+ InPredicate inPredicate = (InPredicate) predicate;
+ if (inPredicate.getValues().size() > _inPredicateThreshold) {
+ break;
+ }
+ eqInColumns.add(column);
+ List<CachedValue> list = new ArrayList<>(inPredicate.getValues().size());
+ for (String value : inPredicate.getValues()) {
+ list.add(new CachedValue(value));
+ }
+ cachedValues.put(predicate, list);
+ break;
+ }
+ case RANGE: {
+ rangeColumns.add(column);
+ break;
+ }
+ default: {
+ break;
+ }
+ }
if (predicateType == Predicate.Type.EQ || (predicateType == Predicate.Type.IN
Review Comment:
(MAJOR) We are adding the columns twice
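A minimal sketch of the fix (simplified and outside Pinot; `PredicateType` and the method name are stand-ins): record each column exactly once inside the switch, and drop the trailing `if (predicateType == EQ || ...)` block that added the EQ/IN columns a second time.

```java
import java.util.HashSet;
import java.util.Set;

public class ExtractSketch {
    enum PredicateType { EQ, IN, RANGE, OTHER }

    // Record each column exactly once inside the switch; the old trailing
    // "if (predicateType == EQ || predicateType == IN)" block must be
    // removed, otherwise the same column is processed twice.
    static void extract(String column, PredicateType type,
                        Set<String> eqInColumns, Set<String> rangeColumns) {
        switch (type) {
            case EQ:
            case IN:
                eqInColumns.add(column);
                break;
            case RANGE:
                rangeColumns.add(column);
                break;
            default:
                break;
        }
    }

    public static void main(String[] args) {
        Set<String> eqIn = new HashSet<>();
        Set<String> range = new HashSet<>();
        extract("col", PredicateType.EQ, eqIn, range);
        System.out.println(eqIn); // [col]
    }
}
```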
##########
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPruner.java:
##########
@@ -190,6 +201,33 @@ private void extractPredicateColumns(FilterContext filter, Set<String> eqInColum
String column = lhs.getIdentifier();
Predicate.Type predicateType = predicate.getType();
+ switch (predicateType) {
+ case EQ: {
+ eqInColumns.add(column);
+ cachedValues.put(predicate, new CachedValue(((EqPredicate) predicate).getValue()));
Review Comment:
(MAJOR) There is also a correctness issue: we need to calculate the hash on the
converted value instead of the raw value in the predicate. For example, when the
column is of data type `DOUBLE` and the predicate is `col = 1`, we need to
calculate the hash on `"1.0"` instead of `"1"`.
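A self-contained sketch of the normalization step the comment asks for (`DataType` and `convertForHash` are hypothetical stand-ins; the real pruner hashes with Guava's murmur3, which is omitted here): the literal must first be converted to the column type's canonical string form, because that is the form the bloom filter was built with.

```java
public class ValueNormalization {
    enum DataType { INT, LONG, FLOAT, DOUBLE, STRING }

    // Normalize the predicate literal to the column's stored string form
    // before hashing: for a DOUBLE column the literal "1" must hash as "1.0".
    static String convertForHash(String literal, DataType columnType) {
        switch (columnType) {
            case INT:    return Integer.toString(Integer.parseInt(literal));
            case LONG:   return Long.toString(Long.parseLong(literal));
            case FLOAT:  return Float.toString(Float.parseFloat(literal));
            case DOUBLE: return Double.toString(Double.parseDouble(literal));
            default:     return literal;
        }
    }

    public static void main(String[] args) {
        System.out.println(convertForHash("1", DataType.DOUBLE)); // 1.0
    }
}
```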
##########
pinot-core/src/test/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPrunerTest.java:
##########
@@ -148,17 +158,18 @@ public void testBloomFilterInPredicatePruning() {
when(indexSegment.getDataSource("column")).thenReturn(dataSource);
// Add support for bloom filter
DataSourceMetadata dataSourceMetadata = mock(DataSourceMetadata.class);
- BloomFilterReader bloomFilterReader = mock(BloomFilterReader.class);
+ BloomFilterReader bloomFilterReader = new BloomFilterReaderBuilder()
+ .put("1")
Review Comment:
(nit) Make it one line
##########
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPruner.java:
##########
@@ -190,6 +201,33 @@ private void extractPredicateColumns(FilterContext filter, Set<String> eqInColum
String column = lhs.getIdentifier();
Predicate.Type predicateType = predicate.getType();
+ switch (predicateType) {
+ case EQ: {
+ eqInColumns.add(column);
+ cachedValues.put(predicate, new CachedValue(((EqPredicate) predicate).getValue()));
Review Comment:
This can cause unnecessary overhead when no bloom filter is available for the
column. We should instead populate the map lazily, when the bloom filter is
actually applied.
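One way to sketch the lazy population the reviewer suggests (hypothetical, simplified: the real map is keyed by `Predicate` and caches murmur hashes; here a plain `hashCode` stands in): use `Map.computeIfAbsent` so the hash is only ever computed for values that a bloom filter actually probes, and is reused across segments.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class LazyHashCache {
    // Stand-in for the predicate -> CachedValue map; keyed by identity,
    // like the real pruner's IdentityHashMap.
    private final Map<Object, Long> cache = new IdentityHashMap<>();
    int computations = 0;

    // Compute the (stand-in) hash only when a segment actually has a bloom
    // filter for the column; later segments reuse the cached value.
    long hashFor(Object predicateValue) {
        return cache.computeIfAbsent(predicateValue, v -> {
            computations++;
            return (long) v.toString().hashCode(); // stand-in for the murmur hash
        });
    }
}
```

If no segment has a bloom filter for the column, `hashFor` is never called and no hashing work is done at all.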
##########
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPruner.java:
##########
@@ -91,23 +95,28 @@ public List<IndexSegment> prune(List<IndexSegment> segments, QueryContext query)
// Extract EQ/IN/RANGE predicate columns
Set<String> eqInColumns = new HashSet<>();
Set<String> rangeColumns = new HashSet<>();
- extractPredicateColumns(filter, eqInColumns, rangeColumns);
+ // As Predicates are recursive structures, their hashCode is quite expensive.
+ // By using an IdentityHashMap here we don't need to iterate over the recursive
+ // structure. This is especially useful in the IN expression.
+ Map<Predicate, Object> cachedValues = new IdentityHashMap<>();
+ extractPredicateColumns(filter, eqInColumns, rangeColumns, cachedValues);
if (eqInColumns.isEmpty() && rangeColumns.isEmpty()) {
return segments;
}
int numSegments = segments.size();
List<IndexSegment> selectedSegments = new ArrayList<>(numSegments);
+
if (!eqInColumns.isEmpty() && query.isEnablePrefetch()) {
Map[] dataSourceCaches = new Map[numSegments];
FetchContext[] fetchContexts = new FetchContext[numSegments];
try {
// Prefetch bloom filter for columns within the EQ/IN predicate if exists
for (int i = 0; i < numSegments; i++) {
IndexSegment segment = segments.get(i);
- Map<String, DataSource> dataSourceCache = new HashMap<>();
- Map<String, List<ColumnIndexType>> columnToIndexList = new HashMap<>();
+ Map<String, DataSource> dataSourceCache = new HashMap<>(eqInColumns.size());
Review Comment:
The parameter here is the initial capacity, without regard to the load factor.
So if we insert this number of entries into the map, it very likely needs to do
a resize, which is undesirable. Also, if there are very few columns,
initializing a very small map can increase the chance of hash collisions and
cause worse performance.
I'd suggest just using the empty constructor; alternatively, we have the util
methods `HashUtil.getHashMapCapacity()` and `HashUtil.getMinHashSetSize()` to
compute the capacity with regard to the load factor (IMO that is not necessary;
the default initial capacity of 16 should be good for most scenarios).
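A sketch of the load-factor-aware sizing the util methods are for (the formula below is an assumption mirroring common helpers such as Guava's `Maps.newHashMapWithExpectedSize`, not the actual `HashUtil` implementation): pick a capacity whose resize threshold exceeds the expected entry count.

```java
import java.util.HashMap;
import java.util.Map;

public final class HashMapSizing {
    private static final float DEFAULT_LOAD_FACTOR = 0.75f;

    // A capacity large enough that inserting expectedSize entries never
    // crosses HashMap's resize threshold (tableSize * loadFactor).
    static int capacityFor(int expectedSize) {
        return (int) (expectedSize / DEFAULT_LOAD_FACTOR) + 1;
    }

    public static void main(String[] args) {
        // new HashMap<>(13) rounds the table up to 16, giving a resize
        // threshold of 16 * 0.75 = 12, so the 13th insert triggers a resize.
        // capacityFor(13) = 18 rounds up to a table of 32 and avoids it.
        Map<String, String> map = new HashMap<>(capacityFor(13));
        System.out.println(capacityFor(13)); // 18
    }
}
```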
##########
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ColumnValueSegmentPruner.java:
##########
@@ -435,4 +482,27 @@ private static Comparable convertValue(String stringValue, DataType dataType) {
throw new BadQueryRequestException(e);
}
}
+
+ private static class CachedValue {
+ private final String _strValue;
+ private final long _hash1;
+ private final long _hash2;
+ private DataType _dt;
+ private Comparable _comparableValue;
+
+ private CachedValue(Object value) {
+ _strValue = value.toString();
+ byte[] hash = GuavaBloomFilterReaderUtils.hash(_strValue);
+ _hash1 = Longs.fromBytes(hash[7], hash[6], hash[5], hash[4], hash[3], hash[2], hash[1], hash[0]);
+ _hash2 = Longs.fromBytes(hash[15], hash[14], hash[13], hash[12], hash[11], hash[10], hash[9], hash[8]);
+ }
+
+ private Comparable getComparableValue(DataType dt) {
Review Comment:
Here we make the assumption that all segments follow the same schema (the same
column always has the same data type across all segments), which might not
always be the case. In some corner cases, different segments might be generated
with different schemas.
We may cache a data type here and check whether the data type matches before
pruning the segment. If the data type does not match, skip pruning that
segment. This could cause a false negative for the pruning, but it should be
very rare.
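The suggested guard could look roughly like this (a hypothetical, self-contained simplification of `CachedValue`; `DataType` and the method names are stand-ins): remember which data type the value was converted for, and signal the caller to skip pruning when a segment reports a different type.

```java
public class TypedCachedValue {
    enum DataType { INT, LONG, DOUBLE, STRING }

    private final String strValue;
    private DataType cachedType;            // type the value was converted for
    private Comparable<?> comparableValue;  // lazily converted once

    TypedCachedValue(String strValue) {
        this.strValue = strValue;
    }

    // Convert on first use and remember the data type. If a later segment
    // reports a different type for the same column, return null so the caller
    // skips pruning that segment: a rare false negative, never a wrong prune.
    Comparable<?> comparableValueFor(DataType segmentType) {
        if (comparableValue == null) {
            cachedType = segmentType;
            comparableValue = convert(strValue, segmentType);
        }
        return cachedType == segmentType ? comparableValue : null;
    }

    private static Comparable<?> convert(String value, DataType type) {
        switch (type) {
            case INT:    return Integer.valueOf(value);
            case LONG:   return Long.valueOf(value);
            case DOUBLE: return Double.valueOf(value);
            default:     return value;
        }
    }
}
```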
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]