imply-cheddar commented on code in PR #13977:
URL: https://github.com/apache/druid/pull/13977#discussion_r1148747811
##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -22,4 +22,70 @@
public interface ColumnConfig
{
int columnCacheSizeBytes();
+
+ /**
+ * If the total number of rows in a column multiplied by this value is smaller than the total number of bitmap
+ * index operations required to perform to use a {@link LexicographicalRangeIndex} or {@link NumericRangeIndex},
+ * then for any {@link ColumnIndexSupplier} which chooses to participate in this config it will skip computing the
+ * index, in favor of doing a full scan and using a {@link org.apache.druid.query.filter.ValueMatcher} instead.
+ * This is indicated by returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
+ * been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the
Review Comment:
I'm scared of exposing this behavior through adjusting the return of
`.as()`. I think it would be better to have the columns continue to return a
meaningful object for `.as()`, but have those interfaces that benefit from
this be able to opt themselves out of being used for bitmaps. (Or, in some
future world, return an equivalent `ValueMatcher`.)
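To sketch what I mean (purely illustrative names here, not the real Druid interfaces, and the cost rule is just the one described in the javadoc above):

```java
// Purely illustrative sketch, not the real Druid interfaces: the idea is that
// the index type itself opts out of bitmap usage, instead of as() returning null.
interface HypotheticalRangeIndex
{
  // estimated number of bitmap operations needed to use this index
  int estimateBitmapOperations();

  // opt-out hook: bitmaps are worthwhile only while the number of bitmap
  // operations stays within numRows * scale (the rule from the javadoc)
  default boolean useBitmapIndex(int numRows, double scale)
  {
    return estimateBitmapOperations() <= (double) numRows * scale;
  }
}
```

That way the caller still gets a real object from `.as()` and can ask it whether bitmaps are worth using.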
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -530,6 +533,10 @@ private ColumnHolder readNestedFieldColumn(String field)
columnBuilder.setHasMultipleValues(false)
.setHasNulls(hasNull)
.setDictionaryEncodedColumnSupplier(columnSupplier);
+
+ ColumnarInts intsColumn = ints.get();
+ final int size = intsColumn.size();
+ intsColumn.close();
Review Comment:
This type of pattern is a bit scary. In this instance, given that it happens
while deserializing the object, it's probably fine, but we've had issues where
we thought we were getting a thing and then closing it, only to find out that
the thing was cached and supposed to live for the lifetime of a query, and the
immediate close was closing it out from under the query.
That said, I think we can make this cleaner by making the `Supplier<>`
itself a class with a method that reports the number of rows represented by
the thing it supplies. It should be easy enough to implement this for the
suppliers used in this case (the number is already on the JVM heap for those
objects anyway), and it avoids any need to create an object and close it
immediately.
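Roughly this shape (hypothetical sketch, `SizedSupplier` is a made-up name):

```java
import java.util.function.Supplier;

// Purely illustrative: a Supplier that also knows the number of rows of the
// thing it supplies, so nobody has to get() a column and immediately close()
// it just to read the size.
interface SizedSupplier<T> extends Supplier<T>
{
  int size();

  static <T> SizedSupplier<T> of(Supplier<T> delegate, int size)
  {
    return new SizedSupplier<T>()
    {
      @Override
      public T get()
      {
        return delegate.get();
      }

      @Override
      public int size()
      {
        // the row count is already on the JVM heap, no column object needed
        return size;
      }
    };
  }
}
```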
##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -22,4 +22,70 @@
public interface ColumnConfig
{
int columnCacheSizeBytes();
+
+ /**
+ * If the total number of rows in a column multiplied by this value is smaller than the total number of bitmap
+ * index operations required to perform to use a {@link LexicographicalRangeIndex} or {@link NumericRangeIndex},
+ * then for any {@link ColumnIndexSupplier} which chooses to participate in this config it will skip computing the
+ * index, in favor of doing a full scan and using a {@link org.apache.druid.query.filter.ValueMatcher} instead.
+ * This is indicated by returning null from {@link ColumnIndexSupplier#as(Class)} even though it would have otherwise
+ * been able to create a {@link BitmapColumnIndex}. For range indexes on columns where every value has an index, the
+ * number of bitmap operations is determined by how many individual values fall in the range, a subset of the column's
+ * total cardinality.
+ * <p>
+ * Currently only the {@link org.apache.druid.segment.nested.NestedFieldLiteralColumnIndexSupplier} implements this
+ * behavior.
+ * <p>
+ * This can make many standalone filters faster in cases where the overhead of doing bitmap operations to construct a
+ * {@link org.apache.druid.segment.BitmapOffset} or {@link org.apache.druid.segment.vector.BitmapVectorOffset} can
+ * often exceed the cost of just doing a full scan and using a
+ * {@link org.apache.druid.query.filter.ValueMatcher}.
Review Comment:
I think the "many" in "This can make many standalone filters faster" is a
bit of a hyperbole. What about "Some standalone filters speed up with this
optimization in cases where the overhead of walking the dictionary and
combining bitmaps to construct ..."?
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -538,7 +545,10 @@ private ColumnHolder readNestedFieldColumn(String field)
localDictionarySupplier,
stringDictionarySupplier,
longDictionarySupplier,
- doubleDictionarySupplier
+ doubleDictionarySupplier,
+ size,
+ columnConfig.skipValueRangeIndexScale(),
+ columnConfig.skipValuePredicateIndexScale()
Review Comment:
Why not just pass the `columnConfig` object itself?
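i.e. something like this (hypothetical simplified types, just to show the two constructor shapes):

```java
// Sketch of the suggestion: pass the config object rather than unpacking its
// values at the call site; the signature then stays stable if more tuning
// knobs get added later. Types below are made-up stand-ins.
interface HypotheticalColumnConfig
{
  double skipValueRangeIndexScale();

  double skipValuePredicateIndexScale();
}

final class HypotheticalIndexSupplier
{
  private final int numRows;
  private final HypotheticalColumnConfig columnConfig;

  // instead of (int numRows, double rangeScale, double predicateScale)
  HypotheticalIndexSupplier(int numRows, HypotheticalColumnConfig columnConfig)
  {
    this.numRows = numRows;
    this.columnConfig = columnConfig;
  }

  double rangeSkipThreshold()
  {
    return numRows * columnConfig.skipValueRangeIndexScale();
  }
}
```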
##########
benchmarks/src/test/java/org/apache/druid/benchmark/query/SqlNestedDataBenchmark.java:
##########
@@ -173,7 +173,31 @@ public String getFormatString()
"SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 1005.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
// 28, 29
"SELECT SUM(long1) FROM foo WHERE double3 < 2000.0 AND double3 > 1000.0",
- "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0"
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 2000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 30, 31
+ "SELECT SUM(long1) FROM foo WHERE double3 < 3000.0 AND double3 > 1000.0",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 3000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 32, 33
+ "SELECT SUM(long1) FROM foo WHERE double3 < 5000.0 AND double3 > 1000.0",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) < 5000.0 AND JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE) > 1000.0",
+ // 34, 35 smaller cardinality like range filter
+ "SELECT SUM(long1) FROM foo WHERE string1 LIKE '1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '1%'",
+ // 36, 37 smaller cardinality like predicate filter
+ "SELECT SUM(long1) FROM foo WHERE string1 LIKE '%1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') LIKE '%1%'",
+ // 38, 39 moderate cardinality like range
+ "SELECT SUM(long1) FROM foo WHERE string5 LIKE '1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '1%'",
+ // 40, 41 big cardinality lex range
+ "SELECT SUM(long1) FROM foo WHERE string5 > '1'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') > '1'",
+ // 42, 43 big cardinality like predicate filter
+ "SELECT SUM(long1) FROM foo WHERE string5 LIKE '%1%'",
+ "SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string5') LIKE '%1%'",
Review Comment:
Can you add various sized IN clauses to your benchmark? 10, 100, 1000,
10000, 100000?
After reading the PR, I don't think they should be impacted, but it would be
good to know that they aren't: they are an example of a thing that falls back
to predicate-like behavior, and yet we have also seen them cause performance
issues when run as a `ValueMatcher`.
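For generating the queries, something along these lines would do (hypothetical helper, not in the benchmark today):

```java
// Hypothetical helper (not part of SqlNestedDataBenchmark) for generating
// the suggested IN-clause benchmark queries at the various sizes.
final class InClauseQueries
{
  static String query(int numValues)
  {
    StringBuilder sb = new StringBuilder("SELECT SUM(long1) FROM foo WHERE string5 IN (");
    for (int i = 0; i < numValues; i++) {
      if (i > 0) {
        sb.append(", ");
      }
      sb.append('\'').append(i).append('\'');
    }
    return sb.append(')').toString();
  }
}
// e.g. query(10), query(100), query(1000), query(10000), query(100000)
```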
##########
processing/src/main/java/org/apache/druid/segment/nested/NestedFieldLiteralColumnIndexSupplier.java:
##########
@@ -124,6 +131,10 @@ public <T> T as(Class<T> clazz)
return (T) new NestedLiteralDictionaryEncodedStringValueIndex();
}
+ if (skipPredicateIndex && clazz.equals(DruidPredicateIndex.class)) {
+ return null;
+ }
Review Comment:
Technically, this would indicate that this column cannot implement the given
interface, so the filter in question would fall back to another interface
(i.e. it might make the filter cascade down into trying an even worse
strategy). Instead, we should return a `DruidPredicateIndex` that returns
`null` when asked to compute the index, indicating that it was unable to
compute the index. This will make it easier to do this "correctly" when we
make the change to have this object able to also return `ValueMatcher` objects.
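Roughly like this (illustrative stand-in types, not the real `DruidPredicateIndex` API):

```java
// Purely illustrative sketch: as() keeps returning a non-null index, and the
// index itself declines by returning null from its compute method. Types
// below are simplified stand-ins for the real Druid interfaces.
interface HypotheticalBitmapResult
{
}

interface HypotheticalPredicateIndex
{
  // null here means "unable / unwilling to compute the index", which keeps
  // the filter from cascading down to a worse strategy
  HypotheticalBitmapResult computeBitmapResult();
}

final class SkippingPredicateIndex implements HypotheticalPredicateIndex
{
  private final HypotheticalPredicateIndex delegate;
  private final boolean skip;

  SkippingPredicateIndex(HypotheticalPredicateIndex delegate, boolean skip)
  {
    this.delegate = delegate;
    this.skip = skip;
  }

  @Override
  public HypotheticalBitmapResult computeBitmapResult()
  {
    return skip ? null : delegate.computeBitmapResult();
  }
}
```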
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]