gianm commented on a change in pull request #10432:
URL: https://github.com/apache/druid/pull/10432#discussion_r494536534
##########
File path: docs/querying/query-context.md
##########
@@ -92,7 +92,7 @@ include "selector", "bound", "in", "like", "regex", "search",
"and", "or", and "
- All aggregators must offer vectorized implementations. These include
"count", "doubleSum", "floatSum", "longSum", "longMin",
"longMax", "doubleMin", "doubleMax", "floatMin", "floatMax", "longAny",
"doubleAny", "floatAny", "stringAny",
"hyperUnique", "filtered", "approxHistogram", "approxHistogramFold", and
"fixedBucketsHistogram" (with numerical input).
-- No virtual columns.
+- All virtual columns must offer vectorized implementations.
Review comment:
There's only one built-in virtual column: "expression". So, we should
mention when that specific type of virtual column can be vectorized. (Does it
depend on the type of expression? Types of input? Functions used? etc)
##########
File path: docs/querying/query-context.md
##########
@@ -106,3 +106,4 @@ vectorization. These query types will ignore the
"vectorize" parameter even if i
|--------|-------|------------|
|vectorize|`true`|Enables or disables vectorized query execution. Possible
values are `false` (disabled), `true` (enabled if possible, disabled otherwise,
on a per-segment basis), and `force` (enabled, and groupBy or timeseries
queries that cannot be vectorized will fail). The `"force"` setting is meant to
aid in testing, and is not generally useful in production (since real-time
segments can never be processed with vectorized execution, any queries on
real-time data will fail). This will override
`druid.query.default.context.vectorize` if it's set.|
|vectorSize|`512`|Sets the row batching size for a particular query. This will
override `druid.query.default.context.vectorSize` if it's set.|
+|vectorizeVirtualColumns|`true`|Enables or disables vectorized query
processing of queries with virtual columns, layered on top of `vectorize`
(`vectorize` must also be set to true for a query to utilize vectorization).
Possible values are `false` (disabled), `true` (enabled if possible, disabled
otherwise, on a per-segment basis), and `force` (enabled, and groupBy or
timeseries queries with virtual columns that cannot be vectorized will fail).
The `"force"` setting is meant to aid in testing, and is not generally useful
in production. This will override
`druid.query.default.context.vectorizeVirtualColumns` if it's set.|
Review comment:
IMO, it'd be best to set it to `false` now, then change it to `true` for
the next release. It'll lower the risk around upgrading, and we want upgrades
to be really low risk so people feel comfortable just dropping in the newest
version.
The specific risks here are:
1) Bugs in vectorized expressions.
2) Situations where vectorized expressions perform worse than non-vectorized
ones (perhaps some optimizations are implemented for the non-vectorized case,
but not the vectorized case).
3) Situations where there is a bug in the vectorized query engines _in
general_, but a specific user isn't hitting it, because they're using
expressions and so the query wouldn't vectorize anyway.
##########
File path:
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/vector/VectorGroupByEngine.java
##########
@@ -76,10 +76,18 @@ public static boolean canVectorize(
Function<String, ColumnCapabilities> capabilitiesFunction = name ->
query.getVirtualColumns().getColumnCapabilitiesWithFallback(adapter,
name);
+ final boolean adapterCanVectorize = adapter.canVectorize(filter,
query.getVirtualColumns(), false);
+ final boolean virtualColumnsCanVectorize;
+ if (query.getVirtualColumns().getVirtualColumns().length > 0) {
Review comment:
There's some duplication here between the query engines — how about
creating a helper that takes the query and the virtual columns and creates a
`Vectorize` instance that synthesizes both?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]