[GitHub] [druid] gianm commented on a change in pull request #10432: add vectorizeVirtualColumns query context parameter

GitBox Thu, 24 Sep 2020 11:52:55 -0700


gianm commented on a change in pull request #10432:
URL: https://github.com/apache/druid/pull/10432#discussion_r494536534




##########
File path: docs/querying/query-context.md
##########
@@ -92,7 +92,7 @@ include "selector", "bound", "in", "like", "regex", "search", 
"and", "or", and "
 - All aggregators must offer vectorized implementations. These include 
"count", "doubleSum", "floatSum", "longSum", "longMin",
  "longMax", "doubleMin", "doubleMax", "floatMin", "floatMax", "longAny", 
"doubleAny", "floatAny", "stringAny",
  "hyperUnique", "filtered", "approxHistogram", "approxHistogramFold", and 
"fixedBucketsHistogram" (with numerical input). 
-- No virtual columns.
+- All virtual columns must offer vectorized implementations.

Review comment:
       There's only one built-in virtual column: "expression". So, we should 
mention when that specific type of virtual column can be vectorized. (Does it 
depend on the type of expression? Types of input? Functions used? etc)

##########
File path: docs/querying/query-context.md
##########
@@ -106,3 +106,4 @@ vectorization. These query types will ignore the 
"vectorize" parameter even if i
 |--------|-------|------------|
 |vectorize|`true`|Enables or disables vectorized query execution. Possible 
values are `false` (disabled), `true` (enabled if possible, disabled otherwise, 
on a per-segment basis), and `force` (enabled, and groupBy or timeseries 
queries that cannot be vectorized will fail). The `"force"` setting is meant to 
aid in testing, and is not generally useful in production (since real-time 
segments can never be processed with vectorized execution, any queries on 
real-time data will fail). This will override 
`druid.query.default.context.vectorize` if it's set.|
 |vectorSize|`512`|Sets the row batching size for a particular query. This will 
override `druid.query.default.context.vectorSize` if it's set.|
+|vectorizeVirtualColumns|`true`|Enables or disables vectorized query 
processing of queries with virtual columns, layered on top of `vectorize` 
(`vectorize` must also be set to true for a query to utilize vectorization). 
Possible values are `false` (disabled), `true` (enabled if possible, disabled 
otherwise, on a per-segment basis), and `force` (enabled, and groupBy or 
timeseries queries with virtual columns that cannot be vectorized will fail). 
The `"force"` setting is meant to aid in testing, and is not generally useful 
in production. This will override 
`druid.query.default.context.vectorizeVirtualColumns` if it's set.|

Review comment:
       IMO, it'd be best to set it to `false` now, then change it to `true` for 
the next release. It'll lower the risk around upgrading, and we want upgrades 
to be really low risk so people feel comfortable just dropping in the newest 
version.
   
   The specific risks here are:
   
   1) Bugs in vectorized expressions.
   2) Situations where vectorized expressions perform worse than non-vectorized 
ones (perhaps some optimizations are implemented for the non-vectorized case, 
but not the vectorized case).
   3) Situations where there is a bug in the vectorized query engines _in 
general_, but a specific user isn't hitting it, because they're using 
expressions and so the query wouldn't vectorize anyway.

##########
File path: 
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/vector/VectorGroupByEngine.java
##########
@@ -76,10 +76,18 @@ public static boolean canVectorize(
     Function<String, ColumnCapabilities> capabilitiesFunction = name ->
         query.getVirtualColumns().getColumnCapabilitiesWithFallback(adapter, 
name);
 
+    final boolean adapterCanVectorize = adapter.canVectorize(filter, 
query.getVirtualColumns(), false);
+    final boolean virtualColumnsCanVectorize;
+    if (query.getVirtualColumns().getVirtualColumns().length > 0) {

Review comment:
       There's some duplication here between the query engines — how about 
creating a helper that takes the query and the virtual columns and creates a 
`Vectorize` instance that synthesizes both?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] gianm commented on a change in pull request #10432: add vectorizeVirtualColumns query context parameter

Reply via email to