[GitHub] [druid] clintropolis commented on a change in pull request #10613: expression filter support for vectorized query engines

GitBox Sat, 09 Jan 2021 19:57:14 -0800


clintropolis commented on a change in pull request #10613:
URL: https://github.com/apache/druid/pull/10613#discussion_r553814648




##########
File path: 
processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java
##########
@@ -305,17 +305,29 @@ private static ColumnCapabilities 
getEffectiveCapabilities(
         originalCapabilities == null || 
ValueType.COMPLEX.equals(originalCapabilities.getType());
 
     if (type == ValueType.STRING) {
-      if (!forceSingleValue && 
effectiveCapabilites.hasMultipleValues().isMaybeTrue()) {
-        return strategyFactory.makeMultiValueDimensionProcessor(
-            effectiveCapabilites,
-            selectorFactory.makeMultiValueDimensionSelector(dimensionSpec)
-        );
-      } else {
-        return strategyFactory.makeSingleValueDimensionProcessor(
-            effectiveCapabilites,
-            selectorFactory.makeSingleValueDimensionSelector(dimensionSpec)
-        );
+      if (!forceSingleValue) {
+        // if column is not dictionary encoded (and not non-existent or 
complex), use object selector
+        if (effectiveCapabilites.isDictionaryEncoded().isFalse() ||
+            effectiveCapabilites.areDictionaryValuesUnique().isFalse()

Review comment:
       by regular string selector do you mean dimension selector? The answer is 
yes we _could_, but if the dictionary values are not unique, i came to the 
conclusion that they aren't especially useful since in all the cases I can 
think of you can't use them as is and you're going to have to lookup the string 
values anyway (e.g. filter matching and vector group by).
   
   I was trying to provide an avenue to avoid the extra steps with expression 
virtual columns, and allow an object selector to be created directly since the 
string is the useable part in this case, and the dictionary ids just feel like 
an obstruction. But maybe this is overly broad and there are cases where we 
would want it to act like a dictionary encoded column for some reason that I 
have missed?

##########
File path: 
processing/src/main/java/org/apache/druid/segment/filter/ExpressionFilter.java
##########
@@ -55,6 +69,105 @@ public ExpressionFilter(final Supplier<Expr> expr, final 
FilterTuning filterTuni
     this.filterTuning = filterTuning;
   }
 
+  @Override
+  public boolean canVectorizeMatcher(ColumnInspector inspector)
+  {
+    return expr.get().canVectorize(inspector);
+  }
+
+  @Override
+  public VectorValueMatcher makeVectorMatcher(VectorColumnSelectorFactory 
factory)
+  {
+    final Expr theExpr = expr.get();
+
+    DruidPredicateFactory predicateFactory = new DruidPredicateFactory()
+    {
+      @Override
+      public Predicate<String> makeStringPredicate()
+      {
+        return Evals::asBoolean;
+      }
+
+      @Override
+      public DruidLongPredicate makeLongPredicate()
+      {
+        return Evals::asBoolean;
+      }
+
+      @Override
+      public DruidFloatPredicate makeFloatPredicate()
+      {
+        return Evals::asBoolean;
+      }
+
+      @Override
+      public DruidDoublePredicate makeDoublePredicate()
+      {
+        return Evals::asBoolean;
+      }
+
+      // The hashcode and equals are to make 
SubclassesMustOverrideEqualsAndHashCodeTest stop complaining..
+      // DruidPredicateFactory doesn't really need equals or hashcode, in fact 
only the 'toString' method is called

Review comment:
       yeah, I was planning on addressing this in a separate PR, the comments 
are partially to help remind us to do it in case I didn't get to it immediately

##########
File path: 
processing/src/main/java/org/apache/druid/segment/filter/ExpressionFilter.java
##########
@@ -55,6 +69,105 @@ public ExpressionFilter(final Supplier<Expr> expr, final 
FilterTuning filterTuni
     this.filterTuning = filterTuning;
   }
 
+  @Override
+  public boolean canVectorizeMatcher(ColumnInspector inspector)
+  {
+    return expr.get().canVectorize(inspector);
+  }
+
+  @Override
+  public VectorValueMatcher makeVectorMatcher(VectorColumnSelectorFactory 
factory)
+  {
+    final Expr theExpr = expr.get();
+
+    DruidPredicateFactory predicateFactory = new DruidPredicateFactory()
+    {
+      @Override
+      public Predicate<String> makeStringPredicate()
+      {
+        return Evals::asBoolean;
+      }
+
+      @Override
+      public DruidLongPredicate makeLongPredicate()
+      {
+        return Evals::asBoolean;
+      }
+
+      @Override
+      public DruidFloatPredicate makeFloatPredicate()
+      {
+        return Evals::asBoolean;
+      }
+
+      @Override
+      public DruidDoublePredicate makeDoublePredicate()
+      {
+        return Evals::asBoolean;
+      }
+
+      // The hashcode and equals are to make 
SubclassesMustOverrideEqualsAndHashCodeTest stop complaining..
+      // DruidPredicateFactory doesn't really need equals or hashcode, in fact 
only the 'toString' method is called
+      // when testing equality of DimensionPredicateFilter, so it's the truly 
required method, but even that seems
+      // strange at best.
+      // Rather than tackle removing the annotation and reworking equality 
tests for now, will leave this to refactor
+      // as part of https://github.com/apache/druid/issues/8256 which suggests 
combining Filter and DimFilter
+      // implementations, which should clean up some of this mess.
+      @Override
+      public int hashCode()
+      {
+        return super.hashCode();
+      }
+
+      @Override
+      public boolean equals(Object obj)
+      {
+        return super.equals(obj);
+      }
+    };
+
+
+    final ExprType outputType = theExpr.getOutputType(factory);
+
+    if (outputType == null) {
+      // if an expression is vectorizable, but the output type is null, the 
result will be null (or the default
+      // value in default mode) because expression is either all null 
constants or missing columns
+
+      // in sql compatible mode, this means no matches ever, so just use the 
false matcher:
+      if (NullHandling.sqlCompatible()) {
+        return new FalseVectorMatcher(factory.getVectorSizeInspector());
+      }
+      // in default mode, just fallback to using a long matcher since nearly 
all boolean-ish expressions
+      // output a long value so it is probably a safe bet? idk, ending up here 
by using all null-ish things
+      // in default mode is dancing on the edge of insanity anyway...
+      return 
VectorValueMatcherColumnProcessorFactory.instance().makeLongProcessor(
+          
ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(ValueType.LONG),
+          ExpressionVectorSelectors.makeVectorValueSelector(factory, theExpr)
+      ).makeMatcher(predicateFactory);
+    }
+
+    switch (outputType) {
+      case LONG:
+        return 
VectorValueMatcherColumnProcessorFactory.instance().makeLongProcessor(
+            
ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(ValueType.LONG),
+            ExpressionVectorSelectors.makeVectorValueSelector(factory, theExpr)
+        ).makeMatcher(predicateFactory);
+      case DOUBLE:
+        return 
VectorValueMatcherColumnProcessorFactory.instance().makeDoubleProcessor(
+            
ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(ValueType.DOUBLE),
+            ExpressionVectorSelectors.makeVectorValueSelector(factory, theExpr)
+        ).makeMatcher(predicateFactory);
+      case STRING:
+        return 
VectorValueMatcherColumnProcessorFactory.instance().makeObjectProcessor(
+            // using 'numeric' capabilities creator so we are configured to 
NOT be dictionary encoded, etc
+            
ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(ValueType.STRING),

Review comment:
       it is strange, but on purpose for the reason the comment says since the 
numeric capabilities defaults are correct in this case. I will make sure all 
current users are numeric types, and if so then just make these capabilites 
from scratch because we should be counting on something geared towards numeric 
behavior in case those change in some unexpected way. Or, if it is used by 
other non-numbers then rename it to something more appropriate, like 
`createSimpleSingleValueColumnCapabilites` or similar

##########
File path: 
processing/src/main/java/org/apache/druid/segment/vector/VectorColumnSelectorFactory.java
##########
@@ -57,6 +57,15 @@ default int getMaxVectorSize()
    */
   MultiValueDimensionVectorSelector 
makeMultiValueDimensionSelector(DimensionSpec dimensionSpec);
 
+  /**
+   * Returns an object selector for string columns, useful for non-dictionary 
encoded strings, or when

Review comment:
       hmm, i have no idea why I named this function like this, it should 
probably be called `makeDimensionObjectSelector`. Its just expected to make a 
`VectorObjectSelector` from a `DimensionSpec` rather than a raw column name

##########
File path: 
processing/src/main/java/org/apache/druid/query/filter/DruidPredicateFactory.java
##########
@@ -27,6 +27,13 @@
 {
   Predicate<String> makeStringPredicate();
 
+  default Predicate<Object> makeObjectPredicate()
+  {
+    // default to try to use string predicate;
+    final Predicate<String> stringPredicate = makeStringPredicate();
+    return o -> stringPredicate.apply((String) o);

Review comment:
       uh, it is sort of built on hope right now... the hope that it will 
probably be a string because the only context it will currently happen in is 
from making vector matcher object processors on string expression columns, but 
if we opened it up to additional non-primitive number types like array 
expressions and complex column types then this would need to be reconsidered, 
or stronger checks in place that the filter supports filtering on the 
underlying object type. I might need to think about this a bit deeper.

##########
File path: 
processing/src/main/java/org/apache/druid/query/filter/DruidPredicateFactory.java
##########
@@ -27,6 +27,13 @@
 {
   Predicate<String> makeStringPredicate();
 
+  default Predicate<Object> makeObjectPredicate()
+  {
+    // default to try to use string predicate;

Review comment:
       currently I don't think it should be possible because only expression 
filter will make a matcher factory with object matching predicate and only for 
string column types, but i guess cast exceptions if an object selector of a 
complex column somehow got in here from future code, see other comment about 
thinking a bit further about this 😅 

##########
File path: 
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/vector/GroupByVectorColumnProcessorFactory.java
##########
@@ -64,7 +65,16 @@ public GroupByVectorColumnSelector 
makeMultiValueDimensionProcessor(
         ValueType.STRING == capabilities.getType(),
         "groupBy dimension processors must be STRING typed"
     );
-    throw new UnsupportedOperationException("Multi-value dimensions not yet 
implemented for vectorized groupBys");
+    throw new UnsupportedOperationException("Multi-value dimensions are not 
yet implemented for vectorized groupBys");
+  }
+
+  @Override
+  public GroupByVectorColumnSelector makeObjectProcessor(
+      ColumnCapabilities capabilities,
+      VectorObjectSelector selector
+  )
+  {
+    throw new UnsupportedOperationException("Object columns are not yet 
implemented for vectorized groupBys");

Review comment:
       yes, I have a related branch that I will open as a follow-up to this PR 
that adds support for group-by queries on vectorized string expressions, using 
an object selector and a newly added dictionary building 
`GroupByVectorColumnSelector` similar to the non-vectorized group by code-path. 
In that branch this part checks that the column capabilities are string typed, 
and if so then will make the new dictionary building selector, else throw an 
exception for being an unsupported type.
   
   It seems beneficial to not have to make the string expression vector results 
have to masquerade as dictionary encoded, to be translated to string values to 
be re-encoded into a new dictionary, which is part of my reason for pushing for 
object selectors for the use case

##########
File path: 
processing/src/main/java/org/apache/druid/query/dimension/DimensionSpec.java
##########
@@ -70,6 +71,11 @@ default MultiValueDimensionVectorSelector 
decorate(MultiValueDimensionVectorSele
     throw new UOE("DimensionSpec[%s] cannot vectorize", getClass().getName());
   }
 
+  default VectorObjectSelector decorate(VectorObjectSelector selector)

Review comment:
       This is used with the poorly named 
`VectorColumnSelectorFactory.makeStringObjectSelector`, which should be renamed 
`VectorColumnSelectorFactory.makeDimensionObjectSelector` or something, which 
is for doing dimensiony things with a vector object selector similar to what is 
done for `SingleValueDimensionVectorSelector` or 
`MultiValueDimensionVectorSelector`. There are currently no implementations of 
this, like is the case with the other decorate methods with vector selectors, i 
was just trying for symmetry here so that the object selector could behave 
similar to the other string selectors when used in this context

##########
File path: 
core/src/main/java/org/apache/druid/math/expr/BinaryLogicalOperatorExpr.java
##########
@@ -74,7 +74,7 @@ public ExprType getOutputType(InputBindingInspector inspector)
   @Override
   public boolean canVectorize(InputBindingInspector inspector)
   {
-    return inspector.areNumeric(left, right) && inspector.canVectorize(left, 
right);

Review comment:
       i _think_ canVectorize will still end up false because array 
types/functions are not yet vectorizable, but yeah i guess array types would 
not be well handled here even if they were, though maybe that is the case for 
non-vectorized expression as well. I'll do a bit of exploration and see what up




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] clintropolis commented on a change in pull request #10613: expression filter support for vectorized query engines

Reply via email to