Re: [PR] Support Union in Decoupled planning (druid)

via GitHub Wed, 30 Oct 2024 15:15:59 -0700


imply-cheddar commented on code in PR #17354:
URL: https://github.com/apache/druid/pull/17354#discussion_r1823432991



##########
processing/src/main/java/org/apache/druid/query/DefaultQueryRunnerFactoryConglomerate.java:
##########
@@ -19,29 +19,52 @@
 
 package org.apache.druid.query;
 
+import com.google.common.collect.Maps;
 import com.google.inject.Inject;
-
 import java.util.IdentityHashMap;
 import java.util.Map;
 
-/**
-*/
 public class DefaultQueryRunnerFactoryConglomerate implements 
QueryRunnerFactoryConglomerate
 {
   private final Map<Class<? extends Query>, QueryRunnerFactory> factories;
+  private final Map<Class<? extends Query>, QueryToolChest> toolchests;
+
+  public static DefaultQueryRunnerFactoryConglomerate 
buildFromQueryRunnerFactories(
+      Map<Class<? extends Query>, QueryRunnerFactory> factories)
+  {
+    return new DefaultQueryRunnerFactoryConglomerate(factories, 
Maps.transformValues(factories, f -> f.getToolchest()));
+  }
 
   @Inject
-  public DefaultQueryRunnerFactoryConglomerate(Map<Class<? extends Query>, 
QueryRunnerFactory> factories)
+  public DefaultQueryRunnerFactoryConglomerate(Map<Class<? extends Query>, 
QueryRunnerFactory> factories,
+      Map<Class<? extends Query>, QueryToolChest> toolchests)

Review Comment:
   I think that the whitespace here will trip up checkstyle.



##########
processing/src/main/java/org/apache/druid/query/union/UnionQueryQueryToolChest.java:
##########
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.union;
+
+import com.fasterxml.jackson.core.type.TypeReference;
+import com.google.common.base.Function;
+import com.google.common.base.Functions;
+import com.google.inject.Inject;
+import org.apache.druid.error.DruidException;
+import org.apache.druid.frame.allocation.MemoryAllocatorFactory;
+import org.apache.druid.java.util.common.guava.Sequence;
+import org.apache.druid.query.DefaultQueryMetrics;
+import org.apache.druid.query.FrameSignaturePair;
+import org.apache.druid.query.Query;
+import org.apache.druid.query.QueryLogic;
+import org.apache.druid.query.QueryMetrics;
+import org.apache.druid.query.QueryRunner;
+import org.apache.druid.query.QueryRunnerFactoryConglomerate;
+import org.apache.druid.query.QuerySegmentWalker;
+import org.apache.druid.query.QueryToolChest;
+import org.apache.druid.query.aggregation.MetricManipulationFn;
+import org.apache.druid.segment.column.RowSignature;
+import org.apache.druid.segment.column.RowSignature.Finalization;
+
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+public class UnionQueryQueryToolChest extends QueryToolChest<Object, 
UnionQuery>

Review Comment:
   With the introduction of the `QueryLogic` interface, it *should* be possible 
to completely avoid implementing this class altogether.  If it's not, I'm 
curious why...



##########
server/src/main/java/org/apache/druid/server/ClientQuerySegmentWalker.java:
##########
@@ -218,6 +231,7 @@ public <T> QueryRunner<T> getQueryRunnerForIntervals(final 
Query<T> query, final
       // Dry run didn't go well.
       throw new ISE("Cannot handle subquery structure for dataSource: %s", 
query.getDataSource());
     }
+    assert toolChest.canExecuteFully(query.withDataSource(inlineDryRun));

Review Comment:
   Let's not commit the assert.  I've seen asserts show up in profiling of 
production systems before.  I know that they are supposed to be noops and I 
have no idea how to explain why I've seen them show up.  If it's truly 
important to check this here, then it should be an if with an exception, if 
production ever fails from this assert, an `AssertionError` doesn't really 
provide much help.



##########
processing/src/main/java/org/apache/druid/frame/allocation/ArenaMemoryAllocatorFactory.java:
##########
@@ -42,4 +44,9 @@ public long allocatorCapacity()
   {
     return capacity;
   }
+
+  public static MemoryAllocatorFactory makeDefault()
+  {
+    return new ArenaMemoryAllocatorFactory(FRAME_SIZE);
+  }

Review Comment:
   I have a general fear of configs being hidden inside of static initializers. 
 I had assumed this was for tests, but just checked and it's not.  I think it 
would be better to have the default be in the place that you have calling this. 
 That is the type of parameter that it is often useful to adjust via a query 
context as well, so I'd suggest that it be a default read from the context.  
But, I'd also be okay with hard-coding the default there (I prefer it to exist 
there as it will be easier for a future developer to see it and adjust, where 
the pattern I've seen with these types of statics is that people see them, 
assume it's a good idea to use it in all of the places possible and then 
eventually you are trying to unwind it because you want it to be configurable)



##########
processing/src/main/java/org/apache/druid/query/DefaultQueryRunnerFactoryConglomerate.java:
##########
@@ -19,29 +19,52 @@
 
 package org.apache.druid.query;
 
+import com.google.common.collect.Maps;
 import com.google.inject.Inject;
-
 import java.util.IdentityHashMap;
 import java.util.Map;
 
-/**
-*/
 public class DefaultQueryRunnerFactoryConglomerate implements 
QueryRunnerFactoryConglomerate
 {
   private final Map<Class<? extends Query>, QueryRunnerFactory> factories;
+  private final Map<Class<? extends Query>, QueryToolChest> toolchests;
+
+  public static DefaultQueryRunnerFactoryConglomerate 
buildFromQueryRunnerFactories(
+      Map<Class<? extends Query>, QueryRunnerFactory> factories)
+  {
+    return new DefaultQueryRunnerFactoryConglomerate(factories, 
Maps.transformValues(factories, f -> f.getToolchest()));
+  }
 
   @Inject
-  public DefaultQueryRunnerFactoryConglomerate(Map<Class<? extends Query>, 
QueryRunnerFactory> factories)
+  public DefaultQueryRunnerFactoryConglomerate(Map<Class<? extends Query>, 
QueryRunnerFactory> factories,
+      Map<Class<? extends Query>, QueryToolChest> toolchests)
   {
-    // Accesses to IdentityHashMap should be faster than to HashMap or 
ImmutableMap.
-    // Class doesn't override Object.equals().
     this.factories = new IdentityHashMap<>(factories);
+    this.toolchests = new IdentityHashMap<>(toolchests);
   }
 
   @Override
   @SuppressWarnings("unchecked")
   public <T, QueryType extends Query<T>> QueryRunnerFactory<T, QueryType> 
findFactory(QueryType query)
   {
-    return (QueryRunnerFactory<T, QueryType>) factories.get(query.getClass());
+    return factories.get(query.getClass());
+  }
+
+  @Override
+  @SuppressWarnings("unchecked")
+  public <T, QueryType extends Query<T>> QueryToolChest<T, QueryType> 
getToolChest(QueryType query)
+  {
+    return toolchests.get(query.getClass());
+  }
+
+  @Override
+  @SuppressWarnings("unchecked")
+  public <T, QueryType extends Query<T>> QueryLogic getQueryLogic(QueryType 
query)
+  {
+    QueryToolChest<T, QueryType> toolchest = getToolChest(query);
+    if (toolchest instanceof QueryLogic) {
+      return (QueryLogic) toolchest;
+    }

Review Comment:
   I had imagined this as a completely brand new thing that gets registered.  
I.e. a 3rd Map being passed into the constructor here.  I'm not sure that 
there's a benefit to having them combined.  I just noticed that you left a 
comment suggesting this as well, but that you'd prefer to not do it just yet.  
I guess that's okay, but if it were me, I'd add the 3rd map and move on.



##########
processing/src/main/java/org/apache/druid/query/union/UnionQueryRunner.java:
##########
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.union;
+
+import org.apache.druid.java.util.common.guava.Sequence;
+import org.apache.druid.java.util.common.guava.Sequences;
+import org.apache.druid.query.Query;
+import org.apache.druid.query.QueryLogic;
+import org.apache.druid.query.QueryPlus;
+import org.apache.druid.query.QueryRunner;
+import org.apache.druid.query.QueryRunnerFactoryConglomerate;
+import org.apache.druid.query.QuerySegmentWalker;
+import org.apache.druid.query.context.ResponseContext;
+
+import java.util.ArrayList;
+import java.util.List;
+
+class UnionQueryRunner implements QueryRunner<Object>
+{
+  private final QueryRunnerFactoryConglomerate conglomerate;
+  private final QuerySegmentWalker walker;
+  private final List<QueryRunner> runners;
+
+  public UnionQueryRunner(
+      UnionQuery query,
+      QueryRunnerFactoryConglomerate conglomerate,
+      QuerySegmentWalker walker)
+  {
+    this.conglomerate = conglomerate;
+    this.walker = walker;
+    this.runners = makeSubQueryRunners(query);
+  }
+
+  private List<QueryRunner> makeSubQueryRunners(UnionQuery unionQuery)
+  {
+    List<QueryRunner> runners = new ArrayList<>();
+    for (Query<?> query : unionQuery.queries) {
+      runners.add(buildRunnerFor(query));
+    }
+    return runners;
+  }
+
+  private QueryRunner<?> buildRunnerFor(Query<?> query)
+  {
+    QueryLogic queryLogic = getQueryLogicFor(query);
+    return queryLogic.entryPoint(query, walker);
+  }
+
+  private QueryLogic getQueryLogicFor(Query<?> query)
+  {
+    QueryLogic queryLogic = conglomerate.getQueryLogic(query);
+    if (queryLogic != null) {
+      return queryLogic;
+    }
+    return new ToolChestBasedQueryLogic(conglomerate.getToolChest(query));

Review Comment:
   If I'm reading this code correctly, you have this object basically in order 
to wrap the QueryRunner such that it produces results with the correct output 
for the Union query.  That's an absolutely necessary bit of logic.  I don't 
think it's really correct to bundle that logic up into a `QueryLogic`.  A 
`QueryLogic` should be something that contains the nuggets of query processing 
and should be associated with the query type itself.  A developer would choose 
to implement a QueryLogic, but the existence of the `ToolChestBasedQueryLogic` 
makes it seem as if the class is a compatibility layer that can provide a quick 
"lift-and-shift" of existing `QueryToolChest` objects into the `QueryLogic` 
interface.  But, I don't think the implementation is actually that robust, it 
looks to be more specific to what this `UnionQueryRunner` needs.
   
   So, I suggest that you inline that logic here.  Specifically, you are 
returning this object and then immediately calling `.entryPoint()` on it.  In 
that case, you should be able to just put all of that code here and wrap the 
`QueryRunner` that is returned from `query.getRunner(walker)`.



##########
processing/src/main/java/org/apache/druid/error/DruidException.java:
##########
@@ -524,4 +524,8 @@ public String getErrorCode()
     protected abstract DruidException makeException(DruidExceptionBuilder bob);
   }
 
+  public static DruidException methodNotSupported()
+  {
+    return defensive("Method Not supported.");
+  }

Review Comment:
   A bit of a nit.  I find it useful for engineers to include words about why 
they think that it's not supported, so I'd prefer that this takes a String and 
people provide a reason of some sort.
   
   Fwiw, there is also a `NotYetImplemented` type that can be used for 
something that should likely be implemented but isn't yet.  Part of me wonders 
if the "this is completely not supported" should follow a similar pattern and 
have it's own class type.



##########
server/src/main/java/org/apache/druid/server/ClientQuerySegmentWalker.java:
##########
@@ -375,7 +385,43 @@ private <T> DataSource inlineIfNecessary(
     if (dataSource instanceof QueryDataSource) {
       // This datasource is a subquery.
       final Query subQuery = ((QueryDataSource) dataSource).getQuery();
-      final QueryToolChest toolChest = warehouse.getToolChest(subQuery);
+      final QueryToolChest toolChest = conglomerate.getToolChest(subQuery);
+      final QueryLogic subQueryLogic = conglomerate.getQueryLogic(subQuery);
+
+      if (subQueryLogic != null) {
+        final Sequence<?> queryResults;
+
+        if (dryRun) {
+          queryResults = Sequences.empty();
+        } else {
+          Query subQueryWithSerialization = subQuery.withOverriddenContext(
+              Collections.singletonMap(
+                  ResultSerializationMode.CTX_SERIALIZATION_PARAMETER,
+                  
ClientQuerySegmentWalkerUtils.getLimitType(maxSubqueryMemory, 
cannotMaterializeToFrames.get())
+                      .serializationMode()
+                      .toString()
+              )
+          );
+          queryResults = subQueryLogic
+              .entryPoint(subQueryWithSerialization, this)
+              .run(QueryPlus.wrap(subQueryWithSerialization), 
DirectDruidClient.makeResponseContextForQuery());
+        }
+
+        return toInlineDataSource(
+            subQuery,
+            queryResults,
+            conglomerate.getToolChest(subQuery),

Review Comment:
   you already got the toolChest on 388



##########
server/src/main/java/org/apache/druid/server/ClientQuerySegmentWalker.java:
##########
@@ -289,17 +303,15 @@ public <T> QueryRunner<T> getQueryRunnerForSegments(final 
Query<T> query, final
    */
   private <T> boolean canRunQueryUsingLocalWalker(Query<T> query)
   {
-    final DataSource dataSourceFromQuery = query.getDataSource();
-    final DataSourceAnalysis analysis = dataSourceFromQuery.getAnalysis();
-    final QueryToolChest<T, Query<T>> toolChest = 
warehouse.getToolChest(query);
+    final DataSourceAnalysis analysis = query.getDataSourceAnalysis();
+    final QueryToolChest<T, Query<T>> toolChest = 
conglomerate.getToolChest(query);
 
     // 1) Must be based on a concrete datasource that is not a table.
     // 2) Must be based on globally available data (so we have a copy here on 
the Broker).
     // 3) If there is an outer query, it must be handleable by the query 
toolchest (the local walker does not handle
     //    subqueries on its own).
-    return analysis.isConcreteBased() && !analysis.isConcreteAndTableBased() 
&& dataSourceFromQuery.isGlobal()
-           && (!(dataSourceFromQuery instanceof QueryDataSource)
-               || toolChest.canPerformSubquery(((QueryDataSource) 
dataSourceFromQuery).getQuery()));
+    return analysis.isConcreteBased() && !analysis.isConcreteAndTableBased() 
&& analysis.isGlobal()
+        && toolChest.canExecuteFully(query);

Review Comment:
   IIRC, the reason you need to check this `canExecuteFully` is when you have a 
`ScanQuery` wrapping a Union.  Generally speaking, part of the goal of 
introducing the `QueryLogic` is to work around the `ClientQuerySegmentWalker`'s 
insistence that is knows which query to run first.  Given that, I think that 
any query with implements `QueryLogic` should be excluded from the LocalWalker 
(return false) and only return true for the ClusterWalker (i.e. if the query 
implements `QueryLogic` and it wants to do "local" style things, the 
`QueryLogic` itself will implement it instead of counting on the 
`QuerySegmentWalker` to guess correctly).
   
   So, I wonder if we can adjust these booleans to just check for whether the 
`Query` has an implementation of `QueryLogic` and if it does, then answer 
accordingly.  I think that will mean that you no longer need the method on 
`QueryToolChest` and probably also eliminate the need to implement the 
`QueryToolChest` at all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Support Union in Decoupled planning (druid)

Reply via email to