[GitHub] [druid] imply-cheddar commented on a diff in pull request #13085: Refactoring the data source before unnest

GitBox Thu, 29 Sep 2022 00:11:36 -0700


imply-cheddar commented on code in PR #13085:
URL: https://github.com/apache/druid/pull/13085#discussion_r983131194



##########
processing/src/main/java/org/apache/druid/query/Queries.java:
##########
@@ -191,11 +191,19 @@ public static <T> Query<T> withBaseDataSource(final 
Query<T> query, final DataSo
   {
     final Query<T> retVal;
 
-    if (query.getDataSource() instanceof QueryDataSource) {
-      final Query<?> subQuery = ((QueryDataSource) 
query.getDataSource()).getQuery();
+    /*
+     * Currently, this method is implemented in terms of a static walk doing a 
bunch of instanceof checks.
+     * We should likely look into moving this functionality into the 
DataSource object itself so that they
+     * can walk and create new objects on their own.  This will be necessary 
as we expand the set of DataSources
+     * that do actual work, as each of them will need to show up in this 
if/then waterfall.
+     */

Review Comment:
   I'm pretty sure it should be safe to move something logically equivalent to 
this method into a method on `DataSource` itself.  Let's do that inside of this 
refactor, so that we leave this refactor with hopefully never needing to have 
random methods of `instanceof` checks again.



##########
processing/src/main/java/org/apache/druid/query/JoinDataSource.java:
##########
@@ -127,12 +155,22 @@ public static JoinDataSource create(
       final String rightPrefix,
       final JoinConditionAnalysis conditionAnalysis,
       final JoinType joinType,
-      final DimFilter leftFilter
+      final DimFilter leftFilter,
+      @Nullable @JacksonInject final JoinableFactoryWrapper 
joinableFactoryWrapper

Review Comment:
   Having `@JacksonInject` on a method that isn't used by Jackson can be 
confusing (it will make future developers try to figure out how Jackson uses 
this method).  I don't think it needs to be there.



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/querykit/BaseLeafFrameProcessor.java:
##########
@@ -152,13 +153,23 @@ private boolean initializeSegmentMapFn(final IntSet 
readableInputs)
       segmentMapFn = Function.identity();
       return true;
     } else {
-      final boolean retVal = 
broadcastJoinHelper.buildBroadcastTablesIncrementally(readableInputs);
-
-      if (retVal) {
-        segmentMapFn = broadcastJoinHelper.makeSegmentMapFn(query);
+      if (query.getDataSource() instanceof InputNumberDataSource) {
+        final boolean retVal = 
broadcastJoinHelper.buildBroadcastTablesIncrementally(readableInputs);
+        if (retVal) {
+          InputNumberDataSource inputNumberDataSource = 
(InputNumberDataSource) query.getDataSource();
+          // The InputNumberData source was going through the 
broadcastJoinHelper which
+          // was using the JoinableFactoryWrapper to create segment map 
function.
+          // After refactoring, the segment map function creation is moved to 
data source
+          // Hence for InputNumberDataSource we are setting the broadcast join 
helper for the data source
+          // and moving the segment map function creation there

Review Comment:
   Sorry to pick on a comment, but this comment does what most comments in code 
do: it talks from the context of the developer who is writing the code.  Any 
new developer who comes and reads the code for the first time doesn't even know 
that this code was refactored, let alone know why it was refactored or what the 
state of the code was before the refactor.  Instead of explain the change of 
the refactor, the comment should attempt to help explain what the current, 
post-refactor code is attempting to accomplish.  I.e. something like
   
   ```
   // The InputNumberDataSource requires a BroadcastJoinHelper to be able to 
create its
   // segment map function.  It would be a lot better if the 
InputNumberDataSource actually
   // had a way to get that injected into it on its own, but the relationship 
between these objects
   // was figured out during a refactor and using a setter here seemed like the 
least-bad way to
   // make progress on the refactor without breaking functionality.  Hopefully, 
some future 
   // developer will move this away from a setter.
   ```
   
   Or something like that.



##########
extensions-core/multi-stage-query/src/test/java/org/apache/druid/msq/test/MSQTestBase.java:
##########
@@ -565,6 +572,8 @@ private ObjectMapper setupObjectMapper(Injector injector)
                                                                "compaction"
                                                            )
                                                        
).registerSubtypes(ExternalDataSource.class));
+    DruidSecondaryModule.setupJackson(injector, mapper);
+/*

Review Comment:
   Delete the code instead of comment it out please.



##########
server/src/main/java/org/apache/druid/server/LocalQuerySegmentWalker.java:
##########
@@ -94,12 +93,12 @@ public <T> QueryRunner<T> getQueryRunnerForIntervals(final 
Query<T> query, final
 
     final AtomicLong cpuAccumulator = new AtomicLong(0L);
 
-    final Function<SegmentReference, SegmentReference> segmentMapFn = 
joinableFactoryWrapper.createSegmentMapFn(
-        analysis.getJoinBaseTableFilter().map(Filters::toFilter).orElse(null),
-        analysis.getPreJoinableClauses(),
-        cpuAccumulator,
-        analysis.getBaseQuery().orElse(query)
-    );
+    Function<SegmentReference, SegmentReference> segmentMapFn = 
analysis.getDataSource()
+                                                                        
.createSegmentMapFunction(
+                                                                            
query,
+                                                                            
cpuAccumulator
+                                                                        );
+

Review Comment:
   Now that there's a lot less arguments here, I'm pretty sure this formatting 
is no longer the best way to represent this code.  Please condense things.



##########
processing/src/main/java/org/apache/druid/query/InlineDataSource.java:
##########
@@ -102,6 +105,55 @@ public static InlineDataSource fromIterable(
     return new InlineDataSource(rows, signature);
   }
 
+  /**
+   * A very zealous equality checker for "rows" that respects deep equality of 
arrays, but nevertheless refrains
+   * from materializing things needlessly. Useful for unit tests that want to 
compare equality of different
+   * InlineDataSource instances.
+   */
+  private static boolean rowsEqual(final Iterable<Object[]> rowsA, final 
Iterable<Object[]> rowsB)

Review Comment:
   These private methods seem to have moved above some public methods, was it 
just your IDE doing stuff or was there a specific reason?  Generally speaking, 
the flow is always public methods first.



##########
processing/src/main/java/org/apache/druid/segment/join/JoinableFactoryWrapper.java:
##########
@@ -340,12 +192,60 @@ static JoinClauseToFilterConversion convertJoinToFilter(
     return new JoinClauseToFilterConversion(null, false);
   }
 
+  public JoinableFactory getJoinableFactory()
+  {
+    return joinableFactory;
+  }
+
+  /**
+   * Compute a cache key prefix for a join data source. This includes the data 
sources that participate in the RHS of a
+   * join as well as any query specific constructs associated with join data 
source such as base table filter. This key prefix
+   * can be used in segment level cache or result level cache. The function 
can return following wrapped in an

Review Comment:
   A bunch of this code looks like it changed, but I think it just had methods 
re-ordered?  Was that intentional?



##########
processing/src/main/java/org/apache/druid/segment/join/JoinableFactoryWrapper.java:
##########
@@ -340,12 +192,60 @@ static JoinClauseToFilterConversion convertJoinToFilter(
     return new JoinClauseToFilterConversion(null, false);
   }
 
+  public JoinableFactory getJoinableFactory()
+  {
+    return joinableFactory;
+  }
+
+  /**
+   * Compute a cache key prefix for a join data source. This includes the data 
sources that participate in the RHS of a
+   * join as well as any query specific constructs associated with join data 
source such as base table filter. This key prefix
+   * can be used in segment level cache or result level cache. The function 
can return following wrapped in an
+   * Optional
+   * - Non-empty byte array - If there is join datasource involved and caching 
is possible. The result includes
+   * join condition expression, join type and cache key returned by joinable 
factory for each {@link PreJoinableClause}
+   * - NULL - There is a join but caching is not possible. It may happen if 
one of the participating datasource
+   * in the JOIN is not cacheable.
+   *
+   * @param dataSourceAnalysis for the join datasource
+   * @return the optional cache key to be used as part of query cache key
+   * @throws {@link IAE} if this operation is called on a non-join data source
+   */
+  public Optional<byte[]> computeJoinDataSourceCacheKey(

Review Comment:
   The existence of this method is another point where things seem a little 
suspect.  I think the `DataSource` object needs a `public byte[] getCacheKey()` 
method.  Most of the current implementations would just return `new byte[]{}`, 
but the join one should return this (or null if not cacheable).  The unnest 
implementation will need to override the method as well. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] imply-cheddar commented on a diff in pull request #13085: Refactoring the data source before unnest

Reply via email to