[GitHub] [druid] paul-rogers commented on a diff in pull request #13793: Add validation for aggregations on __time

via GitHub Wed, 01 Mar 2023 18:31:56 -0800


paul-rogers commented on code in PR #13793:
URL: https://github.com/apache/druid/pull/13793#discussion_r1122534301



##########
sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java:
##########
@@ -196,6 +197,16 @@ public Aggregation toDruidAggregation(
 
     final String fieldName = getColumnName(plannerContext, 
virtualColumnRegistry, args.get(0), rexNodes.get(0));
 
+    if (!rowSignature.contains(ColumnHolder.TIME_COLUMN_NAME) && 
(aggregatorType == AggregatorType.LATEST || aggregatorType == 
AggregatorType.EARLIEST)) {
+      plannerContext.setPlanningError("%s() aggregator depends on __time 
column, the underlying datasource "
+                                      + "or extern function you are querying 
doesn't contain __time column, "
+                                      + "Please use %s_BY() and specify the 
time column you want to use",
+                                      aggregatorType.name(),
+                                      aggregatorType.name()
+      );
+      return null;
+    }

Review Comment:
   I get that the user *could* use the alternative. The goal here, _should be_ 
that the user can use the actual features which Druid claims it supports.
   
   That is, I'm asserting that the goal of this PR should not be not to accept 
broken behavior, and tell the user not to use the broken stuff, but to fix the 
behavior so what should work _does_ work.
   
   In short, we don't want to wrap a bug with nicer error messages. We want to 
fix the bug. (IMHO.)



##########
extensions-core/multi-stage-query/src/test/java/org/apache/druid/msq/exec/MSQSelectTest.java:
##########
@@ -1363,6 +1363,39 @@ public void testGroupByArrayWithMultiValueMvToArray()
         .verifyResults();
   }
 
+  @Test
+  public void testTimeColumnAggregationFromExtern() throws IOException
+  {
+    final File toRead = 
MSQTestFileUtils.getResourceAsTemporaryFile(temporaryFolder, this, 
"/wikipedia-sampled.json");
+    final String toReadAsJson = 
queryFramework().queryJsonMapper().writeValueAsString(toRead.getAbsolutePath());
+
+    RowSignature rowSignature = RowSignature.builder()
+                                            .add("__time", ColumnType.LONG)
+                                            .add("cnt", ColumnType.LONG)
+                                            .build();
+
+    testSelectQuery()
+        .setSql("WITH\n"
+                + "kttm_data AS (\n"
+                + "SELECT * FROM TABLE(\n"
+                + "  EXTERN(\n"
+                + "    '{ \"files\": [" + toReadAsJson + 
"],\"type\":\"local\"}',\n"
+                + "    '{\"type\":\"json\"}',\n"
+                + "    '[{\"name\": \"timestamp\", \"type\": \"string\"}, 
{\"name\": \"page\", \"type\": \"string\"}, {\"name\": \"user\", \"type\": 
\"string\"}]'\n"
+                + "  )\n"
+                + "))\n"
+                + "\n"
+                + "SELECT\n"
+                + "  FLOOR(TIME_PARSE(\"timestamp\") TO MINUTE) AS __time,\n"
+                + "  LATEST(\"page\") AS \"page\"\n"
+                + "FROM kttm_data "
+                + "GROUP BY 1")
+        .setExpectedValidationErrorMatcher(
+            
ThrowableMessageMatcher.hasMessage(CoreMatchers.containsString("LATEST() 
aggregator depends on __time column"))

Review Comment:
   Sorry, I didn't follow. Maybe my note was too terse?
   
   The query here should be legal. `page` is a `VARCHAR` (Druid `string`) 
column. Segments allow `LATEST(string)` (though not `LATEST` of numeric types.) 
MSQ is the modern way to create rollup tables.
   
   Now, this query might fail because we haven't passed the context parameter 
to ask for a rolled up datasource. But, in that case, aggregations are applied 
first, and the results written to the datasource as a detail table. In that 
case, the above query should also work.
   
   The test, however, expects the query to fail. My comment is noting that this 
is not what should happen. (I get that it is what this PR _wants_ to happen.) 
What should happen is that this is legal, and we emit segments based on the 
`finalzieAggregations` context parameter.
   
   So, the ask is simple: let's find out what it take so that this query passes 
and (with the right context setting), emits an intermediate 
`COMPLEX<Pair<string, long>>` value into the output segments.



##########
sql/src/test/java/org/apache/druid/sql/calcite/CalciteJoinQueryTest.java:
##########
@@ -1447,12 +1447,24 @@ public void testInnerJoinQueryOfLookup(Map<String, 
Object> queryContext)
                 .build()
         ),
         ImmutableList.of(
-            new Object[]{"", "a", "xabc", "xabc"},
-            new Object[]{"1", "a", "xabc", "xabc"}
+            new Object[]{"", "a", "xa", "xa"},
+            new Object[]{"1", "a", "xa", "xa"}
         )
     );
   }
 
+  @Test(expected = UnsupportedSQLQueryException.class)

Review Comment:
   Absolutely. Every datasource row has a `__time` column.
   
   Here's another way to express this for the `SELECT` case. If the query works 
for the interactive SQL engine, it should work for the MSQ query engine. It is 
not a benefit to the user that `LATEST(x)` works on one engine but not the 
other. It is, instead, a bug to be fixed.
   
   It certainly true that it might be hard to fix given how MSQ does things for 
some reason. That just means fixing it might be hard.
   
   If, however, this query does not work on the interactive SQL engine, then 
we've got larger issues which we may not be able to fix in this PR. One issue 
is, "what does it mean to offer `LATEST(x)` if it doesn't actually work?"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] paul-rogers commented on a diff in pull request #13793: Add validation for aggregations on __time

Reply via email to