paul-rogers commented on code in PR #13793:
URL: https://github.com/apache/druid/pull/13793#discussion_r1122534301
##########
sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java:
##########
@@ -196,6 +197,16 @@ public Aggregation toDruidAggregation(
final String fieldName = getColumnName(plannerContext,
virtualColumnRegistry, args.get(0), rexNodes.get(0));
+ if (!rowSignature.contains(ColumnHolder.TIME_COLUMN_NAME) &&
(aggregatorType == AggregatorType.LATEST || aggregatorType ==
AggregatorType.EARLIEST)) {
+ plannerContext.setPlanningError("%s() aggregator depends on __time
column, the underlying datasource "
+ + "or extern function you are querying
doesn't contain __time column, "
+ + "Please use %s_BY() and specify the
time column you want to use",
+ aggregatorType.name(),
+ aggregatorType.name()
+ );
+ return null;
+ }
Review Comment:
I get that the user *could* use the alternative. The goal here, _should be_
that the user can use the actual features which Druid claims it supports.
That is, I'm asserting that the goal of this PR should not be not to accept
broken behavior, and tell the user not to use the broken stuff, but to fix the
behavior so what should work _does_ work.
In short, we don't want to wrap a bug with nicer error messages. We want to
fix the bug. (IMHO.)
##########
extensions-core/multi-stage-query/src/test/java/org/apache/druid/msq/exec/MSQSelectTest.java:
##########
@@ -1363,6 +1363,39 @@ public void testGroupByArrayWithMultiValueMvToArray()
.verifyResults();
}
+ @Test
+ public void testTimeColumnAggregationFromExtern() throws IOException
+ {
+ final File toRead =
MSQTestFileUtils.getResourceAsTemporaryFile(temporaryFolder, this,
"/wikipedia-sampled.json");
+ final String toReadAsJson =
queryFramework().queryJsonMapper().writeValueAsString(toRead.getAbsolutePath());
+
+ RowSignature rowSignature = RowSignature.builder()
+ .add("__time", ColumnType.LONG)
+ .add("cnt", ColumnType.LONG)
+ .build();
+
+ testSelectQuery()
+ .setSql("WITH\n"
+ + "kttm_data AS (\n"
+ + "SELECT * FROM TABLE(\n"
+ + " EXTERN(\n"
+ + " '{ \"files\": [" + toReadAsJson +
"],\"type\":\"local\"}',\n"
+ + " '{\"type\":\"json\"}',\n"
+ + " '[{\"name\": \"timestamp\", \"type\": \"string\"},
{\"name\": \"page\", \"type\": \"string\"}, {\"name\": \"user\", \"type\":
\"string\"}]'\n"
+ + " )\n"
+ + "))\n"
+ + "\n"
+ + "SELECT\n"
+ + " FLOOR(TIME_PARSE(\"timestamp\") TO MINUTE) AS __time,\n"
+ + " LATEST(\"page\") AS \"page\"\n"
+ + "FROM kttm_data "
+ + "GROUP BY 1")
+ .setExpectedValidationErrorMatcher(
+
ThrowableMessageMatcher.hasMessage(CoreMatchers.containsString("LATEST()
aggregator depends on __time column"))
Review Comment:
Sorry, I didn't follow. Maybe my note was too terse?
The query here should be legal. `page` is a `VARCHAR` (Druid `string`)
column. Segments allow `LATEST(string)` (though not `LATEST` of numeric types.)
MSQ is the modern way to create rollup tables.
Now, this query might fail because we haven't passed the context parameter
to ask for a rolled up datasource. But, in that case, aggregations are applied
first, and the results written to the datasource as a detail table. In that
case, the above query should also work.
The test, however, expects the query to fail. My comment is noting that this
is not what should happen. (I get that it is what this PR _wants_ to happen.)
What should happen is that this is legal, and we emit segments based on the
`finalzieAggregations` context parameter.
So, the ask is simple: let's find out what it take so that this query passes
and (with the right context setting), emits an intermediate
`COMPLEX<Pair<string, long>>` value into the output segments.
##########
sql/src/test/java/org/apache/druid/sql/calcite/CalciteJoinQueryTest.java:
##########
@@ -1447,12 +1447,24 @@ public void testInnerJoinQueryOfLookup(Map<String,
Object> queryContext)
.build()
),
ImmutableList.of(
- new Object[]{"", "a", "xabc", "xabc"},
- new Object[]{"1", "a", "xabc", "xabc"}
+ new Object[]{"", "a", "xa", "xa"},
+ new Object[]{"1", "a", "xa", "xa"}
)
);
}
+ @Test(expected = UnsupportedSQLQueryException.class)
Review Comment:
Absolutely. Every datasource row has a `__time` column.
Here's another way to express this for the `SELECT` case. If the query works
for the interactive SQL engine, it should work for the MSQ query engine. It is
not a benefit to the user that `LATEST(x)` works on one engine but not the
other. It is, instead, a bug to be fixed.
It certainly true that it might be hard to fix given how MSQ does things for
some reason. That just means fixing it might be hard.
If, however, this query does not work on the interactive SQL engine, then
we've got larger issues which we may not be able to fix in this PR. One issue
is, "what does it mean to offer `LATEST(x)` if it doesn't actually work?"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]