paul-rogers commented on issue #13816: URL: https://github.com/apache/druid/issues/13816#issuecomment-1474328597
Discussions continue about the user experience that Druid wants to provide. In sharp contrast to the above, another view is that rollup is not a _datasource_ property, but rather an _ingestion_ decision. That is, there is no such thing as a "rollup datasource": there are only datasources, some of which happen to contain aggregated data. (The analogy with Postgres tables is often cited.) At ingest time, a user may choose to write aggregated data into a datasource, or to write unaggregated "detail" data. At compaction time, users may choose to further aggregate already-rolled-up data, or to aggregate data that was originally detail. (A common use case: the first month of data is detail, the next month is rolled up to 5-minute grain, and the final month is at hour grain.)

Given this, perhaps the best solution is for the catalog to provide no rollup support at all; ingestion queries and compaction specs provide it instead. We could add a `WITH ROLLUP` keyword to MSQ ingestion to tell the system to write intermediate aggregates to segments. The decision about which columns are dimensions and which are metrics is then expressed in the ingestion SQL itself, which allows those decisions to evolve over time.

Since rewriting the same SQL statement for every batch ingestion is tedious, we can learn from streaming, which stores the ingestion spec within a supervisor. For batch, maybe allow the user to store the SQL statement within Druid as a "procedure" (say), which is then invoked as, say, `CALL MY_INGESTION(ARRAY["file1.csv", "file2.csv"])`. The rollup "schema" lives in the SQL statement, not in the table metadata. In this case, a table schema might not even be needed: the schema is whatever the SQL statement says it is, just as the dimensions and metrics are whatever the SQL statement says they are. We end up with a novel hybrid of DDL, stored procedure, and DML in a single statement: no redundancy, just one compact way to specify table schema, rollup, and ingestion in a single unit.
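To make the idea concrete, a stored ingestion "procedure" under this proposal might look something like the sketch below. To be clear, this is entirely hypothetical syntax: neither `CREATE PROCEDURE`, nor `CALL`, nor a `WITH ROLLUP` modifier on `INSERT` exists in Druid SQL today, and the column names are invented for illustration.

```sql
-- HYPOTHETICAL syntax sketch; none of this exists in Druid SQL today.
-- One statement captures schema, rollup grain, dimensions, and metrics.
CREATE PROCEDURE my_ingestion(files VARCHAR ARRAY) AS
INSERT INTO my_datasource WITH ROLLUP
SELECT
  TIME_FLOOR(__time, 'PT5M') AS __time,  -- roll up to 5-minute grain
  page,                                  -- dimension
  country,                               -- dimension
  COUNT(*) AS row_count,                 -- metric
  SUM(bytes) AS total_bytes              -- metric
FROM TABLE(EXTERN(...))                  -- input-source details elided
GROUP BY 1, 2, 3
PARTITIONED BY DAY;

-- Then each batch run supplies only the inputs:
CALL my_ingestion(ARRAY['file1.csv', 'file2.csv']);
```

The point of the sketch is the division of labor: the stored statement owns the rollup "schema" (what is a dimension, what is a metric, what grain), while the `CALL` owns only the per-batch inputs, so nothing about rollup needs to live in catalog-level table metadata.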
Another way of expressing this idea: SQL (and the original catalog concept) is based on declaring what you want, then letting the engine work out how to achieve it. The alternative approach is procedural: what you want is encoded in the statements needed to produce that result. SQL is declarative, while Java (and its kin) is procedural. The approach sketched here would double down on Druid's historically procedural approach to defining datasources. A procedural approach may be more familiar to Druid's target audience, while the declarative SQL approach may feel a bit foreign. Such an approach is also novel: older cube-based engines ask the user to declare the cube schema up front (dimensions and measures, with an aggregate function for each measure). Still, the solution sketched in this comment might be the right one for Druid given its own history and target users.
