paul-rogers commented on issue #13816: URL: https://github.com/apache/druid/issues/13816#issuecomment-1474328597
Discussions continue about the user experience that Druid wants to provide. In sharp contrast to the above, another view is that rollup is not a _datasource_ property, but rather an _ingestion_ decision. That is, there is no such thing as a "rollup datasource": there are only datasources, some of which happen to contain aggregated data. (The analogy with Postgres tables is often cited.) At ingest time, a user may choose to write aggregated data into a datasource, or to write unaggregated "detail" data. At compaction time, users may choose to further aggregate already-rolled-up data, or to aggregate data that was originally detail. (A common use case: the first month of data is detail, the next month is rolled up to 5-minute grain, and the final month is at hour grain.)

Given this, perhaps the best solution is for the catalog to provide no rollup support at all; ingestion queries and compaction specs provide it instead. We could add a `WITH ROLLUP` keyword to MSQ ingestion to tell the system to write intermediate aggregates to segments. The decision about which columns are dimensions and which are metrics is then expressed in the ingestion SQL itself, which allows those decisions to evolve over time.

Since rewriting the same SQL statement for every batch ingestion is tedious, we can learn from streaming, which stores the ingestion spec within a supervisor. For batch, maybe allow the user to store the SQL statement within Druid as a "procedure" (say), which is then invoked as, say, `CALL MY_INGESTION(ARRAY["file1.csv", "file2.csv"])`. The rollup "schema" lives in the SQL statement, not in the table metadata. In this case, a table schema might not even be needed: the schema is whatever the SQL statement says it is, just as the dimensions and metrics are whatever the SQL statement says they are. We end up with a novel hybrid of DDL, stored procedure, and DML in a single statement: no redundancy, just one compact way to specify table schema, rollup, and ingestion in a single unit.
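To make the idea concrete, a stored ingestion "procedure" under this proposal might look something like the sketch below. To be clear, this is entirely hypothetical syntax: neither `CREATE PROCEDURE`, nor `CALL`, nor a `WITH ROLLUP` modifier on `INSERT` exists in Druid SQL today, and the column names are invented for illustration.

```sql
-- HYPOTHETICAL syntax sketch; none of this exists in Druid SQL today.
-- One statement captures schema, rollup grain, dimensions, and metrics.
CREATE PROCEDURE my_ingestion(files VARCHAR ARRAY) AS
INSERT INTO my_datasource WITH ROLLUP
SELECT
  TIME_FLOOR(__time, 'PT5M') AS __time,  -- roll up to 5-minute grain
  page,                                  -- dimension
  country,                               -- dimension
  COUNT(*) AS row_count,                 -- metric
  SUM(bytes) AS total_bytes              -- metric
FROM TABLE(EXTERN(...))                  -- input-source details elided
GROUP BY 1, 2, 3
PARTITIONED BY DAY;

-- Then each batch run supplies only the inputs:
CALL my_ingestion(ARRAY['file1.csv', 'file2.csv']);
```

The point of the sketch is the division of labor: the stored statement owns the rollup "schema" (what is a dimension, what is a metric, what grain), while the `CALL` owns only the per-batch inputs, so nothing about rollup needs to live in catalog-level table metadata.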
Another way of expressing this idea: SQL (and the original catalog concept) is based on declaring what you want, then letting the engine work out how to achieve it. The alternative approach is procedural: what you want is encoded in the statements needed to produce that result. SQL is declarative, while Java (and its kin) is procedural. The approach sketched here would double down on Druid's historically procedural approach to defining datasources. A procedural approach may be more familiar to Druid's target audience, while the declarative SQL approach may feel a bit foreign. Such an approach is also novel: older cube-based engines ask the user to declare the cube schema up front (dimensions and measures, with an aggregate function for each measure). Still, the solution sketched in this comment might be the right one for Druid given its own history and target users.
