paul-rogers commented on issue #13816: URL: https://github.com/apache/druid/issues/13816#issuecomment-1475424478
To complete the catalog, we have to decide on the user experience we want with the catalog, specifically for rollup datasources. It is helpful to start with first principles: identify the overall user experience that Druid seeks to offer. To help us identify that experience, this note offers three broad approaches, each of which seems valid in its own way.

### Data Modeling

At one end of the spectrum, we could focus on the business problem that the user wants to solve. For rollup, this is to aggregate data in a (variation of the) [classic data cube](https://en.wikipedia.org/wiki/OLAP_cube) model. In this view, rows represent business entities and columns represent attributes of those entities (what the original SQL model called [domains](https://en.wikipedia.org/wiki/Relation_(database)) and which, in implementations, are given by data types).

The job of the designer is to determine the entities needed for an application and to define their attributes. One attribute might be “sourceIP”, defined as a string that represents an IPv4 or IPv6 address. Another might be “sentBytes”, defined as an integer number of bytes. And so on. The idea is that if an application has a design for a table, then the job of ingestion is to map incoming data into this form, and the job of querying is to produce business insights on top of a known data model. (For example, from the above, given a timestamp, we could produce a chart of bytes per IP address per minute.)

This model is well known in the RDBMS community and arises, in part, from the way data has been recorded in business systems from the earliest days of punched cards, through early database systems, and on to RDBMS systems. Traditionally, a query engine would apply the [relational algebra](https://en.wikipedia.org/wiki/Relational_algebra) (typically starting from a relational calculus in SQL) to perform queries over the defined entities (schema).
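To make the data-cube example concrete, here is a sketch of such a query over a hypothetical `web_events` datasource (the datasource and column names are illustrative, not part of any existing design):

```sql
-- Chart of bytes sent per IP address per minute, over a hypothetical
-- "web_events" datasource with the attributes described above.
SELECT
  FLOOR(__time TO MINUTE) AS "minute",
  "sourceIP",
  SUM("sentBytes") AS "totalBytes"
FROM "web_events"
GROUP BY 1, 2
```

The point of the data-modeling approach is that a query like this is meaningful only because the designer first decided what the entities and attributes are.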
This works because the schema tells the relational engine everything it needs to know to figure out how to work with the data. Since the system understands the meaning of the data (at least at the mechanical level), the system can work out how to perform most operations, which is typically the job of the database planner and optimizer. In systems that [allow defining data cubes](https://mondrian.pentaho.com/documentation/schema.php#Cube), the system also works out how to implement rollup. See also [IcCube](https://www.iccube.com/support/documentation/user_guide/walkthrough/walkthrough.php) and [this UI example](https://www.iccube.com/support/documentation/user_guide/walkthrough/img/wt_creating_measure_large.png).

While this approach is classic, it may not be a good fit for the niche that Druid occupies. The premise is that the user provides the (extended) schema and the system works out how to perform operations. Druid, however, solves a different problem, for which this classic approach may not be right.

### Physical Description

Another view of the catalog is that it is simply a physical description of a datasource, with any business meaning better left to the user. In this view, the catalog simply records the storage type of each column in those cases where the user wishes to fix the storage type. In normal practice, Druid chooses a storage type suitable for the actual data. For example, the new schemaless feature will faithfully write into segments the best Druid representation of the data that appears in the input files. MSQ chooses a data type that is the result of the input data type and any operations applied to that type. And so on. The catalog would simply place constraints on this process to, say, always produce a `double` column rather than sometimes inferring `double` and sometimes `long`.
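To make the physical-description approach concrete, a declaration along these lines is what the user would have to write. The syntax below is purely hypothetical, for illustration only; no such DDL exists in Druid today:

```sql
-- Hypothetical DDL, for illustration only: the user pins each column's
-- storage type so ingestion never infers a different one.
CREATE TABLE "web_events" (
  "__time"    TIMESTAMP,
  "sourceIP"  VARCHAR,
  "sentBytes" DOUBLE   -- always store as double, never infer long
)
```

Note that the user gets no new capability from this: the declaration only constrains what ingestion would otherwise decide on its own.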
Since Druid is designed to be “Python-like” (any data type is automatically converted to any other data type), one could argue that locking down column storage types is of limited value. Further, it is not clear that users could make useful decisions about column types without knowing something about the data. That is, it is hard to pick a type for “x”, but a bit easier if we know it represents “bytesSent”. As Druid moves to schemaless ingestion, the goal is that the user _need not_ know about individual columns.

The physical representation would be tedious for aggregates: users would have to know the intermediate types for each aggregate. For example, it is not obvious that the intermediate type for `LATEST` is `COMPLEX<PAIR<string, long>>`.

Druid is built as a series of procedures for ingest, compaction, and querying. Each of those includes some notion of column type. Since Druid is not designed to obtain column types from a catalog, there is little value in providing that information: it would be a manual process to ensure that the types specified in each procedure match those in the catalog. (In systems that work out the types automatically, then, of course, there is value in centralizing the schema information.) And, as noted, there is little value in locking down the types, since Druid is designed to work with varying types: Druid faithfully copies the type from input into segments.

Thus, it is not clear what the user gains by taking the time to specify a physical schema: they get no value or work savings; it is just another task in addition to the existing ones, and the user must take care that all the other tasks agree with the specified schema. The physical modeling approach would seem to offer very little value.

### Process Modeling

Druid is not a traditional data warehouse, and it is not typically used for ETL. More often, Druid is a caching layer that loads existing data into a form that accelerates querying.
In this light, one could argue that data modeling decisions were already made when producing the source data: they don’t have to be made again in Druid. Since each Druid datasource represents a single input source, Druid serves the user best when it faithfully reproduces the original data in segment form. This is, in fact, the gist of the nested data and schemaless projects: to ensure a faithful reproduction.

If the original input source is messy, or if the user has to enrich that input source, or has to map multiple input sources to a single datasource, then one could argue that many tools exist to assist: Spark, Flink, and more. Thus, the input source presented to Druid represents the final, clean result of a data pipeline. Again, any data modeling decisions have already been made within that pipeline.

In this form, the type of data is already determined by the input source: Druid simply has to mirror that type. Since Druid’s type system is flexible, the user can be assured that even if Druid guesses wrong, the data can still act like the accepted type. (For example, if a value is really a double, but a particular ingest contains only integers, there is no harm in storing that particular set of data as `long`: the data will be converted to `double` later when needed.) Of course, Druid provides expressions in all of its ingestion mechanisms: the user can adjust types and names as part of the ingestion procedure. Given that, there is no need to _also_ specify names and types in a catalog.

In this view, there is little value in letting the user pick column types: types are implementation details. Instead, the value is in helping the user perform the operations needed to work with the data: ingestion, compaction, loading data onto the cluster, aging the data, and finally expiring the data. As it turns out, Druid already has many of these operations, as a collection of REST operations.
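As a sketch of adjusting names and types within the ingestion procedure itself, an MSQ ingest statement along these lines (file, column, and datasource names are illustrative) can cast and rename columns inline, which is what makes a separate catalog declaration redundant:

```sql
INSERT INTO "web_events"
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "ip" AS "sourceIP",
  CAST("bytes" AS DOUBLE) AS "sentBytes"  -- type decided here, not in a catalog
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["file1.csv"]}',
    '{"type": "csv", "findColumnsFromHeader": true}',
    '[{"name": "timestamp", "type": "string"},
      {"name": "ip", "type": "string"},
      {"name": "bytes", "type": "long"}]'
  )
)
PARTITIONED BY DAY
```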
So, perhaps the catalog is actually a mechanism to unify these processes and fill in any gaps. For example, it could provide the ability to store an MSQ ingest statement in Druid itself so the user can invoke it simply by providing the input file names:

```sql
CALL myIngestion(ARRAY['file1.csv', 'file2.csv'])
```

The target schema is directly represented by the MSQ statement itself. That statement defines column names and types, says whether to do rollup, and specifies which aggregation functions to use for each metric. Later, compaction specs either repeat this information or choose to revise it (to, say, perform further rollup as data ages). The focus is the procedure: the schema is just the result of the procedure.

The catalog, then, is not about the datasource _schema_, but rather about _tasks_ performed on the datasource. The schema emerges as the result of these tasks. In this view, the user need not know or care about the schema. The catalog is a set of tasks that might include:

* MSQ ingestion query (or queries) for batch ingestion
* Supervisor for streaming ingestion
* Compaction rules
* Segment load rules
* Data aging rules, including tasks to aggregate data at various ages, cold tier, and data expiration

The catalog, via SQL DDL statements, is simply a wrapper around these existing concepts. The one new bit might be the ability to store an MSQ query within Druid, as suggested above.

Schema, in this model, is a physical detail that emerges from the procedures the user chooses to run. If the user wants to influence the schema, they do so by adjusting a task: adding a `CAST` to an MSQ ingestion, etc. This is, in fact, how Druid works today, without the catalog, so all we’re doing is recognizing that we were already correct and that nothing need change regarding schema.

Applying this concept to rollup datasources, a datasource has rollup enabled simply because the ingestion or compaction procedure that created the segments created them with rollup.
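To illustrate how rollup emerges from the procedure rather than from a declared schema, consider a sketch of an MSQ rollup ingest (names illustrative, input spec elided): the `GROUP BY` and the aggregate functions are what make this a rollup datasource; no schema declaration is involved.

```sql
-- Rollup is a consequence of the GROUP BY in the ingest statement,
-- not of any schema declaration.
INSERT INTO "web_events_rollup"
SELECT
  FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS "__time",
  "ip" AS "sourceIP",
  SUM("bytes") AS "sentBytes",
  COUNT(*) AS "eventCount"
FROM TABLE(EXTERN(...))  -- input spec elided
GROUP BY 1, 2
PARTITIONED BY DAY
```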
The columns that store intermediate values do so simply because the user used aggregate functions in their ingestion procedure. This view recognizes that rollup is an implementation detail and may change over time. It is the process that is key, not the shape of the segments.

Since queries are written based on knowledge of the input data source, queries reflect the decisions made when creating that input source. In fact, ideally, the input source structure already reflects how that data should be queried: again, all Druid has to do is preserve that structure. Druid already tracks dimensions and metrics, so queries already have that information available. There is no need to introduce the concept of a schema.
