paul-rogers commented on issue #13816: URL: https://github.com/apache/druid/issues/13816#issuecomment-1475424478
To complete the catalog, we have to decide on the user experience we want with the catalog, specifically for rollup datasources. It is helpful to start with first principles: identify the overall user experience that Druid seeks to offer. To help us identify that experience, this note offers three broad approaches, each of which seems valid in its own way.

### Data Modeling

At one end of the spectrum, we could focus on the business problem that the user wants to solve. For rollup, this is to aggregate data in a (variation of the) [classic data cube](https://en.wikipedia.org/wiki/OLAP_cube) model. In this view, rows represent business entities and columns represent attributes of those entities (what the original SQL model called [domains](https://en.wikipedia.org/wiki/Relation_(database)) and which, in implementations, are given by data types).

The job of the designer is to determine the entities needed for an application and to define their attributes. One attribute might be “sourceIP”, defined as a string that represents an IPv4 or IPv6 address. Another might be “sentBytes”, defined as an integer number of bytes. And so on. The idea is that if an application has a design for a table, then the job of ingestion is to map incoming data into this form, and the job of querying is to produce business insights on top of a known data model. (For example, from the above, given a timestamp, we could produce a chart of bytes per IP address per minute.)

This model is well known in the RDBMS community and arises, in part, from the way data has been recorded in business systems from the earliest days of punched cards, through early database systems, and on to RDBMS systems. Traditionally, a query engine would apply the [relational algebra](https://en.wikipedia.org/wiki/Relational_algebra) (typically starting from a relational calculus in SQL) to perform queries over the defined entities (schema).
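To make the data-cube example concrete, here is a sketch of such a query over a hypothetical `web_events` datasource (the datasource and column names are illustrative, not part of any existing design):

```sql
-- Chart of bytes sent per IP address per minute, over a hypothetical
-- "web_events" datasource with the attributes described above.
SELECT
  FLOOR(__time TO MINUTE) AS "minute",
  "sourceIP",
  SUM("sentBytes") AS "totalBytes"
FROM "web_events"
GROUP BY 1, 2
```

The point of the data-modeling approach is that a query like this is meaningful only because the designer first decided what the entities and attributes are.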
This works because the schema tells the relational engine everything it needs to know to figure out how to work with the data. Since the system understands the meaning of the data (at least at the mechanical level), the system can work out how to perform most operations, which is typically the job of the database planner and optimizer. In systems that [allow defining data cubes](https://mondrian.pentaho.com/documentation/schema.php#Cube), the system also works out how to implement rollup. See also [IcCube](https://www.iccube.com/support/documentation/user_guide/walkthrough/walkthrough.php) and [this UI example](https://www.iccube.com/support/documentation/user_guide/walkthrough/img/wt_creating_measure_large.png).

While this approach is classic, it may not be a good fit for the niche that Druid occupies. The premise is that the user provides the (extended) schema and the system works out how to perform operations. Druid, however, solves a different problem, for which this classic approach may not be right.

### Physical Description

Another view of the catalog is that it is simply a physical description of a datasource, with any business meaning better left to the user. In this view, the catalog simply records the storage type of each column in those cases where the user wishes to fix the storage type. In normal practice, Druid chooses a storage type suitable for the actual data. For example, the new schemaless feature will faithfully write into segments the best Druid representation of the data that appears in the input files. MSQ chooses a data type that is the result of the input data type and any operations applied to that type. And so on. The catalog would simply place constraints on this process to, say, always produce a `double` column rather than sometimes inferring `double` and sometimes `long`.
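To make the physical-description approach concrete, a declaration along these lines is what the user would have to write. The syntax below is purely hypothetical, for illustration only; no such DDL exists in Druid today:

```sql
-- Hypothetical DDL, for illustration only: the user pins each column's
-- storage type so ingestion never infers a different one.
CREATE TABLE "web_events" (
  "__time"    TIMESTAMP,
  "sourceIP"  VARCHAR,
  "sentBytes" DOUBLE   -- always store as double, never infer long
)
```

Note that the user gets no new capability from this: the declaration only constrains what ingestion would otherwise decide on its own.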
Since Druid is designed to be “Python-like” (any data type is automatically converted to any other data type), one could argue that locking down column storage types is of limited value. Further, it is not clear that users could make useful decisions about column types without knowing something about the data. That is, it is hard to pick a type for “x”, but a bit easier if we know it represents “bytesSent”. As Druid moves to schemaless ingestion, the goal is that the user _need not_ know about individual columns.

The physical representation would be tedious for aggregates: users would have to know the intermediate types for each aggregate. For example, it is not obvious that the intermediate type for `LATEST` is `COMPLEX<PAIR<string, long>>`.

Druid is built as a series of procedures for ingest, compaction, and querying. Each of those includes some notion of column type. Since Druid is not designed to obtain column types from a catalog, there is little value in providing that information: it would be a manual process to ensure that the types specified in each procedure match those in the catalog. (In systems that work out the types automatically, then, of course, there is value in centralizing the schema information.) And, as noted, there is little value in locking down the types, since Druid is designed to work with varying types: Druid faithfully copies the type from input into segments.

Thus, it is not clear what the user gains by taking the time to specify a physical schema: they get no value or work savings; it is just another task in addition to the existing ones, and the user must take care that all the other tasks agree with the specified schema. The physical modeling approach would seem to offer very little value.

### Process Modeling

Druid is not a traditional data warehouse, and it is not typically used for ETL. More often, Druid is a caching layer that loads existing data into a form that accelerates querying.
In this light, one could argue that data modeling decisions were already made when producing the source data: they don’t have to be made again in Druid. Since each Druid datasource represents a single input source, Druid serves the user best when it faithfully reproduces the original data in segment form. This is, in fact, the gist of the nested data and schemaless projects: to ensure a faithful reproduction.

If the original input source is messy, or if the user has to enrich that input source, or has to map multiple input sources to a single datasource, then one could argue that many tools exist to assist: Spark, Flink, and more. Thus, the input source presented to Druid represents the final, clean result of a data pipeline. Again, any data modeling decisions have already been made within that pipeline.

In this form, the type of data is already determined by the input source: Druid simply has to mirror that type. Since Druid’s type system is flexible, the user can be assured that even if Druid guesses wrong, the data can still act like the accepted type. (For example, if a value is really a double, but a particular ingest contains only integers, there is no harm in storing that particular set of data as `long`: the data will be converted to `double` later when needed.) Of course, Druid provides expressions in all of its ingestion mechanisms: the user can adjust types and names as part of the ingestion procedure. Given that, there is no need to _also_ specify names and types in a catalog.

In this view, there is little value in letting the user pick column types: types are implementation details. Instead, the value is in helping the user perform the operations needed to work with the data: ingestion, compaction, loading data onto the cluster, aging the data, and finally expiring the data. As it turns out, Druid already has many of these operations, as a collection of REST operations.
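As a sketch of adjusting names and types within the ingestion procedure itself, an MSQ ingest statement along these lines (file, column, and datasource names are illustrative) can cast and rename columns inline, which is what makes a separate catalog declaration redundant:

```sql
INSERT INTO "web_events"
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "ip" AS "sourceIP",
  CAST("bytes" AS DOUBLE) AS "sentBytes"  -- type decided here, not in a catalog
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["file1.csv"]}',
    '{"type": "csv", "findColumnsFromHeader": true}',
    '[{"name": "timestamp", "type": "string"},
      {"name": "ip", "type": "string"},
      {"name": "bytes", "type": "long"}]'
  )
)
PARTITIONED BY DAY
```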
So, perhaps the catalog is actually a mechanism to unify these processes and fill in any gaps. For example, it could provide the ability to store an MSQ ingest statement in Druid itself so the user can invoke it simply by providing the input file names:

```sql
CALL myIngestion(ARRAY['file1.csv', 'file2.csv'])
```

The target schema is directly represented by the MSQ statement itself. That statement defines column names and types, says whether to do rollup, and specifies which aggregation functions to use for each metric. Later, compaction specs either repeat this information or choose to revise it (to, say, perform further rollup as data ages). The focus is the procedure: the schema is just the result of the procedure.

The catalog, then, is not about the datasource _schema_, but rather about _tasks_ performed on the datasource. The schema emerges as the result of these tasks. In this view, the user need not know or care about the schema. The catalog is a set of tasks that might include:

* MSQ ingestion query (or queries) for batch ingestion
* Supervisor for streaming ingestion
* Compaction rules
* Segment load rules
* Data aging rules, including tasks to aggregate data at various ages, cold tier, and data expiration

The catalog, via SQL DDL statements, is simply a wrapper around these existing concepts. The one new bit might be the ability to store an MSQ query within Druid, as suggested above.

Schema, in this model, is a physical detail that emerges from the procedures the user chooses to run. If the user wants to influence the schema, they do so by adjusting a task: adding a `CAST` to an MSQ ingestion, etc. This is, in fact, how Druid works today, without the catalog, so all we’re doing is recognizing that we were already correct and that nothing need change regarding schema.

Applying this concept to rollup datasources, a datasource has rollup enabled simply because the ingestion or compaction procedure that created the segments created them with rollup.
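To illustrate how rollup emerges from the procedure rather than from a declared schema, consider a sketch of an MSQ rollup ingest (names illustrative, input spec elided): the `GROUP BY` and the aggregate functions are what make this a rollup datasource; no schema declaration is involved.

```sql
-- Rollup is a consequence of the GROUP BY in the ingest statement,
-- not of any schema declaration.
INSERT INTO "web_events_rollup"
SELECT
  FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS "__time",
  "ip" AS "sourceIP",
  SUM("bytes") AS "sentBytes",
  COUNT(*) AS "eventCount"
FROM TABLE(EXTERN(...))  -- input spec elided
GROUP BY 1, 2
PARTITIONED BY DAY
```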
The columns that store intermediate values do so simply because the user used aggregate functions in their ingestion procedure. This view recognizes that rollup is an implementation detail and may change over time. It is the process that is key, not the shape of the segments.

Since queries are written based on knowledge of the input data source, queries reflect the decisions made when creating that input source. In fact, ideally, the input source structure already reflects how that data should be queried: again, all Druid has to do is preserve that structure. Druid already tracks dimensions and metrics, so queries already have that information available. There is no need to introduce the concept of a schema.
