paul-rogers commented on issue #13816:
URL: https://github.com/apache/druid/issues/13816#issuecomment-1475424478

   To complete the catalog, we have to decide on the user experience we want 
with the catalog, specifically for rollup datasources. It is helpful to start 
with first principles: identify the overall user experience that Druid seeks to 
offer. To help us identify that experience, this note offers three broad 
approaches, each of which seems valid in its own way.
   
   ### Data Modeling
   
   At one end of the spectrum, we could focus on the business problem that the 
user wants to solve. For rollup, this is to aggregate data in a (variation of) 
the [classic data cube](https://en.wikipedia.org/wiki/OLAP_cube) model. In this 
view, rows represent business entities and columns represent attributes of 
those entities (what the original relational model called 
[domains](https://en.wikipedia.org/wiki/Relation_(database)), which 
implementations represent as data types).
   
   The job of the designer is to determine the entities needed for an 
application, and to define the attributes. One attribute might be “sourceIP”, 
defined as a string that represents an IPv4 or IPv6 address. Another might be 
“sentBytes”, defined as an integer number of bytes. And so on. The idea is that 
if an application has a design for a table, then the job of ingestion is to map 
incoming data into this form, and the job of querying is to produce business 
insights on top of a known data model. (For example, from the above, given a 
timestamp, we could produce a chart of bytes in per IP address per minute.)
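   As a sketch, the entity above could be modeled with a table definition and a 
query such as the following. (Table and column names are illustrative, and the 
DDL is hypothetical: Druid does not offer catalog-defined tables today.)
   
   ```sql
   -- Hypothetical data-model declaration: the designer fixes the entity
   -- and its attributes up front.
   CREATE TABLE webEvents (
     __time    TIMESTAMP,  -- event time
     sourceIP  VARCHAR,    -- IPv4 or IPv6 address
     sentBytes BIGINT      -- bytes sent
   );

   -- With the model known, the "bytes per IP per minute" chart is just a
   -- query over the declared schema.
   SELECT
     FLOOR(__time TO MINUTE) AS "minute",
     sourceIP,
     SUM(sentBytes) AS totalBytes
   FROM webEvents
   GROUP BY 1, 2
   ```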
   
   This model is well-known in the RDBMS community and arises, in part, from 
the way data has been recorded in business systems from the earliest days of 
punched cards, through early database systems, and on to RDBMSs.
   
   Traditionally, a query engine would apply the [relational 
algebra](https://en.wikipedia.org/wiki/Relational_algebra) (typically starting 
from a relational calculus in SQL) to perform queries over the defined entities 
(schema). This works because the schema tells the relational engine everything 
it needs to know to figure out how to work with the data.
   
   Since the system understands the meaning of the data (at least at the 
mechanical level), the system can work out how to perform most operations, 
which is typically the job of the database planner and optimizer. In systems 
that [allow defining data 
cubes](https://mondrian.pentaho.com/documentation/schema.php#Cube), the system 
also works out how to implement rollup. See also 
[IcCube](https://www.iccube.com/support/documentation/user_guide/walkthrough/walkthrough.php)
 and [this UI 
example](https://www.iccube.com/support/documentation/user_guide/walkthrough/img/wt_creating_measure_large.png).
   
   While this approach is classic, it may not be a good fit for the niche which 
Druid occupies. The premise is that the user provides the (extended) schema, 
and the system works out how to perform operations. Druid, however, solves a 
different problem, one for which this classic approach may not be the right fit.
   
   ### Physical Description
   
   Another view of the catalog is that it is simply a physical description of a 
datasource with any business meaning better left to the user. In this view, the 
catalog simply records the storage type of each column in those cases where the 
user wishes to fix the storage type.
   
   In normal practice, Druid chooses a storage type suitable for the actual 
data. For example, the new schemaless feature will faithfully write into 
segments the best Druid representation of the data that appears in the input 
files. MSQ chooses a data type based on the input data type and any operations 
applied to it. And so on. The catalog would simply place 
constraints on this process to, say, always produce a `double` column rather 
than sometimes inferring `double` and sometimes `long`.
   
   Since Druid is designed to be “Python-like” (any data type is automatically 
converted to any other data type), one could argue that locking down column 
storage types is of limited value.
   
   Further, it is not clear if users would be able to make useful decisions 
about column types without knowing something about the data. That is, it is 
hard to pick a type for “x”, but a bit easier to pick a type if we know it 
represents “bytesSent”. As Druid moves to schemaless ingestion, the goal is 
that the user _need not_ know about individual columns.
   
Specifying the physical representation would be tedious for aggregates: users 
would have to know the intermediate type of each aggregate. For example, it is not 
obvious that the intermediate type for `LATEST` is `COMPLEX<PAIR<string, 
long>>`.
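   To make the burden concrete, under this approach a catalog entry for a 
`LATEST` metric would have to spell out the intermediate type directly, 
something like the following sketch (hypothetical DDL syntax; names are 
illustrative):
   
   ```sql
   -- Hypothetical physical-schema declaration: the user must know and
   -- write the non-obvious intermediate type, rather than the aggregate
   -- function that produces it.
   CREATE TABLE events (
     __time     TIMESTAMP,
     service    VARCHAR,
     lastStatus COMPLEX<PAIR<string, long>>  -- intermediate form of LATEST(status)
   );
   ```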
   
Druid is built as a series of procedures for ingest, compaction, and 
querying. Each of these includes some notion of column type. Since Druid is not 
designed to obtain column types from a catalog, there is little value in 
providing that information: it would be a manual process to ensure that the 
types specified in each procedure match those in the catalog. (In systems that 
work out the types automatically, then, of course, there is value in 
centralizing the schema information.)
   
   And, as noted, there is little value in locking down the types since Druid 
is designed to work with varying types: Druid faithfully copies the type from 
input into segments. Thus, it is not clear what the user gains by taking the 
time to specify a physical schema: they save no work and gain no value; it is 
just another task on top of the existing ones, and the user must take care 
that all the other tasks agree with the specified schema. The physical 
modeling approach would seem to offer very little value.
   
   ### Process Modeling
   
   Druid is not a traditional data warehouse. Druid is not typically used for 
ETL. More often, Druid is a caching layer that loads existing data into a form 
that accelerates querying. In this light, one could argue that data modeling 
decisions were already made when producing the source data: they don’t have to 
be made again in Druid. Since each Druid datasource represents a single input 
source, Druid serves the user best when it faithfully reproduces the original 
data in segment form. This is, in fact, the gist of the nested data and 
schemaless projects: to ensure a faithful reproduction.
   
   If the original input source is messy, or if the user has to enrich that 
input source, or has to map multiple input sources to a single datasource, then 
one could argue that many tools exist to assist: Spark, Flink, and more. Thus, 
the input source presented to Druid represents the final, clean, result of a 
data pipeline. Again, any data modeling decisions have already been made within 
that data pipeline.
   
   In this form, the type of data is already determined by the input source: 
Druid simply has to mirror that type. Since Druid’s type system is flexible, the 
user can be assured that even if Druid guesses wrong, the data can still act 
like the accepted type. (For example, if a value is really a double, but a 
particular ingest has only integers, there is no harm in storing that 
particular set of data as `long`: the data will be converted to `double` later 
when needed.)
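   A small sketch of this coercion at query time (names are illustrative): even 
if some segments stored `sentBytes` as `long`, arithmetic with a `double` 
promotes the values transparently.
   
   ```sql
   -- sentBytes may be LONG in some segments and DOUBLE in others; the
   -- division below coerces the values to DOUBLE at query time, so the
   -- stored type is invisible to the user.
   SELECT sourceIP, SUM(sentBytes) / 1024.0 AS sentKB
   FROM webEvents
   GROUP BY sourceIP
   ```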
   
   Of course, Druid provides expressions in all of its ingestion mechanisms: 
the user can adjust types and names as part of the ingestion procedure. Given 
that, there is no need to _also_ specify names and types in a catalog.
   
   In this view, there is little value in letting the user pick column types: 
types are implementation details. Instead, the value is in helping the user 
perform the operations needed to work with the data: ingestion, compaction, 
loading data onto the cluster, aging the data, and finally expiring the data.
   
   As it turns out, Druid has many of these operations already, as a collection 
of REST operations. So, perhaps the catalog is actually a mechanism to unify 
these processes and fill in any gaps. For example, the catalog could store an 
MSQ ingest statement in Druid itself so that the user can invoke it simply by 
providing the input file names:
   
   ```sql
   CALL myIngestion(ARRAY['file1.csv', 'file2.csv'])
   ```
   
   The target schema is directly represented by the MSQ statement itself. That 
statement defines column names and types, says whether to do rollup, and which 
aggregation functions to use for each metric. Later, compaction specs either 
repeat this information or choose to revise it (to, say, perform further 
rollup as data ages). The focus is the procedure: the schema is just the result 
of the procedure.
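   Such a stored statement might look like the following sketch (the table, 
columns, and `EXTERN` parameters are illustrative):
   
   ```sql
   -- A rollup ingestion: column names and types, the decision to roll up,
   -- and the aggregator for each metric are all implied by the statement
   -- itself, not declared separately.
   REPLACE INTO webEvents OVERWRITE ALL
   SELECT
     TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1M') AS __time,
     "sourceIP",
     SUM("sentBytes") AS "sentBytes"
   FROM TABLE(
     EXTERN(
       '{"type": "local", "files": ["file1.csv", "file2.csv"]}',
       '{"type": "csv", "findColumnsFromHeader": true}',
       '[{"name": "timestamp", "type": "string"},
         {"name": "sourceIP", "type": "string"},
         {"name": "sentBytes", "type": "long"}]'
     )
   )
   GROUP BY 1, 2
   PARTITIONED BY DAY
   ```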
   
   The catalog, then, is not about the datasource _schema_, but rather about 
_tasks_ performed on the datasource. The schema emerges as the result of these 
tasks. In this view, the user need not know or care about the schema.
   
   The catalog, then, is a set of tasks that might include:
   
   * MSQ ingestion query (or queries) for batch ingestion
   * Supervisor for streaming ingestion
   * Compaction rules
   * Segment load rules
   * Data aging rules, including tasks to aggregate data at various ages, cold 
tier, and data expiration
   
   The catalog, via SQL DDL statements, is simply a wrapper around these 
existing concepts. The one new bit might be the ability to store an MSQ query 
within Druid, as suggested above.
   
   Schema, in this model, is a physical detail that emerges from the procedures 
the user chooses to run. If users want to influence the schema, they do so by 
adjusting a task: adding a `CAST` to an MSQ ingestion, and so on. This is, in 
fact, how Druid works today, without the catalog, so all we are doing is 
recognizing that the existing behavior is correct and that nothing need change 
regarding schema.
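   For example (a sketch; names are illustrative, and the input source clause 
is elided), pinning a column type happens inside the task itself rather than in 
a separate schema declaration:
   
   ```sql
   -- The CAST fixes the storage type of sentBytes for this ingestion;
   -- the "schema" is simply whatever this statement produces.
   REPLACE INTO webEvents OVERWRITE ALL
   SELECT
     TIME_PARSE("timestamp") AS __time,
     "sourceIP",
     CAST("sentBytes" AS DOUBLE) AS "sentBytes"
   FROM ...  -- input source clause elided
   PARTITIONED BY DAY
   ```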
   
   Applying this concept to rollup datasources, a datasource has rollup enabled 
simply because the ingestion or compaction procedure that created its segments 
did so with rollup. Columns store intermediate values simply because the user 
used aggregate functions in the ingestion procedure. 
This view recognizes that rollup is an implementation detail and may change 
over time. It is the process that is key, not the shape of the segments.
   
   Since queries are written based on knowledge of the input data source, 
queries reflect the decisions made when creating that source. In fact, 
ideally, the input source structure already reflects how the data should be 
queried: again, all Druid has to do is preserve that structure. Druid already 
tracks dimensions and metrics, so queries already have that information 
available. There is no need to introduce the concept of a schema.
   

