[GitHub] [druid] paul-rogers commented on issue #12546: Druid Catalog Proposal

GitBox Fri, 14 Oct 2022 14:08:39 -0700


paul-rogers commented on issue #12546:
URL: https://github.com/apache/druid/issues/12546#issuecomment-1279468615


   @clintropolis notes:
   
   > these *are not* the only Druid storage types, the current proposal is only 
able to model a rather small subset of the types which can appear in Druid 
segment
   
   The intention is that a combination of the column spec and column type 
provides a description of all possible column types. Sorry if that was not 
clear: the focus in the aggregate section was on, well, aggregates. I just 
hadn't gotten far enough to need to deal with the others yet.
   
   One constraint I want us to keep in mind is that we'd like to eventually 
allow DDL statements something like:
   
   ```sql
   CREATE ROLLUP TABLE foo (
     __time TIMESTAMP,
    a IP_ADDRESS,
    b ARRAY(STRING),
    c SUM(LONG),
    d STRUCT(e STRING, f DOUBLE),
    g VARCHAR WITH COMPACT INDEX
   )
   PARTITION BY DAY
   CLUSTER BY a, g
   ```
   
   So, the type names have to be SQL-like and SQL-parsable.
   
   With a _bit_ more research on complex types, it sounds like we have three 
categories:
   
   * Types which are a single "thing" logically, but which have internal 
structure: e.g. IP address
   * Types which represent aggregates. They may be stored simply (just a 
`long`), but have additional semantics which the user has to provide on ingest, 
compaction and query. To turn compaction into "auto-compaction", we store the 
aggregation in segments, but that is invisible to users.
   * Structured types, where the user knows about the structure, such as JSON 
objects an arrays.
   
   My proposal (since withdrawn until we rethink it) is:
   
   * Externally simple, internally composite types just takes a SQL name: 
`IP_ADDRESS`.
   * Aggregates are defined in terms of their aggregate function and operand 
type, as in segment metadata: `SUM(LONG)`.
   *  User-visible complex types provide the required type information: 
`ARRAY(LONG)`.
   
   There is no good answer for user-visible structures because those are not 
part of the SQL domain of discourse. There is an ill-fated project, 
[SQL++](https://arxiv.org/abs/1405.3631) that tried to find a solution. Seems 
it was adopted by [Apache 
Asterix](https://asterixdb.apache.org/docs/0.9.3/sqlpp/manual.html) and 
[CouchBase](https://www.couchbase.com/sqlplusplus).
   
   In Drill, we handled the types outside of SQL by using (an earlier version 
of) an 
[Arrow-like](https://arrow.apache.org/docs/format/Columnar.html#struct-layout) 
format. The current thinking is to adapt that pattern to be more SQL and 
Druid-like for use in the catalog, and in eventual SQL DDL statements. For 
example we could invent syntax such as `STRUCT(a STRUCT(b BIGINT, c VARCHAR), d 
DOUBLE)`.
   
   Array columns can be represented similarly: `ARRAY(DOUBLE)`, say. FWIW, 
[Arrow 
uses](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layou)`list<double>`.
   
   For the first [catalog PR](https://github.com/apache/druid/pull/13165), the 
types are "to be named later": we're just focusing on storing the type names, 
whatever we decide they are. This gives us time to continue the type name 
discussion.
   
   The catalog proposes using a different column "kind" for dimensions and 
measures. (Where "kind" is the Jackson type field in the JSON.) In this way, we 
know the difference between a complex dimension (such as `IP_ADDRESS`) and a 
complex measure (`SUM(LONG)`). If there are types that can be both a dimension 
and a measure (are both aggregates and not), then the column "kind" would 
disambiguate.
   
   The kind, by the way, allows us to specify other per-column-kind 
information. For example, if there are multiple choices for an index type for 
dimensions, that would be a dimension-only attribute of a column as suggested 
in the DDL sketch above.
   
   Anyway, the point is taken: we do need a full design for all column types. 
I'll work up something.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] paul-rogers commented on issue #12546: Druid Catalog Proposal

Reply via email to