paul-rogers commented on issue #12546:
URL: https://github.com/apache/druid/issues/12546#issuecomment-1279468615
@clintropolis notes:
> these *are not* the only Druid storage types, the current proposal is only
able to model a rather small subset of the types which can appear in Druid
segment
The intention is that a combination of the column spec and column type
provides a description of all possible column types. Sorry if that was not
clear: the focus in the aggregate section was on, well, aggregates. I just
hadn't gotten far enough to need to deal with the others yet.
One constraint I want us to keep in mind is that we'd like to eventually
allow DDL statements something like:
```sql
CREATE ROLLUP TABLE foo (
__time TIMESTAMP,
a IP_ADDRESS,
b ARRAY(STRING),
c SUM(LONG),
d STRUCT(e STRING, f DOUBLE),
g VARCHAR WITH COMPACT INDEX
)
PARTITION BY DAY
CLUSTER BY a, g
```
So, the type names have to be SQL-like and SQL-parsable.
With a _bit_ more research on complex types, it sounds like we have three
categories:
* Types which are a single "thing" logically, but which have internal
structure: e.g. IP address
* Types which represent aggregates. They may be stored simply (just a
`long`), but have additional semantics which the user has to provide on ingest,
compaction and query. To turn compaction into "auto-compaction", we store the
aggregation in segments, but that is invisible to users.
* Structured types, where the user knows about the structure, such as JSON
objects an arrays.
My proposal (since withdrawn until we rethink it) is:
* Externally simple, internally composite types just takes a SQL name:
`IP_ADDRESS`.
* Aggregates are defined in terms of their aggregate function and operand
type, as in segment metadata: `SUM(LONG)`.
* User-visible complex types provide the required type information:
`ARRAY(LONG)`.
There is no good answer for user-visible structures because those are not
part of the SQL domain of discourse. There is an ill-fated project,
[SQL++](https://arxiv.org/abs/1405.3631) that tried to find a solution. Seems
it was adopted by [Apache
Asterix](https://asterixdb.apache.org/docs/0.9.3/sqlpp/manual.html) and
[CouchBase](https://www.couchbase.com/sqlplusplus).
In Drill, we handled the types outside of SQL by using (an earlier version
of) an
[Arrow-like](https://arrow.apache.org/docs/format/Columnar.html#struct-layout)
format. The current thinking is to adapt that pattern to be more SQL and
Druid-like for use in the catalog, and in eventual SQL DDL statements. For
example we could invent syntax such as `STRUCT(a STRUCT(b BIGINT, c VARCHAR), d
DOUBLE)`.
Array columns can be represented similarly: `ARRAY(DOUBLE)`, say. FWIW,
[Arrow
uses](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layou)`list<double>`.
For the first [catalog PR](https://github.com/apache/druid/pull/13165), the
types are "to be named later": we're just focusing on storing the type names,
whatever we decide they are. This gives us time to continue the type name
discussion.
The catalog proposes using a different column "kind" for dimensions and
measures. (Where "kind" is the Jackson type field in the JSON.) In this way, we
know the difference between a complex dimension (such as `IP_ADDRESS`) and a
complex measure (`SUM(LONG)`). If there are types that can be both a dimension
and a measure (are both aggregates and not), then the column "kind" would
disambiguate.
The kind, by the way, allows us to specify other per-column-kind
information. For example, if there are multiple choices for an index type for
dimensions, that would be a dimension-only attribute of a column as suggested
in the DDL sketch above.
Anyway, the point is taken: we do need a full design for all column types.
I'll work up something.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]