clintropolis commented on issue #12546:
URL: https://github.com/apache/druid/issues/12546#issuecomment-1196237863

   >thanks for the head's up on the complex types. Can you point me to 
documentation on the details of the type? To any SQL support we already have?
   
   (heh, I think you tagged the wrong person in your comments, sorry other 
@clint 😅 ). Nested data columns are described in proposal #12695 and PR #12753. 
They are wired up to SQL, though I'm mostly just using them as an example. Like 
all complex types, they are currently handled in a more or less opaque manner 
(functions that know how to deal with `COMPLEX<json>` do things with it; things 
that aren't aware do not). This was maybe not a great example because I'm 
considering making this stuff into top-level native Druid types, though that 
would most likely mean adding both `VARIANT` and `STRUCT` (or `MAP` or 
something similar), since if it were done entirely with native types, the 
current `COMPLEX<json>` would effectively be whatever type it encounters: a 
normal scalar type (`LONG`, `STRING`, etc.), a `VARIANT` type, a nested type 
`STRUCT`, an array of primitives (`ARRAY<LONG>`), an array of objects 
(`ARRAY<STRUCT>`), a nested array (`ARRAY<ARRAY<STRUCT>>`), and so on.
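To illustrate the "whatever type it encounters" behavior, here is a rough Python sketch (not Druid code) that maps a JSON-ish value to a Druid-style type string; the type names mirror Druid's native type strings, but the inference logic itself is purely illustrative:

```python
def infer_type(value):
    """Illustrative only: map a JSON-ish value to a Druid-style type string."""
    if isinstance(value, bool) or isinstance(value, int):
        return "LONG"  # booleans simplified to LONG here, just for illustration
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "STRING"
    if isinstance(value, dict):
        return "STRUCT"
    if isinstance(value, list):
        element_types = {infer_type(v) for v in value}
        # homogeneous arrays keep their element type; mixed ones fall back to VARIANT
        inner = element_types.pop() if len(element_types) == 1 else "VARIANT"
        return f"ARRAY<{inner}>"
    return "VARIANT"

print(infer_type([[{"x": 1}]]))  # arrays of objects nest: ARRAY<ARRAY<STRUCT>>
```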
   
   Complex types can be defined as dimensions or metrics, so we can't count on 
defining them all in terms of aggregators.
   
   Internally, we currently build the SQL schemas for Calcite with 
[`DruidTable`](https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/table/DruidTable.java), 
which represents the schema with a 
[`RowSignature`](https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/column/RowSignature.java) 
defined using Druid native types collected from SegmentMetadata queries. 
Complex types are represented internally in Calcite with 
[`ComplexSqlType`](https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/table/RowSignatures.java#L169) 
whenever it is necessary to represent them as an actual SQL type, though this 
is a relatively new construct that isn't used everywhere yet: many of our 
functions with complex inputs and outputs predate this construct at the Calcite 
level, so they use the `ANY` and `OTHER` SQL types and defer validating that a 
value is the correct complex type until the query is translated to a native 
Druid query, which can check against the native Druid types in the 
`RowSignature` of the table.
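As a toy model of that deferred-validation flow (Python, not Druid's actual Java classes; the names here are hypothetical stand-ins), the signature is just an ordered column-to-native-type map that the native layer can consult after SQL planning has already waved the column through as `ANY`/`OTHER`:

```python
class ToyRowSignature:
    """Toy stand-in for Druid's RowSignature: an ordered column -> native type map."""

    def __init__(self):
        self._columns = {}  # dicts preserve insertion order in modern Python

    def add(self, name, native_type):
        self._columns[name] = native_type
        return self  # builder-style chaining, like RowSignature.Builder

    def type_of(self, name):
        return self._columns.get(name)

# At SQL planning time a complex column may surface only as ANY/OTHER; the
# real check happens later, against the native type recorded in the signature.
sig = ToyRowSignature().add("__time", "LONG").add("sketch", "COMPLEX<thetaSketch>")
print(sig.type_of("sketch"))
```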
   
   > My suggestion is to enforce a limited set of types: VARCHAR, BIGINT, FLOAT 
and DOUBLE, which directly correspond to the Druid storage types.
   
   This is my main point: these _are not_ the only Druid storage types; the 
current proposal can only model a rather _small_ subset of the types which can 
appear in Druid segments. The complex type system is extensible, meaning there 
is potentially a large set of complex types depending on which extensions are 
loaded. Internally these are all basically opaque, which is why we have the 
generic `COMPLEX<typeName>` JSON representation of the native type: we extract 
the `typeName` and use it to look up the handlers for that type. Many of these 
types are tied to aggregators, but multiple aggregators can produce the same 
type, and many aggregators support ingesting pre-existing (usually binary) 
format values. I think we need something generic like `COMPLEX<typeName>` for 
the native types so that we can retain the `typeName`, so that functions can 
perform validation and provide meaningful error messages when a 
`COMPLEX<thetaSketch>` input is used with a function that expects 
`COMPLEX<HLLSketch>` or whatever, and so that the native layer can choose the 
correct set of handlers for the type. Otherwise every single complex type will 
need to devise a way for the catalog to recognize it, which sounds like a lot 
of work for rather low utility.
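A minimal sketch of what retaining the `typeName` buys you (Python, illustrative only; the function names are made up, not anything in Druid): parse the generic `COMPLEX<typeName>` form once, then use the extracted name both for handler lookup and for readable error messages:

```python
import re

# Matches the generic COMPLEX<typeName> form and captures the typeName.
COMPLEX_PATTERN = re.compile(r"^COMPLEX<(?P<typeName>[^>]+)>$")

def complex_type_name(type_string):
    """Extract typeName from a 'COMPLEX<typeName>' string, or None if not complex."""
    match = COMPLEX_PATTERN.match(type_string)
    return match.group("typeName") if match else None

def validate_complex_input(actual, expected):
    """Fail with a meaningful message when complex type names don't line up."""
    if complex_type_name(actual) != complex_type_name(expected):
        raise TypeError(f"expected {expected} input, got {actual}")

validate_complex_input("COMPLEX<thetaSketch>", "COMPLEX<thetaSketch>")  # ok
```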
   
   There will also likely be `ARRAY` typed columns in the near future, so we'll 
need to be sure we can model those as well. I guess if it handles stuff like 
`VARCHAR ARRAY` it would be fine as currently proposed, though I've seen other 
ways of defining array types in the wild (looks at BigQuery), so I'm not sure 
how firm the standard is here.
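The two notations could in principle be normalized to one internal form; a tiny Python sketch (hypothetical, not Druid's parser) handling both the SQL-standard postfix form and the angle-bracket form:

```python
def normalize_array_type(sql_type):
    """Normalize 'VARCHAR ARRAY' or 'ARRAY<VARCHAR>' to a common ARRAY<...> form."""
    s = sql_type.strip()
    if s.upper().endswith(" ARRAY"):  # SQL-standard postfix form
        return f"ARRAY<{s[:-len(' ARRAY')].strip().upper()}>"
    if s.upper().startswith("ARRAY<") and s.endswith(">"):  # angle-bracket form
        return f"ARRAY<{s[len('ARRAY<'):-1].strip().upper()}>"
    return s.upper()  # not an array type; pass through uppercased

print(normalize_array_type("VARCHAR ARRAY"))  # ARRAY<VARCHAR>
```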


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

