[GitHub] [druid] paul-rogers commented on issue #12546: Druid Catalog Proposal

GitBox Sat, 21 May 2022 16:04:29 -0700


paul-rogers commented on issue #12546:
URL: https://github.com/apache/druid/issues/12546#issuecomment-1133780681


   @FrankChen021, thanks for your encouragement!  Yes, I hope that borrowing a 
bit of RDBMS functionality we can make Druid a bit easier to use by letting 
Druid keep track of its own details.
   
   You asked about INFORMATION_SCHEMA, so let's address that:
   
   ## INFORMATION SCHEMA Revisions
   
   Druid's `INFORMATION_SCHEMA` schema is an evolution of a Calcite concept: it 
presents the view of the underlying Druid metadata that is used by the SQL 
engine to plan queries. For example, segments for a datasource might have 
multiple types: long in one segment, string in a newer segment. Druid picks one 
for use in SQL. In this case, it will pick string (presented to SQL as 
`VARCHAR`) since that is what appears in the newest segments.
   
   Since `INFORMATION_SCHEMA` is a "view", it is immutable. We retain this idea 
in the catalog feature. Instead, we modify the `INFORMATION_SCHEMA` results to 
show the effect of applying catalog information on top of the information 
gleaned from segments. For example, if the user decided that the above field 
really should be a long, they'd specify it as the SQL `BIGINT` type, which 
would override Druid's type inference rules. The type for the column would then 
appear as `BIGINT` in the `INFORMATION_SCHEMA.COLUMNS` record for that column.
   
   We may add a "hide" option to columns to mark a column that exists in 
segments, but, for whatever reason, is unwanted. A hidden column would not 
appear in `INFORMATION_SCHEMA.COLUMNS` since it is not available to SQL.
   
   ### Clean-up
   
   Since Druid inherited `INFORMATION_SCHEMA` from Calcite, it picked up some 
columns that actually have no meaning in Druid. New users have to learn that, 
though the columns exist, they don't do anything, which is annoying. So, we 
propose to do a bit of house-cleaning.
   
   ### Renove Unused `SCHEMATA Columns
   
   We sill remove the following columns from `INFORMATION_SCHEMA.SCHEMATA`:
   
   * `CATALOG_NAME`
   
   Druid barely supports one level of schema, it is unlikely to support two 
levels (a catalog and a schema).
   
   * `DEFAULT_CHARACTER_SET_CATALOG`
   * `DEFAULT_CHARACTER_SET_SCHEMA`
   * `DEFAULT_CHARACTER_SET_NAME`
   
   These are removed because Druid always works in UTF-8. We leave two 
additional columns because they *may* be useful as we work out catalog details:
   
   * `SCHEMA_OWNER` - useful if Druid were to support temporary tables: such 
tables would reside in a schema owned by a specific user. There are no plans to 
add such a feature at the moment, but such a feature is common in SQL systems.
   * `SQL_PATH` - Not sure what Calcite uses this for, but it might be handy to 
allow aliases: the ability to "rename" a table by creating an alias known to 
SQL, while the native layer uses the original name. Again, there are no plans 
to add such a feature now, but it could be a way to overcome the "no rename" 
limitation in Druid.
   
   Then, we tidy up `TABLES` and `COLUMNS`:
   
   * `TABLE_CATALOG` 
   
   Removed as described above since these columns at as a foreign key into the 
to-be-remove `CATALOG_NAME` in `SCHEMATA`.
   
   This is a breaking change only if a client is currently using both the 
`CATALOG_NAME` and `SCHEMA_NAME` when doing a join between tables and schemata, 
or between columns and tables. Since it is a potential breaking change, we'll 
want to consider the implications.
   
   ### Remove Unused `COLUMNS` Columns
   
   In `COLUMNS`, we propose to remove:
   
   * `CHARACTER_MAXIMUM_LENGTH` - Used for `CHAR(x)` and `VARCHAR(x)` which 
Druid does not support.
   * `CHARACTER_OCTET_LENGTH` - As above.
   * `NUMERIC_PRECISION` - Used for `DECIMAL(x, y)` which Druid does not 
support.
   * `NUMERIC_PRECISION_RADIX` - As above.
   * `NUMERIC_SCALE` - As above
   * `DATETIME_PRECISION` - Used for SQL DateTime and Interval types which 
specify a range, which Druid does not support.
   * `CHARACTER_SET_NAME` - Not needed: Druid always uses UTF-8.
   * `COLLATION_NAME` - Not used, Druid does not support different collation 
types the way that, say, MySQL does.
   
   ### Add Druid-Specific Table Metadata
   
   For Tables, we *may* want to add additional useful Druid information:
   
   * Segment grain - Since segment grain can change over time, this would be 
the ingest grain only, not coarser grains that users may apply as data ages.
   * Rollup grain - Null if the table is a detail datasource, else the rollup 
(AKA query) grain used at ingestion. Again, this grain can change over time 
(and different ingestion tasks can choose a different grain), but this is the 
one that would be used by the SQL-based `INSERT` statement if the user does not 
specify one.
   * Is rollup - Possible addition: true if the datasource does rollup on 
ingest (and thus the rollup grain is non-null), false otherwise. A datasource 
with rollup may have columns with the aggregate types: datasources without 
rollup will not. (Again, this applies only to ingestion time, not any rollup 
introduced as data ages.)
   
   The above are needed so the SQL planner knows how to plan `INSERT` 
statements against rollup tables. Since the SQL planner will use this 
information, `INFORMATION_SCHEMA` should present it so the user sees what the 
planner uses.
   
   To handle aging-based changes in datasources, we *could* introduce another 
table that provides these rules, but that is out of cope for this project.
   
   ### Add Druid-Specific Column Metadata
   
   For `COLUMNS` we *may* add additional columns to express Druid attributes. 
This is very preliminary:
   
   * Dimension options (such as indexing options)
   * Measure (metric) aggregation types (more type detail than just `COMPLEX`). 
This would show that `SUM(BIGINT)` or `LATEST(VARCHAR)` idea discussed above.
   
   ### Replace Ad-hoc String Columns with Boolean Values
   
   It would also be handy to modify the Boolean columns to be actual Booleans:
   
   * `TABLES.IS_JOINABLE`
   * `TABLES.IS_BROADCAST`
   * `COLUMNS.IS_NULLABLE`
   
   Rather than show `"YES"` and `"NO"` (which every client has to know how to 
parse. Not even `"TRUE"`/`"FALSE"`?), instead use the common Boolean encoding 
of `TRUE = 1/FALSE = 0`. Report the JDBC type as `BOOLEAN`, the `DATA_TYPE` as 
`INT`.
   
   The above would be a breaking change for any tool which is currently doing 
its own parsing, and so requires consideration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] paul-rogers commented on issue #12546: Druid Catalog Proposal

Reply via email to