paul-rogers commented on issue #12546: URL: https://github.com/apache/druid/issues/12546#issuecomment-1133780681
@FrankChen021, thanks for your encouragement! Yes, I hope that borrowing a bit of RDBMS functionality we can make Druid a bit easier to use by letting Druid keep track of its own details. You asked about INFORMATION_SCHEMA, so let's address that: ## INFORMATION SCHEMA Revisions Druid's `INFORMATION_SCHEMA` schema is an evolution of a Calcite concept: it presents the view of the underlying Druid metadata that is used by the SQL engine to plan queries. For example, segments for a datasource might have multiple types: long in one segment, string in a newer segment. Druid picks one for use in SQL. In this case, it will pick string (presented to SQL as `VARCHAR`) since that is what appears in the newest segments. Since `INFORMATION_SCHEMA` is a "view", it is immutable. We retain this idea in the catalog feature. Instead, we modify the `INFORMATION_SCHEMA` results to show the effect of applying catalog information on top of the information gleaned from segments. For example, if the user decided that the above field really should be a long, they'd specify it as the SQL `BIGINT` type, which would override Druid's type inference rules. The type for the column would then appear as `BIGINT` in the `INFORMATION_SCHEMA.COLUMNS` record for that column. We may add a "hide" option to columns to mark a column that exists in segments, but, for whatever reason, is unwanted. A hidden column would not appear in `INFORMATION_SCHEMA.COLUMNS` since it is not available to SQL. ### Clean-up Since Druid inherited `INFORMATION_SCHEMA` from Calcite, it picked up some columns that actually have no meaning in Druid. New users have to learn that, though the columns exist, they don't do anything, which is annoying. So, we propose to do a bit of house-cleaning. ### Renove Unused `SCHEMATA Columns We sill remove the following columns from `INFORMATION_SCHEMA.SCHEMATA`: * `CATALOG_NAME` Druid barely supports one level of schema, it is unlikely to support two levels (a catalog and a schema). * `DEFAULT_CHARACTER_SET_CATALOG` * `DEFAULT_CHARACTER_SET_SCHEMA` * `DEFAULT_CHARACTER_SET_NAME` These are removed because Druid always works in UTF-8. We leave two additional columns because they *may* be useful as we work out catalog details: * `SCHEMA_OWNER` - useful if Druid were to support temporary tables: such tables would reside in a schema owned by a specific user. There are no plans to add such a feature at the moment, but such a feature is common in SQL systems. * `SQL_PATH` - Not sure what Calcite uses this for, but it might be handy to allow aliases: the ability to "rename" a table by creating an alias known to SQL, while the native layer uses the original name. Again, there are no plans to add such a feature now, but it could be a way to overcome the "no rename" limitation in Druid. Then, we tidy up `TABLES` and `COLUMNS`: * `TABLE_CATALOG` Removed as described above since these columns at as a foreign key into the to-be-remove `CATALOG_NAME` in `SCHEMATA`. This is a breaking change only if a client is currently using both the `CATALOG_NAME` and `SCHEMA_NAME` when doing a join between tables and schemata, or between columns and tables. Since it is a potential breaking change, we'll want to consider the implications. ### Remove Unused `COLUMNS` Columns In `COLUMNS`, we propose to remove: * `CHARACTER_MAXIMUM_LENGTH` - Used for `CHAR(x)` and `VARCHAR(x)` which Druid does not support. * `CHARACTER_OCTET_LENGTH` - As above. * `NUMERIC_PRECISION` - Used for `DECIMAL(x, y)` which Druid does not support. * `NUMERIC_PRECISION_RADIX` - As above. * `NUMERIC_SCALE` - As above * `DATETIME_PRECISION` - Used for SQL DateTime and Interval types which specify a range, which Druid does not support. * `CHARACTER_SET_NAME` - Not needed: Druid always uses UTF-8. * `COLLATION_NAME` - Not used, Druid does not support different collation types the way that, say, MySQL does. ### Add Druid-Specific Table Metadata For Tables, we *may* want to add additional useful Druid information: * Segment grain - Since segment grain can change over time, this would be the ingest grain only, not coarser grains that users may apply as data ages. * Rollup grain - Null if the table is a detail datasource, else the rollup (AKA query) grain used at ingestion. Again, this grain can change over time (and different ingestion tasks can choose a different grain), but this is the one that would be used by the SQL-based `INSERT` statement if the user does not specify one. * Is rollup - Possible addition: true if the datasource does rollup on ingest (and thus the rollup grain is non-null), false otherwise. A datasource with rollup may have columns with the aggregate types: datasources without rollup will not. (Again, this applies only to ingestion time, not any rollup introduced as data ages.) The above are needed so the SQL planner knows how to plan `INSERT` statements against rollup tables. Since the SQL planner will use this information, `INFORMATION_SCHEMA` should present it so the user sees what the planner uses. To handle aging-based changes in datasources, we *could* introduce another table that provides these rules, but that is out of cope for this project. ### Add Druid-Specific Column Metadata For `COLUMNS` we *may* add additional columns to express Druid attributes. This is very preliminary: * Dimension options (such as indexing options) * Measure (metric) aggregation types (more type detail than just `COMPLEX`). This would show that `SUM(BIGINT)` or `LATEST(VARCHAR)` idea discussed above. ### Replace Ad-hoc String Columns with Boolean Values It would also be handy to modify the Boolean columns to be actual Booleans: * `TABLES.IS_JOINABLE` * `TABLES.IS_BROADCAST` * `COLUMNS.IS_NULLABLE` Rather than show `"YES"` and `"NO"` (which every client has to know how to parse. Not even `"TRUE"`/`"FALSE"`?), instead use the common Boolean encoding of `TRUE = 1/FALSE = 0`. Report the JDBC type as `BOOLEAN`, the `DATA_TYPE` as `INT`. The above would be a breaking change for any tool which is currently doing its own parsing, and so requires consideration. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
