paul-rogers commented on issue #12546:
URL: https://github.com/apache/druid/issues/12546#issuecomment-1149170931
Work is far enough along to propose the REST API. This material (with any
needed updates) will go into the API docs as part of the PR.
## REST API
The catalog adds two sets of APIs: one for users, the other internal for use
by Brokers. All APIs are based at `/druid/coordinator/v1/catalog`.
Note that all these APIs operate on metadata catalog entries, not on the
actual datasource. The table metadata can exist before the underlying
datasource does, and the definition can be deleted without deleting the
datasource.
### Create Table
`POST {base}/tables/{dbSchema}[?ifNew=true|false]`
Creates a table definition within the indicated schema. Payload is a
`TableSpec`, defined below.
The schema must be the name of a valid writable Druid schema to which the
user has write access. The valid schemas at present are `druid` (for
datasources) and `input` (for input table definitions). We may also support
`view`.
The user must have write access to the underlying datasource, even if the
datasource does not yet exist. (That is, the user requires the same permissions
as they will require when they first ingest data into the datasource.) For
input tables, the user must have the extended `INPUT` permission on the input
source name.
The default semantics is that the table must not yet exist; an error is
returned if the table exists. Use `ifNew=true` to get the SQL `IF NOT EXISTS`
semantics.
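As a sketch of how a client might build the Create Table URL (the Coordinator host and port here are assumptions for illustration, not part of the spec):

```python
from urllib.parse import urlencode

# Assumed Coordinator address; adjust for your deployment.
COORDINATOR = "http://localhost:8081"
BASE = f"{COORDINATOR}/druid/coordinator/v1/catalog"

def create_table_url(db_schema: str, if_new: bool = False) -> str:
    """Build the Create Table URL; if_new=True gives IF NOT EXISTS semantics."""
    url = f"{BASE}/tables/{db_schema}"
    if if_new:
        url += "?" + urlencode({"ifNew": "true"})
    return url

print(create_table_url("druid", if_new=True))
# http://localhost:8081/druid/coordinator/v1/catalog/tables/druid?ifNew=true
```

A real client would then POST a `TableSpec` JSON body (defined below) to this URL with the appropriate authentication.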
### Update Table Definition
`POST {base}/tables/{dbSchema}/{name}[?version={n}]`
Updates an existing table within the indicated schema, and with the given
name. Payload is a `TableDefn`, defined below. The schema must be as described
for Create Table above.
The API supports two "levels" of synchronization. By default, the new entry
simply overwrites the existing entry. However, if `version={n}` is included,
then the update occurs only if the update timestamp in the current metadata DB
record matches that given in the REST call. Using a version provides a form of
"optimistic locking": first read the definition, make a change and send the
update using the update time from the read. Doing this prevents accidental
overwrites.
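To illustrate the optimistic-locking semantics, here is a toy in-memory model of the rule. This is a stand-in for the metadata DB, not Druid's implementation: an update carrying a stale version is rejected, while one carrying the current `updateTime` succeeds.

```python
# Toy catalog: (dbSchema, name) -> TableSpec-like dict.
catalog = {}

def read(db_schema, name):
    """Return a copy of the entry, as a GET would."""
    return dict(catalog[(db_schema, name)])

def update(db_schema, name, defn, version=None):
    """Apply the ?version={n} rule: update only if version matches updateTime."""
    spec = catalog[(db_schema, name)]
    if version is not None and version != spec["updateTime"]:
        raise ValueError("version mismatch: concurrent update detected")
    spec["defn"] = defn
    spec["updateTime"] += 1  # monotonic stand-in for a real timestamp
    return spec["updateTime"]

catalog[("druid", "read")] = {"updateTime": 100, "defn": {"type": "datasource"}}

spec = read("druid", "read")
update("druid", "read",
       {"type": "datasource", "segmentGranularity": "PT1H"},
       version=spec["updateTime"])  # succeeds: versions match
try:
    # A second writer still holding the old version is rejected.
    update("druid", "read", {}, version=spec["updateTime"])
except ValueError as e:
    print(e)  # version mismatch: concurrent update detected
```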
### Read Table
`GET {base}/tables/{dbSchema}/{name}`
Reads the entry for the table given by a schema and table name. The user
must have read access to the table. Returns a 404 (NOT FOUND) if the entry does
not exist. Remember: the metadata entry exists independently of the datasource
itself. The result is the `TableSpec` defined below.
### Drop Table
`DELETE {base}/tables/{dbSchema}/{name}[?ifExists=true|false]`
Drops the catalog entry for the given table and schema. The schema must
exist. The table must also exist, and the user must have write access to the
underlying datasource.
The optional `ifExists=true` parameter provides SQL `IF EXISTS` semantics:
no error is given if the table does not exist.
### List Schemas
`GET {base}/tables`
Retrieves the list of the names of schemas known to the catalog, which is
the same set of schemas as in the `INFORMATION_SCHEMA.SCHEMATA` table.
Note that, in the present version, the catalog does not support user-defined
schemas.
The list is filtered based on user permissions: the list will omit schemas
for which the user does not have read access.
### List Tables
`GET {base}/tables/{dbSchema}`
Retrieves the list of the names of tables within the given schema. This list
contains only those tables for which metadata entries appear, and is thus a
subset of those returned by `INFORMATION_SCHEMA.TABLES`. The list contains only
those tables for which the user has read access.
### Synchronize Table (Internal)
`GET {base}/sync/{dbSchema}/{name}`
Retrieves the entry for a single table within the given schema as a
`TableSpec` object. The user is assumed to be the Druid super-user. This API is
primarily for use by the Broker node.
### Synchronize Schema (Internal)
`GET {base}/sync/{dbSchema}`
Returns the list of all table metadata, as `TableSpec` objects, within the
given schema. The user is assumed to be the Druid super-user. This API is
primarily for use by the Broker node.
### Table Specification (`TableSpec`)
The `TableSpec` object holds two kinds of information: the system-defined
metadata about the entry, and the user-provided table definition (`TableDefn`).
All tables, regardless of kind (datasource, input, view, etc.) use the same
`TableSpec` object: only the table definition part varies.
Example:
```json
{
  "dbSchema": "druid",
  "name": "read",
  "owner": null,
  "creationTime": 1654634106432,
  "updateTime": 1654634106432,
  "state": "ACTIVE",
  "defn": <TableDefn>
}
```
Fields:
* `dbSchema`: Schema name
* `name`: Table name
* `owner`: SQL owner. Not yet supported.
* `creationTime`: UTC Timestamp (in milliseconds since the epoch) of the
original creation time.
* `updateTime`: UTC Timestamp (in milliseconds since the epoch) of the most
recent creation or update. This is the record's "version" when using optimistic
locking.
* `state`: For datasources. Normally `ACTIVE`. May be `DELETING` if
datasource deletion is in progress. (Not yet supported.)
* `defn`: The user-defined table definition as described below.
When creating a table, omit all fields except `name` and `defn`. The
owner, when supported, is the user who creates the entry.
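For example, a minimal creation payload (using a hypothetical datasource definition for illustration) carries only the name and definition; the server fills in the timestamps and state:

```json
{
  "name": "read",
  "defn": {
    "type": "datasource",
    "segmentGranularity": "PT1H"
  }
}
```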
### Table Definition (`TableDefn`)
The table definition holds the user-provided information for a catalog entry
for the table. The type of the definition depends on the type of the table. The
table definitions are preliminary and subject to change. (Let's make that
stronger: the definitions _will_ change.)
The common parts of all definitions include:
```json
"defn": {
"type": "<type>",
"properties": { }
}
```
Fields:
* `type`: Type of the table: one of the types defined below.
* `properties`: A generic key/value list of user or extension-defined
properties. Allows the application to attach application-specific data to the
table. Druid itself ignores these properties other than to save them and return
them via the API.
#### Datasource Definition
The datasource definition provides the values to use on ingestion and
initial compaction. Druid rules may change these properties as the data ages.
At present, such rules are _not_ reflected in the metadata catalog. Example:
```json
{
"type":"datasource",
"segmentGranularity": "PT1H",
"rollupGranularity": "PT1M",
"targetSegmentRows": 1000000,
"columns": [ ... ],
"properties": { ... }
}
```
Fields (excluding the common fields):
* `segmentGranularity`: The segment size, using one of Druid's supported
sizes. Equivalent to the `PARTITION BY` clause on `INSERT` statements.
* `rollupGranularity`: If `null` (or omitted), then this is a detail
datasource. If provided, then this is a rollup datasource with the ingest-time
rollup given here. The value must be one of Druid's valid rollup granularities.
Replaces the `GROUP BY` clause in an `INSERT` statement.
* `targetSegmentRows`: The number of rows to target per segment (when there
is enough data for segment size to matter). Defaults to Druid's usual value of
5 million rows. Change the value only for extreme cases, such as very large
rows or other tuning needs.
* `columns`: List of columns. Need not include all the actual datasource
columns. Those listed here are expanded in a `SELECT *` statement in the
order defined here; the others follow in alphabetical order. Columns are of
various types depending on the type of table.
TODO: Add column definitions.
More properties will come as work proceeds.
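As a hedged illustration of the `rollupGranularity` rule above, a detail (non-rollup) datasource definition simply omits that field (or sets it to `null`); the other fields shown here are placeholders:

```json
{
  "type": "datasource",
  "segmentGranularity": "P1D",
  "columns": [ ],
  "properties": { }
}
```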
#### Input Source Definition
Defines an input source for `INSERT` and `SELECT` statements. An input
source is equivalent to an `EXTERN` definition, but with the definition stored
in the catalog.
At present, all values are defined in the catalog. Later work will allow
parameters for things like the list of file names.
Example for a CSV file:
```json
{
"type":"input",
"inputSource":{
"type":"inline",
"data":"a,b,1\nc,d,2\n"
},
"format":{
"type":"csv",
"columns":[
"x",
"y",
"z"
],
"listDelimiter":null,
"findColumnsFromHeader":false,
"skipHeaderRows":0
},
"columns":[
{
"type":"InputColumnDefn",
"name":"a",
"sqlType":"varchar"
}
],
"properties": { ... }
}
```
The three top-level fields are:
* `inputSource`: Druid Input source. See [the
docs](https://druid.apache.org/docs/latest/ingestion/data-formats.html#input-format).
* `format`: Druid input format. See [the
docs](https://druid.apache.org/docs/latest/ingestion/data-formats.html#input-format).
* `columns`: Corresponds to the `columns` field in Druid ingestion, but
using the column format defined here.
#### CSV Column Format
CSV columns appear in the order of columns in the CSV file. The three fields
are:
* `type`: Must be `InputColumnDefn`.
* `name`: The CSV column name. If the file has a header row, this should
generally match the header name.
* `sqlType`: The SQL type Druid should use for the column. Defaults to
`VARCHAR`. The CSV data for the column must be convertible to the provided type.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]