paul-rogers commented on issue #12546:
URL: https://github.com/apache/druid/issues/12546#issuecomment-1149170931
Work is far enough along to propose the REST API. This material (with any
needed updates) will go into the API docs as part of the PR.
## REST API
The catalog adds two sets of APIs: one for users, the other internal for use
by Brokers. All APIs are based at `/druid/coordinator/v1/catalog`.
Note that all these APIs operate on metadata catalog entries, not on the
actual datasource. The table metadata can exist before the underlying
datasource does, and the definition can be deleted without deleting the
datasource.
### Create Table
`POST {base}/tables/{dbSchema}[?ifNew=true|false]`
Creates a table definition within the indicated schema. Payload is a
`TableSpec`, defined below.
The schema must be the name of a valid writable Druid schema to which the
user has write access. The valid schemas at present are `druid` (for
datasources) and `input` (for input table definitions). We may also support
`view`.
The user must have write access to the underlying datasource, even if the
datasource does not yet exist. (That is, the user requires the same permissions
as they will require when they first ingest data into the datasource.) For
input tables, the user must have the extended `INPUT` permission on the input
source name.
The default semantics is that the table must not yet exist; an error is
returned if the table exists. Use `ifNew=true` to get the SQL `IF NOT EXISTS`
semantics.
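As a sketch of how a client might build the Create Table URL (the Coordinator host and port here are assumptions for illustration, not part of the spec):

```python
from urllib.parse import urlencode

# Assumed Coordinator address; adjust for your deployment.
COORDINATOR = "http://localhost:8081"
BASE = f"{COORDINATOR}/druid/coordinator/v1/catalog"

def create_table_url(db_schema: str, if_new: bool = False) -> str:
    """Build the Create Table URL; if_new=True gives IF NOT EXISTS semantics."""
    url = f"{BASE}/tables/{db_schema}"
    if if_new:
        url += "?" + urlencode({"ifNew": "true"})
    return url

print(create_table_url("druid", if_new=True))
# http://localhost:8081/druid/coordinator/v1/catalog/tables/druid?ifNew=true
```

A real client would then POST a `TableSpec` JSON body (defined below) to this URL with the appropriate authentication.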
### Update Table Definition
`POST {base}/tables/{dbSchema}/{name}[?version={n}]`
Updates an existing table within the indicated schema, and with the given
name. Payload is a `TableDefn`, defined below. The schema must be as described
for Create Table above.
The API supports two "levels" of synchronization. By default, the new entry
simply overwrites the existing entry. However, if `version={n}` is included,
then the update occurs only if the update timestamp in the current metadata DB
record matches that given in the REST call. Using a version provides a form of
"optimistic locking": first read the definition, make a change and send the
update using the update time from the read. Doing this prevents accidental
overwrites.
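To illustrate the optimistic-locking semantics, here is a toy in-memory model of the rule. This is a stand-in for the metadata DB, not Druid's implementation: an update carrying a stale version is rejected, while one carrying the current `updateTime` succeeds.

```python
# Toy catalog: (dbSchema, name) -> TableSpec-like dict.
catalog = {}

def read(db_schema, name):
    """Return a copy of the entry, as a GET would."""
    return dict(catalog[(db_schema, name)])

def update(db_schema, name, defn, version=None):
    """Apply the ?version={n} rule: update only if version matches updateTime."""
    spec = catalog[(db_schema, name)]
    if version is not None and version != spec["updateTime"]:
        raise ValueError("version mismatch: concurrent update detected")
    spec["defn"] = defn
    spec["updateTime"] += 1  # monotonic stand-in for a real timestamp
    return spec["updateTime"]

catalog[("druid", "read")] = {"updateTime": 100, "defn": {"type": "datasource"}}

spec = read("druid", "read")
update("druid", "read",
       {"type": "datasource", "segmentGranularity": "PT1H"},
       version=spec["updateTime"])  # succeeds: versions match
try:
    # A second writer still holding the old version is rejected.
    update("druid", "read", {}, version=spec["updateTime"])
except ValueError as e:
    print(e)  # version mismatch: concurrent update detected
```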
### Read Table
`GET {base}/tables/{dbSchema}/{name}`
Reads the entry for the table given by a schema and table name. The user
must have read access to the table. Returns a 404 (NOT FOUND) if the entry does
not exist. Remember: the metadata entry exists independently of the datasource
itself. The result is the `TableSpec` defined below.
### Drop Table
`DELETE {base}/tables/{dbSchema}/{name}[?ifExists=true|false]`
Drops the catalog entry for the given table and schema. The schema must
exist. The table must also exist, and the user must have write access to the
underlying datasource.
The optional `ifExists=true` parameter provides SQL `IF EXISTS` semantics:
no error is given if the table does not exist.
### List Schemas
`GET {base}/tables`
Retrieves the list of the names of schemas known to the catalog, which is
the same set of schemas as in the `INFORMATION_SCHEMA.SCHEMATA` table.
Note that, in the present version, the catalog does not support user-defined
schemas.
The list is filtered based on user permissions: the list will omit schemas
for which the user does not have read access.
### List Tables
`GET {base}/tables/{dbSchema}`
Retrieves the list of the names of tables within the given schema. This list
contains only those tables for which metadata entries appear, and is thus a
subset of those returned by `INFORMATION_SCHEMA.TABLES`. The list contains only
those tables for which the user has read access.
### Synchronize Table (Internal)
`GET {base}/sync/{dbSchema}/{name}`
Retrieves the entry for a single table within the given schema as a
`TableSpec` object. The user is assumed to be the Druid super-user. This API is
primarily for use by the Broker node.
### Synchronize Schema (Internal)
`GET {base}/sync/{dbSchema}`
Returns the list of all table metadata, as `TableSpec` objects, within the
given schema. The user is assumed to be the Druid super-user. This API is
primarily for use by the Broker node.
### Table Specification (`TableSpec`)
The `TableSpec` object holds two kinds of information: the system-defined
metadata about the entry, and the user-provided table definition (`TableDefn`).
All tables, regardless of kind (datasource, input, view, etc.) use the same
`TableSpec` object: only the table definition part varies.
Example:
```json
{
  "dbSchema": "druid",
  "name": "read",
  "owner": null,
  "creationTime": 1654634106432,
  "updateTime": 1654634106432,
  "state": "ACTIVE",
  "defn": <TableDefn>
}
```
Fields:
* `dbSchema`: Schema name
* `name`: Table name
* `owner`: SQL owner. Not yet supported.
* `creationTime`: UTC Timestamp (in milliseconds since the epoch) of the
original creation time.
* `updateTime`: UTC Timestamp (in milliseconds since the epoch) of the most
recent creation or update. This is the record's "version" when using optimistic
locking.
* `state`: For datasources. Normally `ACTIVE`. May be `DELETING` if
datasource deletion is in progress. (Not yet supported.)
* `defn`: The user-defined table definition as described below.
When creating a table, omit all fields except `name` and `defn`. The
owner, when supported, is the user who creates the entry.
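For example, a minimal creation payload (using a hypothetical datasource definition for illustration) carries only the name and definition; the server fills in the timestamps and state:

```json
{
  "name": "read",
  "defn": {
    "type": "datasource",
    "segmentGranularity": "PT1H"
  }
}
```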
### Table Definition (`TableDefn`)
The table definition holds the user-provided information for a catalog entry
for the table. The type of the definition depends on the type of the table. The
table definitions are preliminary and subject to change. (Let's make that
stronger: the definitions _will_ change.)
The common parts of all definitions include:
```json
"defn": {
"type": "<type>",
"properties": { }
}
```
Fields:
* `type`: Type of the table: one of the types defined below.
* `properties`: A generic key/value list of user or extension-defined
properties. Allows the application to attach application-specific data to the
table. Druid itself ignores these properties other than to save them and return
them via the API.
#### Datasource Definition
The datasource definition provides the values to use on ingestion and
initial compaction. Druid rules may change these properties as the data ages.
At present, such rules are _not_ reflected in the metadata catalog. Example:
```json
{
"type":"datasource",
"segmentGranularity": "PT1H",
"rollupGranularity": "PT1M",
"targetSegmentRows": 1000000,
"columns": [ ... ],
"properties": { ... }
}
```
Fields (excluding the common fields):
* `segmentGranularity`: The segment size, using one of Druid's supported
sizes. Equivalent to the `PARTITION BY` clause on `INSERT` statements.
* `rollupGranularity`: If `null` (or omitted), then this is a detail
datasource. If provided, then this is a rollup datasource with the ingest-time
rollup given here. The value must be one of Druid's valid rollup granularities.
Replaces the `GROUP BY` clause in an `INSERT` statement.
* `targetSegmentRows`: The number of rows to target per segment (when there
is enough data for segment size to matter). Defaults to Druid's usual value of
5 million rows. Change the value only for extreme cases, such as very large
rows or other tuning needs.
* `columns`: List of columns. Need not include all the actual datasource
columns. Those listed here are expanded in a `SELECT *` statement in the
order defined here; the others follow in alphabetical order. Columns are of
various types depending on the type of table.
TODO: Add column definitions.
More properties will come as work proceeds.
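As a hedged illustration of the `rollupGranularity` rule above, a detail (non-rollup) datasource definition simply omits that field (or sets it to `null`); the other fields shown here are placeholders:

```json
{
  "type": "datasource",
  "segmentGranularity": "P1D",
  "columns": [ ],
  "properties": { }
}
```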
#### Input Source Definition
Defines an input source for `INSERT` and `SELECT` statements. An input
source is equivalent to an `EXTERN` definition, but with the definition stored
in the catalog.
At present, all values are defined in the catalog. Later work will allow
parameters for things like the list of file names.
Example for a CSV file:
```json
{
"type":"input",
"inputSource":{
"type":"inline",
"data":"a,b,1\nc,d,2\n"
},
"format":{
"type":"csv",
"columns":[
"x",
"y",
"z"
],
"listDelimiter":null,
"findColumnsFromHeader":false,
"skipHeaderRows":0
},
"columns":[
{
"type":"InputColumnDefn",
"name":"a",
"sqlType":"varchar"
}
],
"properties": { ... }
}
```
The three top-level fields are:
* `inputSource`: Druid Input source. See [the
docs](https://druid.apache.org/docs/latest/ingestion/data-formats.html#input-format).
* `format`: Druid input format. See [the
docs](https://druid.apache.org/docs/latest/ingestion/data-formats.html#input-format).
* `columns`: Corresponds to the `columns` field in Druid ingestion, but
using the column format defined here.
#### CSV Column Format
CSV columns appear in the order of columns in the CSV file. The three fields
are:
* `type`: Must be `InputColumnDefn`.
* `name`: The CSV column name. If the file has a header row, this should
generally match the header name.
* `sqlType`: The SQL type Druid should use for the column. Defaults to
`VARCHAR`. The CSV data for the column must be convertible to the provided type.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]