paul-rogers commented on issue #12546:
URL: https://github.com/apache/druid/issues/12546#issuecomment-1196145784
## Table Metadata
The table metadata object holds two kinds of information: the system-defined
metadata about the entry, and the user-provided table specification
(`TableSpec`). All tables, regardless of kind (datasource, input, view, etc.)
use the same table metadata object: only the table specification part varies.
Example:
```json
{
  "dbSchema": "druid",
  "name": "read",
  "creationTime": 1654634106432,
  "updateTime": 1654634106432,
  "state": "ACTIVE",
  "spec": <TableSpec>
}
```
Fields:
* `dbSchema`: Schema name
* `name`: Table name
* `owner`: SQL owner. Not yet supported. Omitted if null, which is currently
the only valid value.
* `creationTime`: UTC Timestamp (in milliseconds since the epoch) of the
original creation time.
* `updateTime`: UTC Timestamp (in milliseconds since the epoch) of the most
recent creation or update. This is the record's "version" when using optimistic
locking.
* `state`: Applies to datasources. Normally `ACTIVE`; may be `DELETING` while
datasource deletion is in progress. (Not yet supported.)
* `spec`: The user-defined table specification as described below.
When creating a table, omit all fields except `dbSchema`, `name` and
`spec`. The owner, when supported, will be the user that creates the entry.
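For example, a creation request body might look like the following sketch (the table name and spec contents here are purely illustrative):

```json
{
  "dbSchema": "druid",
  "name": "events",
  "spec": {
    "type": "datasource",
    "segmentGranularity": "P1D"
  }
}
```

The system-maintained fields (`creationTime`, `updateTime`, `state`) are filled in by Druid on creation.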
### Table Specification (`TableSpec`)
The table specification holds the user-provided information for a catalog
entry for the table. The type of the specification depends on the type of the
table. The table specifications are preliminary and subject to change. (Let's
make that stronger: the definitions _will_ change.)
The common parts of all definitions include:
```json
"spec": {
  "type": "<type>",
  "tags": { }
}
```
Fields:
* `type`: Type of the table: one of the types defined below.
* `tags`: A generic key/value list of user or extension-defined tags. Allows
the application to attach application-specific data to the table. Druid itself
ignores these tags other than to save them and return them via the API.
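As a sketch, an application might attach its own bookkeeping entries this way (the keys shown are hypothetical; Druid stores and returns them but does not interpret them):

```json
"tags": {
  "owner-team": "metrics",
  "contains-pii": false
}
```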
#### Datasource Specification
The datasource specification provides the values to use on ingestion and
initial compaction. Druid rules may change these properties as the data ages.
At present, such rules are _not_ reflected in the metadata catalog. Example:
```json
{
  "type": "datasource",
  "segmentGranularity": "PT1H",
  "targetSegmentRows": 1000000,
  "columns": [ ... ],
  "tags": { ... }
}
```
Fields (excluding the common fields):
* `segmentGranularity`: The segment size, using one of Druid's supported
sizes. Equivalent to the `PARTITION BY` clause on `INSERT` statements.
* `targetSegmentRows`: The number of rows to target per segment (when there
is enough data for segment size to matter.) Defaults to Druid's usual value of
5 million rows. Change the value only for extreme cases, such as very large
rows, or other tuning needs.
* `columns`: List of columns. The list need not include all of the actual
datasource columns. Those listed here are expanded in `SELECT *` statements in
the order defined here; the others appear afterward in alphabetical order.
Columns are of various types depending on the kind of datasource.
More properties will come as work proceeds.
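Putting these fields together, a hypothetical detail datasource spec with its column list filled in might look like this (column names and values are illustrative; the column objects use the form described in the next section):

```json
{
  "type": "datasource",
  "segmentGranularity": "PT1H",
  "targetSegmentRows": 1000000,
  "columns": [
    {
      "type": "column",
      "name": "__time",
      "sqlType": "TIMESTAMP"
    },
    {
      "type": "column",
      "name": "userAgent",
      "sqlType": "VARCHAR"
    }
  ],
  "tags": { }
}
```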
#### Datasource Column Specification
Datasource columns are of one of three types, depending on the kind of
datasource. For a detail (non-rollup) datasource, use the `column` type. For a
roll-up table, use either `dimension` or `measure`. The general form is:
```json
{
  "type": "<type>",
  "name": "<name>",
  "sqlType": "<type>",
  "tags": { ... }
}
```
The `name` is required. The `sqlType` is the column type. (Note: don't
confuse the column type, `sqlType`, with the type of the JSON object, `type`.)
The `tags` are optional.
For detail columns or dimensions, the SQL type is optional; if provided, it
must be one of `VARCHAR`, `BIGINT`, `FLOAT` or `DOUBLE`. For the `__time`
column, the type must be `TIMESTAMP`. (A later comment will discuss how the
type is used.)
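As a sketch, a dimension column in a rollup datasource might be declared as follows (the column name is illustrative):

```json
{
  "type": "dimension",
  "name": "country",
  "sqlType": "VARCHAR"
}
```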
For measure columns, the type is required so that Druid knows how to create
the column. The measure type is compound: it includes the name of an [aggregate
function](https://druid.apache.org/docs/latest/querying/aggregations.html)
along with the argument types. The "starter" set of types is the following:
| SQL Type | Native Type |
| -------- | ----------- |
| `COUNT()` | `count` |
| `SUM(BIGINT \| FLOAT \| DOUBLE)` | `longSum` \| `floatSum` \| `doubleSum` |
| `MIN(BIGINT \| FLOAT \| DOUBLE)` | `longMin` \| `floatMin` \| `doubleMin` |
| `MAX(BIGINT \| FLOAT \| DOUBLE)` | `longMax` \| `floatMax` \| `doubleMax` |
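For example, a measure column using one of the compound types above might be specified as follows (the column name is illustrative):

```json
{
  "type": "measure",
  "name": "totalBytes",
  "sqlType": "SUM(BIGINT)"
}
```

Here `SUM(BIGINT)` would map to the native `longSum` aggregator per the table above.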
#### Input Source Specification
Defines an input source for `INSERT` and `SELECT` statements. An input
source is equivalent to an `EXTERN` definition, but with the definition stored
in the catalog. (See a note later about the extended table functions.)
Druid defines the input source and format as JSON objects. The existing
`EXTERN` function uses JSON strings to set these values. The catalog uses a
different approach: a "flattened" list of names which Druid maps internally to
the fields in the JSON structure. This approach allows us to mesh the input
source specification with table function arguments.
Example for an inline CSV file:
```json
{
  "type": "input",
  "properties": {
    "source": "inline",
    "format": "csv",
    "data": "a,b,1\nc,d,2\n"
  },
  "columns": [
    {
      "name": "a",
      "sqlType": "varchar"
    },
    ...
  ],
  "tags": { ... }
}
```
The three top-level fields are:
* `type`: Must be `"input"` to say it is an input source definition.
* `properties`: A set of name/value pairs using the same names and values as
used in the extended table function.
* `columns`: Corresponds to the `columns` field in Druid ingestion, but
using the column format defined here.
Column properties:
* `name`: A valid SQL identifier for the column. Case-sensitive.
* `sqlType`: One of Druid's scalar data types: `VARCHAR`, `FLOAT`, `DOUBLE`
or `BIGINT`. Case-insensitive.
Columns must appear in the same order that they appear in the input file.
See a later note for details of the supported input sources and formats, and
their properties.