paul-rogers commented on issue #12546:
URL: https://github.com/apache/druid/issues/12546#issuecomment-1196145784
## Table Metadata
The table metadata object holds two kinds of information: the system-defined
metadata about the entry, and the user-provided table specification
(`TableSpec`). All tables, regardless of kind (datasource, input, view, etc.)
use the same table metadata object: only the table specification part varies.
Example:
```json
{
  "dbSchema": "druid",
  "name": "read",
  "creationTime": 1654634106432,
  "updateTime": 1654634106432,
  "state": "ACTIVE",
  "spec": <TableSpec>
}
```
Fields:
* `dbSchema`: Schema name
* `name`: Table name
* `owner`: SQL owner. Not yet supported. Omitted if null, which is currently
the only valid value.
* `creationTime`: UTC Timestamp (in milliseconds since the epoch) of the
original creation time.
* `updateTime`: UTC Timestamp (in milliseconds since the epoch) of the most
recent creation or update. This is the record's "version" when using optimistic
locking.
* `state`: Applies to datasources. Normally `ACTIVE`; may be `DELETING` while
datasource deletion is in progress. (Not yet supported.)
* `spec`: The user-defined table specification as described below.
When creating a table, omit all fields except `dbSchema`, `name` and
`spec`. The owner, when supported, will be the user that creates the entry.
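For example, a creation request body might look like the following sketch (the table name and spec contents here are purely illustrative):

```json
{
  "dbSchema": "druid",
  "name": "events",
  "spec": {
    "type": "datasource",
    "segmentGranularity": "P1D"
  }
}
```

The system-maintained fields (`creationTime`, `updateTime`, `state`) are filled in by Druid on creation.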
### Table Specification (`TableSpec`)
The table specification holds the user-provided information for a catalog
entry for the table. The type of the specification depends on the type of the
table. The table specifications are preliminary and subject to change. (Let's
make that stronger: the definitions _will_ change.)
The common parts of all definitions include:
```json
"spec": {
  "type": "<type>",
  "tags": { }
}
```
Fields:
* `type`: Type of the table: one of the types defined below.
* `tags`: A generic key/value list of user or extension-defined tags. Allows
the application to attach application-specific data to the table. Druid itself
ignores these tags other than to save them and return them via the API.
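As a sketch, an application might attach its own bookkeeping entries this way (the keys shown are hypothetical; Druid stores and returns them but does not interpret them):

```json
"tags": {
  "owner-team": "metrics",
  "contains-pii": false
}
```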
#### Datasource Specification
The datasource specification provides the values to use on ingestion and
initial compaction. Druid rules may change these properties as the data ages.
At present, such rules are _not_ reflected in the metadata catalog. Example:
```json
{
  "type": "datasource",
  "segmentGranularity": "PT1H",
  "targetSegmentRows": 1000000,
  "columns": [ ... ],
  "tags": { ... }
}
```
Fields (excluding the common fields):
* `segmentGranularity`: The segment size, using one of Druid's supported
sizes. Equivalent to the `PARTITION BY` clause on `INSERT` statements.
* `targetSegmentRows`: The number of rows to target per segment (when there
is enough data for segment size to matter.) Defaults to Druid's usual value of
5 million rows. Change the value only for extreme cases, such as very large
rows, or other tuning needs.
* `columns`: List of columns. The list need not include all of the actual
datasource columns. Those listed here are expanded in `SELECT *` statements in
the order defined here; the others appear afterward in alphabetical order.
Columns are of various types depending on the kind of datasource.
More properties will come as work proceeds.
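Putting these fields together, a hypothetical detail datasource spec with its column list filled in might look like this (column names and values are illustrative; the column objects use the form described in the next section):

```json
{
  "type": "datasource",
  "segmentGranularity": "PT1H",
  "targetSegmentRows": 1000000,
  "columns": [
    {
      "type": "column",
      "name": "__time",
      "sqlType": "TIMESTAMP"
    },
    {
      "type": "column",
      "name": "userAgent",
      "sqlType": "VARCHAR"
    }
  ],
  "tags": { }
}
```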
#### Datasource Column Specification
Datasource columns are of one of three types, depending on the kind of
datasource. For a detail (non-rollup) datasource, use the `column` type. For a
roll-up table, use either `dimension` or `measure`. The general form is:
```json
{
  "type": "<type>",
  "name": "<name>",
  "sqlType": "<type>",
  "tags": { ... }
}
```
The `name` is required. The `sqlType` is the column type. (Note: don't
confuse the column type, `sqlType`, with the type of the JSON object, `type`.)
The `tags` are optional.
For detail columns or dimensions, the SQL type is optional; if provided, it
must be one of `VARCHAR`, `BIGINT`, `FLOAT` or `DOUBLE`. For the `__time`
column, the type must be `TIMESTAMP`. (A later comment will discuss how the
type is used.)
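As a sketch, a dimension column in a rollup datasource might be declared as follows (the column name is illustrative):

```json
{
  "type": "dimension",
  "name": "country",
  "sqlType": "VARCHAR"
}
```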
For measure columns, the type is required so that Druid knows how to create
the column. The measure type is compound: it includes the name of an [aggregate
function](https://druid.apache.org/docs/latest/querying/aggregations.html)
along with the argument types. The "starter" set of types is the following:
| SQL Type | Native Type |
| -------- | ----------- |
| `COUNT()` | `count` |
| `SUM(BIGINT \| FLOAT \| DOUBLE)` | `longSum` \| `floatSum` \| `doubleSum` |
| `MIN(BIGINT \| FLOAT \| DOUBLE)` | `longMin` \| `floatMin` \| `doubleMin` |
| `MAX(BIGINT \| FLOAT \| DOUBLE)` | `longMax` \| `floatMax` \| `doubleMax` |
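For example, a measure column using one of the compound types above might be specified as follows (the column name is illustrative):

```json
{
  "type": "measure",
  "name": "totalBytes",
  "sqlType": "SUM(BIGINT)"
}
```

Here `SUM(BIGINT)` would map to the native `longSum` aggregator per the table above.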
#### Input Source Specification
Defines an input source for `INSERT` and `SELECT` statements. An input
source is equivalent to an `EXTERN` definition, but with the definition stored
in the catalog. (See a note later about the extended table functions.)
Druid defines the input source and format as JSON objects. The existing
`EXTERN` function uses JSON strings to set these values. The catalog uses a
different approach: a "flattened" list of names which Druid maps internally to
the fields in the JSON structure. This approach allows us to mesh the input
source specification with table function arguments.
Example for an inline CSV file:
```json
{
  "type": "input",
  "properties": {
    "source": "inline",
    "format": "csv",
    "data": "a,b,1\nc,d,2\n"
  },
  "columns": [
    {
      "name": "a",
      "sqlType": "varchar"
    },
    ...
  ],
  "tags": { ... }
}
```
The three top-level fields are:
* `type`: Must be `"input"` to say it is an input source definition.
* `properties`: A set of name/value pairs using the same names and values as
used in the extended table function.
* `columns`: Corresponds to the `columns` field in Druid ingestion, but
using the column format defined here.
Column properties:
* `name`: A valid SQL identifier for the column. Case-sensitive.
* `sqlType`: One of Druid's scalar data types: `VARCHAR`, `FLOAT`, `DOUBLE`
or `BIGINT`. Case-insensitive.
Columns must appear in the same order that they appear in the input file.
See a later note for details of the supported input sources and formats, and
their properties.