paul-rogers commented on issue #12546: URL: https://github.com/apache/druid/issues/12546#issuecomment-1191931096
The sections above described the syntax of the extended `staged` table function, and noted that the same property list is used in the catalog entry for an input table. This section details the properties themselves. Every input table must include an input source, and typically must also include an input format, if the source allows multiple formats. In each case, the property names are converted internally to the corresponding JSON objects as described in the source code and documentation. This section is a cross-reference of property names: see the documentation for more information about each property.

## Input Source Properties

This section describes [Input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html) properties. Every input table definition (or `staged(.)` function call) must include the `source` property:

| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `source` | `VARCHAR` | `type` | Input source type |

### Inline

The [inline input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#inline-input-source) is most useful for testing.

| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `source` | `VARCHAR` | `type` | Must be `'inline'` |
| `data` | `VARCHAR` | `data` | Inlined data to ingest |

When including data in JSON or SQL, include actual newlines; the `\n` escape sequence doesn't work. Example:

```sql
SELECT *
FROM TABLE("input"."inline"(data => 'e,f,3
g,h,4
'))
```

### Local

The [local input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#local-input-source) reads from the local machine and is most useful when Druid runs on your laptop.
| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `source` | `VARCHAR` | `type` | Must be `'local'` |
| `files` | `VARCHAR` | `files` | List of comma-separated files to ingest |
| `file` | `VARCHAR` | `files` | Alias for `files` |
| `fileFilter` | `VARCHAR` | `filter` | Optional file filter |
| `baseDir` | `VARCHAR` | `baseDir` | Base directory |

As with the [JSON version](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#local-input-source), you use the `local` source in one of two ways:

* Give a list of files: provide either `file` (for one file) or `files` (for a comma-separated list).
* Give a base directory with `baseDir` and an optional file name filter with `fileFilter`.

Example:

```sql
SELECT *
FROM TABLE(staged(
  source => 'local',
  file => 'myWiki.csv',
  format => 'csv'))
```

### HTTP

The [HTTP input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#http-input-source) lets you read data from the internet.

| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `source` | `VARCHAR` | `type` | Must be `'http'` |
| `uris` | `VARCHAR` | `uris` | List of comma-separated URIs to ingest |
| `uri` | `VARCHAR` | `uris` | Alias for `uris` |
| `user` | `VARCHAR` | `httpAuthenticationUsername` | Optional user name |
| `password` | `VARCHAR` | `httpAuthenticationPassword` | Optional password |

### Others

TBD. The above is a "starter set".

## Input Format Properties

This section describes [Input format](https://druid.apache.org/docs/latest/ingestion/data-formats.html) properties. Every input table definition (or `staged(.)` function call) must include the `format` property, if the input source requires a format.
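As a sketch of how a source and format combine, the HTTP source above might be paired with a format like this (the URI, credentials, and column names here are hypothetical placeholders, not values from the original examples):

```sql
SELECT *
FROM TABLE(staged(
  source => 'http',
  -- Hypothetical URI and credentials for illustration only
  uri => 'https://example.com/data/wiki.csv',
  user => 'bob',
  password => 'secret',
  format => 'csv'))
  (x VARCHAR NOT NULL, y VARCHAR NOT NULL, z BIGINT NOT NULL)
```

The column list is required here for the same reason as in the CSV examples below: the engine cannot yet infer columns from the input.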
| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `format` | `VARCHAR` | `type` | Input format type |

### CSV

Defines a [CSV](https://druid.apache.org/docs/latest/ingestion/data-formats.html#csv) (or, more generally, a delimited text) input format.

| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `format` | `VARCHAR` | `type` | Must be `'csv'` |
| `delimiter` | `VARCHAR` | `listDelimiter` | Defaults to comma |
| `skipRows` | `INTEGER` | `skipHeaderRows` | Number of rows to skip |

Only the `format` property is required: Druid provides defaults for the other values. The `delimiter` defaults to a comma. The [documentation](https://druid.apache.org/docs/latest/ingestion/data-formats.html#csv) says that it defaults to Ctrl-A, but that seems not to be the case, at least in this context.

When used in an input table, Druid does not yet support the ability to infer column names from the input file, so `findColumnsFromHeader` is not supported here. CSV requires a list of columns. Provide these in the `columns` section of the input table definition in the catalog, or using the extended function notation in SQL:

```sql
SELECT *
FROM TABLE(staged(
  source => 'inline',
  data => 'a,b,1
c,d,2
',
  format => 'csv'))
  (x VARCHAR NOT NULL, y VARCHAR NOT NULL, z BIGINT NOT NULL)
```

### JSON

Defines a [JSON](https://druid.apache.org/docs/latest/ingestion/data-formats.html#json) input format.

| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `format` | `VARCHAR` | `type` | Must be `'json'` |

The catalog does not yet support the `flattenSpec` or `featureSpec` properties. Although batch ingestion does not require a list of columns, the multi-stage engine does. Provide columns the same way as described for the CSV format above.
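As a sketch, a JSON version of the inline CSV example might look like the following (the data and column names are illustrative, assuming the usual newline-delimited JSON layout):

```sql
SELECT *
FROM TABLE(staged(
  source => 'inline',
  -- Illustrative newline-delimited JSON; as with CSV, use actual newlines
  data => '{"x": "a", "y": "b", "z": 1}
{"x": "c", "y": "d", "z": 2}
',
  format => 'json'))
  (x VARCHAR NOT NULL, y VARCHAR NOT NULL, z BIGINT NOT NULL)
```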
