paul-rogers commented on issue #12546: URL: https://github.com/apache/druid/issues/12546#issuecomment-1191931096
The sections above described the syntax of the extended `staged` table function, and noted that the same property list is used in the catalog entry for an input table. This section details the properties themselves. Every input table must include an input source, and typically must also include an input format, if the source allows multiple formats. In each case, the property names are converted internally to the corresponding JSON objects as described in the source code and documentation. This section is a cross-reference of property names: see the documentation for more information about each property.

## Input Source Properties

This section describes [Input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html) properties. Every input table definition (or `staged(.)` function call) must include the `source` property:

| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `source` | `VARCHAR` | `type` | Input source type |

### Inline

The [inline input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#inline-input-source) is most useful for testing.

| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `source` | `VARCHAR` | `type` | Must be `'inline'` |
| `data` | `VARCHAR` | `data` | Inlined data to ingest |

When including data in JSON or SQL, include actual newlines; the `\n` escape sequence doesn't work. Example:

```sql
SELECT *
FROM TABLE("input"."inline"(data => 'e,f,3
g,h,4
'))
```

### Local

The [local input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#local-input-source) reads from the local machine and is most useful when Druid runs on your laptop.
| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `source` | `VARCHAR` | `type` | Must be `'local'` |
| `files` | `VARCHAR` | `files` | List of comma-separated files to ingest |
| `file` | `VARCHAR` | `files` | Alias for `files` |
| `fileFilter` | `VARCHAR` | `filter` | Optional file filter |
| `baseDir` | `VARCHAR` | `baseDir` | Base directory |

As with the [JSON version](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#local-input-source), you use the `local` source in one of two ways:

* Give a list of files: provide either `file` (for one file) or `files` (for a comma-separated list).
* Give a base directory with `baseDir` and an optional file name filter with `fileFilter`.

Example:

```sql
SELECT *
FROM TABLE(staged(
  source => 'local',
  file => 'myWiki.csv',
  format => 'csv'))
```

### HTTP

The [HTTP input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#http-input-source) lets you read data from the internet.

| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `source` | `VARCHAR` | `type` | Must be `'http'` |
| `uris` | `VARCHAR` | `uris` | List of comma-separated URIs to ingest |
| `uri` | `VARCHAR` | `uris` | Alias for `uris` |
| `user` | `VARCHAR` | `httpAuthenticationUsername` | Optional user name |
| `password` | `VARCHAR` | `httpAuthenticationPassword` | Optional password |

### Others

TBD. The above is a "starter set".

## Input Format Properties

This section describes [Input format](https://druid.apache.org/docs/latest/ingestion/data-formats.html) properties. Every input table definition (or `staged(.)` function call) must include the `format` property, if the input source requires a format.
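As a sketch of how a source and format combine, the HTTP source above might be paired with a format like this (the URI, credentials, and column names here are hypothetical placeholders, not values from the original examples):

```sql
SELECT *
FROM TABLE(staged(
  source => 'http',
  -- Hypothetical URI and credentials for illustration only
  uri => 'https://example.com/data/wiki.csv',
  user => 'bob',
  password => 'secret',
  format => 'csv'))
  (x VARCHAR NOT NULL, y VARCHAR NOT NULL, z BIGINT NOT NULL)
```

The column list is required here for the same reason as in the CSV examples below: the engine cannot yet infer columns from the input.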
| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `format` | `VARCHAR` | `type` | Input format type |

### CSV

Defines a [CSV](https://druid.apache.org/docs/latest/ingestion/data-formats.html#csv) (or, more generally, a delimited text) input format.

| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `format` | `VARCHAR` | `type` | Must be `'csv'` |
| `delimiter` | `VARCHAR` | `listDelimiter` | Defaults to comma |
| `skipRows` | `INTEGER` | `skipHeaderRows` | Number of rows to skip |

Only the `format` property is required: Druid provides defaults for the other values. The `delimiter` defaults to a comma. The [documentation](https://druid.apache.org/docs/latest/ingestion/data-formats.html#csv) says that it defaults to Ctrl-A, but that seems not to be the case, at least in this context.

When used in an input table, Druid does not yet support the ability to infer column names from the input file, so `findColumnsFromHeader` is not supported here. CSV requires a list of columns. Provide these in the `columns` section of the input table definition in the catalog, or using the extended function notation in SQL:

```sql
SELECT *
FROM TABLE(staged(
  source => 'inline',
  data => 'a,b,1
c,d,2
',
  format => 'csv'))
  (x VARCHAR NOT NULL, y VARCHAR NOT NULL, z BIGINT NOT NULL)
```

### JSON

Defines a [JSON](https://druid.apache.org/docs/latest/ingestion/data-formats.html#json) input format.

| Property | Type | JSON Name | Description |
| -------- | ---- | --------- | ----------- |
| `format` | `VARCHAR` | `type` | Must be `'json'` |

The catalog does not yet support the `flattenSpec` or `featureSpec` properties. Although batch ingestion does not require a list of columns, the multi-stage engine does. Provide columns the same way as described for the CSV format above.
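As a sketch, a JSON version of the inline CSV example might look like the following (the data and column names are illustrative, assuming the usual newline-delimited JSON layout):

```sql
SELECT *
FROM TABLE(staged(
  source => 'inline',
  -- Illustrative newline-delimited JSON; as with CSV, use actual newlines
  data => '{"x": "a", "y": "b", "z": 1}
{"x": "c", "y": "d", "z": 2}
',
  format => 'json'))
  (x VARCHAR NOT NULL, y VARCHAR NOT NULL, z BIGINT NOT NULL)
```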
