zachjsh commented on code in PR #13627:
URL: https://github.com/apache/druid/pull/13627#discussion_r1066140953


##########
docs/multi-stage-query/reference.md:
##########
@@ -88,93 +118,260 @@ FROM TABLE(
   http(
     userName => 'bob',
     password => 'secret',
-    uris => 'http:foo.com/bar.csv',
+    uris => ARRAY['https://example.com/foo.csv', 'https://example.com/bar.csv'],
     format => 'csv'
     )
   ) (x VARCHAR, y VARCHAR, z BIGINT)
 ```
 
+#### Function Arguments
 
-The set of table functions and formats is preliminary in this release.
+These table functions are intended for use with the SQL by-name argument syntax
+as shown above. Because the functions include all parameters for all formats,
+using positional calls is both cumbersome and error-prone.
+
+Function argument names are generally the same as the JSON field names, except
+as noted below. Each argument has a SQL type which matches the JSON type. For
+arguments that take a string list in JSON, use the SQL `ARRAY[...]` syntax, as
+shown in the example above.
+
+Array parameters are good candidates for use in parameterized queries. That is:
+
+```sql
+SELECT
+ <column>
+FROM TABLE(
+  http(
+    userName => 'bob',
+    password => 'secret',
+    uris => ?,
+    format => 'csv'
+    )
+  ) (x VARCHAR, y VARCHAR, z BIGINT)
+```
 
-#### `HTTP`
+Provide the list of URIs (in this case) as a query parameter in each ingest. Doing
+so is simpler than writing a script to insert the array into the SQL text.
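+
+As a sketch (and assuming the Druid SQL API's `parameters` field accepts an
+array value for the `?` placeholder), the request body for such a parameterized
+ingest might look like:
+
+```json
+{
+  "query": "INSERT INTO dst SELECT x, y, z FROM TABLE(http(userName => 'bob', password => 'secret', uris => ?, format => 'csv')) (x VARCHAR, y VARCHAR, z BIGINT) PARTITIONED BY ALL",
+  "parameters": [
+    { "type": "ARRAY", "value": ["https://example.com/foo.csv", "https://example.com/bar.csv"] }
+  ]
+}
+```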
 
-The `HTTP` table function represents the `HttpInputSource` class in Druid which allows you to
-read from an HTTP server. The function accepts the following arguments:
+#### `HTTP` Function
 
-| Name | Description | JSON equivalent | Required |
-| ---- | ----------- | --------------- | -------- |
-| `userName` | Basic authentication user name | `httpAuthenticationUsername` | No |
-| `password` | Basic authentication password | `httpAuthenticationPassword` | No |
-| `passwordEnvVar` | Environment variable that contains the basic authentication password| `httpAuthenticationPassword` | No |
-| `uris` | Comma-separated list of URIs to read. | `uris` | Yes |
+The `HTTP` table function represents the
+[HTTP input source](../ingestion/native-batch-input-sources.md#http-input-source),
+which reads from an HTTP server. The function accepts the following arguments:
 
-#### `INLINE`
+* `userName` (`VARCHAR`) - Same as JSON `httpAuthenticationUsername`.
+* `password` (`VARCHAR`) - Same as `httpAuthenticationPassword` when used with the default option.
+* `passwordEnvVar` (`VARCHAR`) - Same as `httpAuthenticationPassword` when used with
+  the `"type": "environment"` option.
+* `uris` (`ARRAY` of `VARCHAR`) - The list of URIs to read.
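+
+As an illustration (the environment variable name here is a placeholder), a call
+that reads the password from an environment variable might look like:
+
+```sql
+FROM TABLE(
+  http(
+    userName => 'bob',
+    passwordEnvVar => 'HTTP_PASSWORD',
+    uris => ARRAY['https://example.com/foo.csv'],
+    format => 'csv'
+    )
+  ) (x VARCHAR, y VARCHAR, z BIGINT)
+```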
 
-The `INLINE` table function represents the `InlineInputSource` class in Druid which provides
-data directly in the table function. The function accepts the following arguments:
+#### `INLINE` Function
 
-| Name | Description | JSON equivalent | Required |
-| ---- | ----------- | --------------- | -------- |
-| `data` | Text lines of inline data. Separate lines with a newline. | `data` | Yes |
+The `INLINE` table function represents the
+[Inline input source](../ingestion/native-batch-input-sources.md#inline-input-source)
+which provides data directly in the table function. Parameter:
 
-#### `LOCALFILES`
+* `data` (`ARRAY` of `VARCHAR`) - Data lines, without a trailing newline, as an array.
 
-The `LOCALFILES` table function represents the `LocalInputSource` class in Druid which reads
+Example:
+
+```sql
+FROM TABLE(
+  inline(
+    data => ARRAY[
+       'a,b',
+       'c,d'],
+    format => 'csv'
+    )
+  ) (x VARCHAR, y VARCHAR)
+```
+
+
+#### `LOCALFILES` Function
+
+The `LOCALFILES` table function represents the
+[Local input source](../ingestion/native-batch-input-sources.md#local-input-source) which reads
 files from the file system of the node running Druid. This is most useful for single-node
-installations. The function accepts the following arguments:
+installations. The function accepts the following parameters:
 
-| Name | Description | JSON equivalent | Required |
-| ---- | ----------- | --------------- | -------- |
-| `baseDir` | Directory to read from. | `baseDir` | No |
-| `filter` | Filter pattern to read. Example: `*.csv`. | `filter` | No |
-| `files` | Comma-separated list of files to read. | `files` | No |
+* `baseDir` (`VARCHAR`) - Directory to read from.
+* `filter` (`VARCHAR`) - Filter pattern to read. Example: `*.csv`.
+* `files` (`ARRAY` of `VARCHAR`) - List of files to read.
 
-You must either provide the `baseDir` or the list of `files`. You can provide both, in which case
-the files are assumed relative to the `baseDir`. If you provide a `filter`, you must provide the
-`baseDir`.
+When the local files input source is used directly in an `extern` function or an ingestion spec, you
+can provide either `baseDir` and `filter` or `files`, but not both. This function, however, allows
+you to provide any of the following combinations:
 
-#### Table Function Format
+* `baseDir` - Matches all files in the given directory. (Assumes the filter is `*`.)
+* `baseDir` and `filter` - Match files in the given directory using the filter.
+* `baseDir` and `files` - A set of files relative to `baseDir`.
+* `files` - The files should be absolute paths; otherwise they are resolved relative to Druid's
+  working directory (usually the Druid install directory).
 
-Each of the table functions above requires that you specify a format.
+Examples:
 
-| Name | Description | JSON equivalent | Required |
-| ---- | ----------- | --------------- | -------- |
-| `format` | The input format, using the same names as for `EXTERN`. | `inputFormat.type` | Yes |
+```sql
+  -- All files in /tmp, which must be CSV files
+  localfiles(baseDir => '/tmp',
+             format => 'csv')
+
+  -- CSV files in /tmp
+  localfiles(baseDir => '/tmp',
+             filter => '*.csv',
+             format => 'csv')
+
+  -- /tmp/a.csv and /tmp/b.csv
+  localfiles(baseDir => '/tmp',
+             files => ARRAY['a.csv', 'b.csv'],
+             format => 'csv')
+
+  -- /tmp/a.csv and /tmp/b.csv
+  localfiles(files => ARRAY['/tmp/a.csv', '/tmp/b.csv'],
+             format => 'csv')
+```
 
-#### CSV Format
+#### `S3` Function
 
-Use the `csv` format to read from CSV. This choice selects the Druid `CsvInputFormat` class.
+The `S3` table function represents the
+[S3 input source](../ingestion/native-batch-input-sources.md#s3-input-source) which reads
+files from an S3 bucket. The function accepts the following parameters to specify the
+objects to read:
 
-| Name | Description | JSON equivalent | Required |
-| ---- | ----------- | --------------- | -------- |
-| `listDelimiter` | The delimiter to use for fields that represent a list of strings. | `listDelimiter` | No |
-| `skipRows` | The number of rows to skip at the start of the file. Default is 0. | `skipHeaderRows` | No |
+* `uris` (`ARRAY` of `VARCHAR`)
+* `prefix` (`VARCHAR`) - Corresponds to the JSON `prefixes` property, but allows a single
+  prefix.
+* `bucket` (`VARCHAR`) - Corresponds to the `bucket` field of the `objects` JSON field. SQL
+  does not have syntax for an array of objects. Instead, this function takes a single bucket
+  and one or more objects within that bucket.
+* `paths` (`ARRAY` of `VARCHAR`) - Corresponds to the `path` fields of the `objects` JSON field.
+  All paths are within the single `bucket` parameter.
 
-MSQ does not have the ability to infer schema from a CSV, file, so the `findColumnsFromHeader` property
-is unavailable. Instead, Columns are given using the `EXTEND` syntax described above.
+The S3 input source accepts one of the following patterns:
 
-#### Delimited Text Format
+* `uris` - A list of fully-qualified object URIs.
+* `prefixes` - A list of fully-qualified "folder" prefixes.
+* `bucket` and `paths` - A list of objects relative to the given bucket path.
+
+The `S3` function also accepts the following security properties:
 
-Use the `tsv` format to read from an arbitrary delimited (CSV-like) file such as tab-delimited,
-pipe-delimited, etc. This choice selects the Druid `DelimitedInputFormat` class.
+* `accessKeyId` (`VARCHAR`)
+* `secretAccessKey` (`VARCHAR`)
+* `assumeRoleArn` (`VARCHAR`)
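+
+For example, a sketch that combines `bucket` and `paths` with explicit credentials
+(the bucket name and key values are placeholders):
+
+```sql
+  s3(bucket => 'my-bucket',
+     paths => ARRAY['foo/file1.json', 'bar/file2.json'],
+     accessKeyId => 'ACCESS_KEY_ID_PLACEHOLDER',
+     secretAccessKey => 'SECRET_KEY_PLACEHOLDER',
+     format => 'json')
+```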
 
-| Name | Description | JSON equivalent | Required |
-| ---- | ----------- | --------------- | -------- |
-| `delimiter` | The delimiter which separates fields. | `delimiter` | Yes |
-| `listDelimiter` | The delimiter to use for fields that represent a list of strings. | `listDelimiter` | No |
-| `skipRows` | The number of rows to skip at the start of the file. Default is 0. | `skipHeaderRows` | No |
+The `S3` table function does not support either the `clientConfig` or `proxyConfig`
+JSON properties.
 
-As noted above, MSQ cannot infer schema using headers. Use `EXTEND` instead.
+If you need the full power of the S3 input source, then consider the use of the `extern`
+function, which accepts the full S3 input source serialized as JSON. Alternatively,
+create a catalog external table that has the full set of properties, leaving just the
+`uris` or `paths` to be provided at query time.
+
+Examples, each of which corresponds to an example on the
+[S3 input source](../ingestion/native-batch-input-sources.md#s3-input-source) page.
+The examples omit the format and schema; however, you must provide those
+in an actual query.
+
+```sql
+  TABLE(S3(uris => ARRAY['s3://foo/bar/file.json', 's3://bar/foo/file2.json']))

Review Comment:
   nit: to make consistent with the format of the localFiles function given 
above, should we drop the `TABLE(` prefix here, and specify the format?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

