317brian commented on code in PR #15689:
URL: https://github.com/apache/druid/pull/15689#discussion_r1480461042
##########
docs/multi-stage-query/concepts.md:
##########
@@ -115,6 +115,14 @@ When deciding whether to use `REPLACE` or `INSERT`, keep in mind that segments g
with dimension-based pruning but those generated with `INSERT` cannot. For more information about the requirements
for dimension-based pruning, see [Clustering](#clustering).
+### Write to an external destination with `EXTERN`
+
+Query tasks can write data to an external destination through the `EXTERN` function, when it is used with the `INTO`
+clause, such as `REPLACE INTO EXTERN(...)` The EXTERN function takes arguments which specifies where to the files should be created.
+The format can be specified using an `AS` clause.
Review Comment:
```suggestion
Query tasks can write data to an external destination through the `EXTERN` function when it is used with the `INTO`
clause, such as `REPLACE INTO EXTERN(...)`. The EXTERN function takes arguments that specify where to write the files.
The format can be specified using an `AS` clause.
```
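For readers following the thread, a minimal sketch of the statement shape under discussion (the `S3()` destination function, bucket, prefix, and table names are illustrative placeholders drawn from the reference.md hunks below, not part of this comment):
```sql
-- Illustrative only: assumes the S3() destination function documented
-- elsewhere in this PR; bucket, prefix, and table names are made up.
INSERT INTO
  EXTERN(S3(bucket => 's3://example-bucket', prefix => 'export/wiki'))
AS CSV
SELECT channel, page
FROM wikipedia
```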
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
Review Comment:
```suggestion
This variation of EXTERN requires two arguments: the details of the destination and an `AS` clause to specify the format of the exported rows.
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
Review Comment:
```suggestion
INSERT statements append the results to the existing files at the destination.
```sql
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
+
+##### S3
+
+Exporting results to S3 can be done by passing the function `S3()` as an argument to the `EXTERN` function. The `druid-s3-extensions` should be loaded.
+The `S3()` function is a druid function which configures the connection. Arguments to `S3()` should be passed as named parameters with the value in single quotes like the example below.
+
+```sql
+INSERT INTO
+ EXTERN(
+ S3(bucket => 's3://your_bucket', prefix => 'prefix/to/files')
+ )
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Supported arguments to the function:
+
+| Parameter | Required | Description | Default |
+|-----------|----------|-------------|---------|
+| `bucket` | Yes | The S3 bucket to which the files are exported to. | n/a |
+| `prefix` | Yes | Path where the exported files would be created. The export query would expect the destination to be empty. If the location includes other files, then the query will fail. | n/a |
+
+The following runtime parameters must be configured to export into an S3 destination:
+
+| Runtime Parameter | Required | Description | Default |
+|-------------------|----------|-------------|---------|
+| `druid.export.storage.s3.tempSubDir` | Yes | Directory used to store temporary files required while uploading the data. | n/a |
+| `druid.export.storage.s3.allowedExportPaths` | Yes | An array of S3 prefixes which are whitelisted as export destinations. Export query fail if the export destination does not match any of the configured prefixes. Example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a |
Review Comment:
```suggestion
| `druid.export.storage.s3.allowedExportPaths` | Yes | An array of S3 prefixes that are whitelisted as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. Example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a |
```
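For reference, a hypothetical `runtime.properties` fragment combining these settings might look like this (the directory and bucket are placeholders):
```properties
# Illustrative values only; adjust for your deployment.
druid.export.storage.s3.tempSubDir=/tmp/druid-export
druid.export.storage.s3.allowedExportPaths=["s3://example-bucket/export/"]
druid.export.storage.s3.maxRetry=10
druid.export.storage.s3.chunkSize=100MiB
```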
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
+
+##### S3
+
+Exporting results to S3 can be done by passing the function `S3()` as an argument to the `EXTERN` function. The `druid-s3-extensions` should be loaded.
+The `S3()` function is a druid function which configures the connection. Arguments to `S3()` should be passed as named parameters with the value in single quotes like the example below.
+
+```sql
+INSERT INTO
+ EXTERN(
+ S3(bucket => 's3://your_bucket', prefix => 'prefix/to/files')
+ )
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Supported arguments to the function:
+
+| Parameter | Required | Description | Default |
+|-----------|----------|-------------|---------|
+| `bucket` | Yes | The S3 bucket to which the files are exported to. | n/a |
+| `prefix` | Yes | Path where the exported files would be created. The export query would expect the destination to be empty. If the location includes other files, then the query will fail. | n/a |
+
+The following runtime parameters must be configured to export into an S3 destination:
+
+| Runtime Parameter | Required | Description | Default |
+|-------------------|----------|-------------|---------|
+| `druid.export.storage.s3.tempSubDir` | Yes | Directory used to store temporary files required while uploading the data. | n/a |
+| `druid.export.storage.s3.allowedExportPaths` | Yes | An array of S3 prefixes which are whitelisted as export destinations. Export query fail if the export destination does not match any of the configured prefixes. Example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a |
+| `druid.export.storage.s3.maxRetry` | No | Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. | 10 |
+| `druid.export.storage.s3.chunkSize` | No | Defines the size of each chunk to temporarily store in `tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls to S3, however it requires more disk space to store the temporary chunks. | 100MiB |
+
+##### LOCAL
+
+Exporting is also supported to the local storage, which exports the results to the filesystem of the MSQ worker.
+This is useful in a single node setup or for testing, and is not suitable for production use cases.
+
+This can be done by passing the function `LOCAL()` as an argument to the `EXTERN FUNCTION`.
+Arguments to `LOCAL()` should be passed as named parameters with the value in single quotes like the example below.
+
+To use local as an export destination, the runtime property `druid.export.storage.baseDir` must be configured on the indexer/middle manager.
+The parameter provided to the `LOCAL()` function will be prefixed with this value when exporting to a local destination.
Review Comment:
```suggestion
Arguments to `LOCAL()` should be passed as named parameters with the value in single quotes in the following example:
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
Review Comment:
```suggestion
Keep the following in mind when using EXTERN to export rows:
- Only INSERT statements are supported.
- Only `CSV` format is supported as an export format.
- Partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) aren't supported with export statements.
- You can export to Amazon S3 or local storage.
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
Review Comment:
```suggestion
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
Review Comment:
When you export data, use the `rowsPerPage` context parameter to control how many rows go into each exported file. The default is 100,000.
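As a sketch, a context parameter like `rowsPerPage` is typically supplied in the JSON payload submitted to the SQL task endpoint (`/druid/v2/sql/task`); the query and value here are illustrative:
```json
{
  "query": "INSERT INTO EXTERN(S3(bucket => 's3://example-bucket', prefix => 'export/wiki')) AS CSV SELECT channel FROM wikipedia",
  "context": {
    "rowsPerPage": 50000
  }
}
```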
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
+
+##### S3
+
+Exporting results to S3 can be done by passing the function `S3()` as an argument to the `EXTERN` function. The `druid-s3-extensions` should be loaded.
+The `S3()` function is a druid function which configures the connection. Arguments to `S3()` should be passed as named parameters with the value in single quotes like the example below.
+
+```sql
+INSERT INTO
+ EXTERN(
+ S3(bucket => 's3://your_bucket', prefix => 'prefix/to/files')
+ )
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Supported arguments to the function:
+
+| Parameter | Required | Description | Default |
+|-----------|----------|-------------|---------|
+| `bucket` | Yes | The S3 bucket to which the files are exported to. | n/a |
+| `prefix` | Yes | Path where the exported files would be created. The export query would expect the destination to be empty. If the location includes other files, then the query will fail. | n/a |
+
+The following runtime parameters must be configured to export into an S3 destination:
+
+| Runtime Parameter | Required | Description | Default |
+|-------------------|----------|-------------|---------|
+| `druid.export.storage.s3.tempSubDir` | Yes | Directory used to store temporary files required while uploading the data. | n/a |
+| `druid.export.storage.s3.allowedExportPaths` | Yes | An array of S3 prefixes which are whitelisted as export destinations. Export query fail if the export destination does not match any of the configured prefixes. Example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a |
+| `druid.export.storage.s3.maxRetry` | No | Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. | 10 |
+| `druid.export.storage.s3.chunkSize` | No | Defines the size of each chunk to temporarily store in `tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls to S3, however it requires more disk space to store the temporary chunks. | 100MiB |
+
+##### LOCAL
+
+Exporting is also supported to the local storage, which exports the results to the filesystem of the MSQ worker.
Review Comment:
```suggestion
You can export to the local storage, which exports the results to the filesystem of the MSQ worker.
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
Review Comment:
```suggestion
`EXTERN` can be used to specify a destination where you want to export data to.
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
+
+##### S3
+
+Exporting results to S3 can be done by passing the function `S3()` as an argument to the `EXTERN` function. The `druid-s3-extensions` should be loaded.
Review Comment:
```suggestion
Export results to S3 by passing the function `S3()` as an argument to the `EXTERN` function. Note that this requires the `druid-s3-extensions`.
```
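For completeness, loading the extension is typically done through the common runtime properties, for example (list only what the cluster needs):
```properties
# Assumes the standard extensions mechanism.
druid.extensions.loadList=["druid-s3-extensions"]
```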
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
+
+##### S3
+
+Exporting results to S3 can be done by passing the function `S3()` as an argument to the `EXTERN` function. The `druid-s3-extensions` should be loaded.
+The `S3()` function is a druid function which configures the connection. Arguments to `S3()` should be passed as named parameters with the value in single quotes like the example below.
+
+```sql
+INSERT INTO
+ EXTERN(
+ S3(bucket => 's3://your_bucket', prefix => 'prefix/to/files')
+ )
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Supported arguments to the function:
Review Comment:
```suggestion
Supported arguments for the function:
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
+
+##### S3
+
+Exporting results to S3 can be done by passing the function `S3()` as an argument to the `EXTERN` function. The `druid-s3-extensions` should be loaded.
+The `S3()` function is a druid function which configures the connection. Arguments to `S3()` should be passed as named parameters with the value in single quotes like the example below.
Review Comment:
```suggestion
The `S3()` function is a Druid function that configures the connection. Arguments for `S3()` should be passed as named parameters with the value in single quotes like the following example:
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
+
+##### S3
+
+Exporting results to S3 can be done by passing the function `S3()` as an argument to the `EXTERN` function. The `druid-s3-extensions` should be loaded.
+The `S3()` function is a druid function which configures the connection. Arguments to `S3()` should be passed as named parameters with the value in single quotes like the example below.
+
+```sql
+INSERT INTO
+ EXTERN(
+ S3(bucket => 's3://your_bucket', prefix => 'prefix/to/files')
+ )
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Supported arguments to the function:
+
+| Parameter | Required | Description | Default |
+|-----------|----------|-------------|---------|
+| `bucket` | Yes | The S3 bucket to which the files are exported to. | n/a |
+| `prefix` | Yes | Path where the exported files would be created. The export query would expect the destination to be empty. If the location includes other files, then the query will fail. | n/a |
+
+The following runtime parameters must be configured to export into an S3 destination:
+
+| Runtime Parameter | Required | Description | Default |
+|-------------------|----------|-------------|---------|
+| `druid.export.storage.s3.tempSubDir` | Yes | Directory used to store temporary files required while uploading the data. | n/a |
+| `druid.export.storage.s3.allowedExportPaths` | Yes | An array of S3 prefixes which are whitelisted as export destinations. Export query fail if the export destination does not match any of the configured prefixes. Example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a |
+| `druid.export.storage.s3.maxRetry` | No | Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. | 10 |
+| `druid.export.storage.s3.chunkSize` | No | Defines the size of each chunk to temporarily store in `tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls to S3, however it requires more disk space to store the temporary chunks. | 100MiB |
+
+##### LOCAL
+
+Exporting is also supported to the local storage, which exports the results to the filesystem of the MSQ worker.
+This is useful in a single node setup or for testing, and is not suitable for production use cases.
+
+This can be done by passing the function `LOCAL()` as an argument to the `EXTERN FUNCTION`.
+Arguments to `LOCAL()` should be passed as named parameters with the value in single quotes like the example below.
+
+To use local as an export destination, the runtime property `druid.export.storage.baseDir` must be configured on the indexer/middle manager.
+The parameter provided to the `LOCAL()` function will be prefixed with this value when exporting to a local destination.
+
+```sql
+INSERT INTO
+ EXTERN(
+ local(exportPath => 'exportLocation/query1')
+ )
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Supported arguments to the function:
+
+| Parameter | Required | Description | Default |
+|-----------|----------|-------------|---------|
+| `exportPath` | Yes | Subdirectory of `druid.export.storage.baseDir` used to as the destination to export the results to. The export query expects the destination to be empty. If the location includes other files or directories, then the query will fail. | n/a |
Review Comment:
```suggestion
| `exportPath` | Yes | Subdirectory of `druid.export.storage.baseDir` used as the destination to export the results to. The export query expects the destination to be empty. If the location includes other files or directories, then the query will fail. | n/a |
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
+
+##### S3
+
+Exporting results to S3 can be done by passing the function `S3()` as an argument to the `EXTERN` function. The `druid-s3-extensions` should be loaded.
+The `S3()` function is a druid function which configures the connection. Arguments to `S3()` should be passed as named parameters with the value in single quotes like the example below.
+
+```sql
+INSERT INTO
+ EXTERN(
+ S3(bucket => 's3://your_bucket', prefix => 'prefix/to/files')
+ )
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Supported arguments to the function:
+
+| Parameter | Required | Description | Default |
+|-----------|----------|-------------|---------|
+| `bucket` | Yes | The S3 bucket to which the files are exported to. | n/a |
+| `prefix` | Yes | Path where the exported files would be created. The export query would expect the destination to be empty. If the location includes other files, then the query will fail. | n/a |
Review Comment:
```suggestion
| `prefix` | Yes | Path where the exported files would be created. The export query expects the destination to be empty. If the location includes other files, the query will fail. | n/a |
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
+
+##### S3
+
+Exporting results to S3 can be done by passing the function `S3()` as an argument to the `EXTERN` function. The `druid-s3-extensions` should be loaded.
+The `S3()` function is a druid function which configures the connection. Arguments to `S3()` should be passed as named parameters with the value in single quotes like the example below.
+
+```sql
+INSERT INTO
+ EXTERN(
+ S3(bucket => 's3://your_bucket', prefix => 'prefix/to/files')
+ )
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Supported arguments to the function:
+
+| Parameter | Required | Description | Default |
+|-----------|----------|-------------|---------|
+| `bucket` | Yes | The S3 bucket to which the files are exported to. | n/a |
+| `prefix` | Yes | Path where the exported files would be created. The export query would expect the destination to be empty. If the location includes other files, then the query will fail. | n/a |
+
+The following runtime parameters must be configured to export into an S3 destination:
+
+| Runtime Parameter | Required | Description | Default |
+|-------------------|----------|-------------|---------|
+| `druid.export.storage.s3.tempSubDir` | Yes | Directory used to store temporary files required while uploading the data. | n/a |
+| `druid.export.storage.s3.allowedExportPaths` | Yes | An array of S3 prefixes which are whitelisted as export destinations. Export query fail if the export destination does not match any of the configured prefixes. Example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a |
+| `druid.export.storage.s3.maxRetry` | No | Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. | 10 |
+| `druid.export.storage.s3.chunkSize` | No | Defines the size of each chunk to temporarily store in `tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls to S3, however it requires more disk space to store the temporary chunks. | 100MiB |
+
+##### LOCAL
+
+Exporting is also supported to the local storage, which exports the results to the filesystem of the MSQ worker.
+This is useful in a single node setup or for testing, and is not suitable for production use cases.
Review Comment:
```suggestion
This is useful in a single node setup or for testing but is not suitable for production use cases.
```
##########
docs/multi-stage-query/reference.md:
##########
@@ -90,6 +93,93 @@ can precede the column list: `EXTEND (timestamp VARCHAR...)`.
For more information, see [Read external data with EXTERN](concepts.md#read-external-data-with-extern).
+#### `EXTERN` to export to a destination
+
+`EXTERN` can be used to specify a destination, where the data needs to be exported.
+This variation of EXTERN requires one argument, the details of the destination as specified below.
+This variation additionally requires an `AS` clause to specify the format of the exported rows.
+
+Only INSERT statements are supported with an `EXTERN` destination.
+Only `CSV` format is supported at the moment.
+Please note that partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) is not currently supported with export statements.
+
+Export statements support the context parameter `rowsPerPage` for the number of rows in each exported file. The default value
+is 100,000.
+
+INSERT statements append the results to the existing files at the destination.
+```sql
+INSERT INTO
+ EXTERN(<destination function>)
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Exporting is currently supported for Amazon S3 storage and local storage.
+
+##### S3
+
+Exporting results to S3 can be done by passing the function `S3()` as an argument to the `EXTERN` function. The `druid-s3-extensions` should be loaded.
+The `S3()` function is a druid function which configures the connection. Arguments to `S3()` should be passed as named parameters with the value in single quotes like the example below.
+
+```sql
+INSERT INTO
+ EXTERN(
+ S3(bucket => 's3://your_bucket', prefix => 'prefix/to/files')
+ )
+AS CSV
+SELECT
+ <column>
+FROM <table>
+```
+
+Supported arguments to the function:
+
+| Parameter | Required | Description | Default |
+|-----------|----------|-------------|---------|
+| `bucket` | Yes | The S3 bucket to which the files are exported to. | n/a |
+| `prefix` | Yes | Path where the exported files would be created. The export query would expect the destination to be empty. If the location includes other files, then the query will fail. | n/a |
+
+The following runtime parameters must be configured to export into an S3 destination:
+
+| Runtime Parameter | Required | Description | Default |
+|-------------------|----------|-------------|---------|
+| `druid.export.storage.s3.tempSubDir` | Yes | Directory used to store temporary files required while uploading the data. | n/a |
+| `druid.export.storage.s3.allowedExportPaths` | Yes | An array of S3 prefixes which are whitelisted as export destinations. Export query fail if the export destination does not match any of the configured prefixes. Example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a |
+| `druid.export.storage.s3.maxRetry` | No | Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. | 10 |
+| `druid.export.storage.s3.chunkSize` | No | Defines the size of each chunk to temporarily store in `tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls to S3, however it requires more disk space to store the temporary chunks. | 100MiB |
+
+##### LOCAL
+
+Exporting is also supported to the local storage, which exports the results to the filesystem of the MSQ worker.
+This is useful in a single node setup or for testing, and is not suitable for production use cases.
+
+This can be done by passing the function `LOCAL()` as an argument to the `EXTERN FUNCTION`.
Review Comment:
```suggestion
Export results to local storage by passing the function `LOCAL()` as an argument for the `EXTERN FUNCTION`. To use local storage as an export destination, the runtime property `druid.export.storage.baseDir` must be configured on the Indexer/Middle Manager.
The parameter provided to the `LOCAL()` function will be prefixed with this value when exporting to a local destination.
```
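A hypothetical configuration matching this suggestion (the path is a placeholder):
```properties
# Set on the Indexer/Middle Manager; LOCAL(exportPath => ...) is resolved
# relative to this directory.
druid.export.storage.baseDir=/opt/druid/export
```
With this value, `LOCAL(exportPath => 'exportLocation/query1')` would write files under `/opt/druid/export/exportLocation/query1`, per the prefixing behavior described above.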
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]