[
https://issues.apache.org/jira/browse/ASTERIXDB-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wail Y. Alkowaileet resolved ASTERIXDB-3286.
--------------------------------------------
Resolution: Fixed
> Support COPY TO
> ---------------
>
> Key: ASTERIXDB-3286
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-3286
> Project: Apache AsterixDB
> Issue Type: Epic
> Components: COMP - Compiler, RT - Runtime, SQL - Translator SQL++
> Affects Versions: 0.9.9
> Reporter: Wail Y. Alkowaileet
> Assignee: Wail Y. Alkowaileet
> Priority: Major
> Labels: triaged
> Fix For: 0.9.9
>
>
> Currently, AsterixDB do not have a clean way to extract query result or dump
> a dataset to a storage device. The only channel provided currently is the
> Query Service (i.e., running the query and write it somehow at the client
> side). We need to support a way to write query results (or dump a dataset) in
> parallel to a storage device.
>
> To illustrate we want to do the following:
> {noformat}
> USE CopyToDataverse;
> COPY ColumnDataset
> TO localfs
> PATH("localhost:///media/backup/CopyToResult")
> WITH {
> "format" : "json"
> };{noformat}
> In this example, the data in ColumnDataset will be written in each node at
> the provided path localhost:///media/backup/CopyToResult. Simply, each node
> will write its own partitions of the data of ColumnDataset locally. The
> written files will be in raw JSON format.
>
> Another example:
> {noformat}
> USE CopyToDataverse;
> COPY (SELECT cd.uid uid,
> cd.sensor_info.name name,
> to_bigint(cd.sensor_info.battery_status) battery_status
> FROM ColumnDataset cd
> ) toWrite
> TO s3
> PATH("CopyToResult/" || to_string(b))
> OVER (
> PARTITION BY toWrite.battery_status b
> ORDER BY toWrite.name
> )
> WITH {
> "format" : "json",
> "compression": "gzip",
> "max-objects-per-file": 100,
> "container": "myBucket",
> "accessKeyId": "<access-key>",
> "secretAccessKey": "<secret-key>",
> "region": "us-west-2"
> };{noformat}
> The second example shows how to write the result of a query and also
> partition the result so that each partition will be written to a certain
> path. In this example, we partition by the battery_status (say an integer
> value from 0 to 100). The final result will be written to myBucke in Amazon
> S3.
> Each partition will have the path CopyToResult/<battery_status>. For example
> CopyToResult/0, CopyToResult/1 ..., CopyToResult/99, CopyToResult/100). This
> partitioning scheme can be useful if a user wants to exploit dynamic prefixes
> (external data filters) (see ASTERIXDB-3073)
> Additionally, the records in each partition will be ordered by the
> sensor_name (toWrite.name). Note that this ordering isn't global but per
> partition.
> Also, the written files will be compressed using *gzip* and each file should
> have at most 100 records max ({*}max-objects-per-file{*}).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)