Wail Y. Alkowaileet created ASTERIXDB-3286:
----------------------------------------------
Summary: Support COPY TO
Key: ASTERIXDB-3286
URL: https://issues.apache.org/jira/browse/ASTERIXDB-3286
Project: Apache AsterixDB
Issue Type: Epic
Components: COMP - Compiler, RT - Runtime
Affects Versions: 0.9.9
Reporter: Wail Y. Alkowaileet
Assignee: Wail Y. Alkowaileet
Fix For: 0.9.9

Currently, AsterixDB does not have a clean way to extract a query result or dump a dataset to a storage device. The only channel currently provided is the Query Service (i.e., running the query and writing the result somehow at the client side). We need to support a way to write query results (or dump a dataset) in parallel to a storage device. To illustrate, we want to do the following:

{noformat}
USE CopyToDataverse;

COPY ColumnDataset
TO localfs
PATH("localhost:///media/backup/CopyToResult")
WITH {
    "format": "json"
};
{noformat}

In this example, the data in ColumnDataset will be written on each node at the provided path localhost:///media/backup/CopyToResult. Put simply, each node will write its own partitions of ColumnDataset locally. The written files will be in raw JSON format.

Another example:

{noformat}
USE CopyToDataverse;

COPY (
    SELECT cd.uid uid,
           cd.sensor_info.name name,
           to_bigint(cd.sensor_info.battery_status) battery_status
    FROM ColumnDataset cd
) toWrite
TO s3
PATH("CopyToResult/" || to_string(b))
OVER (
    PARTITION BY toWrite.battery_status b
    ORDER BY toWrite.name
)
WITH {
    "format": "json",
    "compression": "gzip",
    "max-objects-per-file": 100,
    "container": "myBucket",
    "accessKeyId": "<access-key>",
    "secretAccessKey": "<secret-key>",
    "region": "us-west-2"
};
{noformat}

The second example shows how to write the result of a query and also partition the result so that each partition is written to its own path. In this example, we partition by battery_status (say, an integer value from 0 to 100). The final result will be written to myBucket in Amazon S3. Each partition will have the path CopyToResult/<battery_status>.
For example: CopyToResult/0, CopyToResult/1, ..., CopyToResult/99, CopyToResult/100. This partitioning scheme can be useful if a user wants to exploit dynamic prefixes (external data filters); see ASTERIXDB-3073.

Additionally, the records in each partition will be ordered by the sensor name (toWrite.name). Note that this ordering is not global but per partition. Also, the written files will be compressed using *gzip*, and each file will contain at most 100 records ({*}max-objects-per-file{*}).
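
The layout described for the second example can be sketched outside AsterixDB. The Python snippet below is only an illustration of the intended semantics, not the proposed implementation: the function plan_copy_to and its parameters are hypothetical names, and the sketch leaves out JSON serialization, gzip compression, and the S3 upload. It groups records by the partition key, orders each group by the ordering key, and splits each group into chunks of at most max_objects_per_file records, with one chunk standing in for one output file.

```python
from collections import defaultdict


def plan_copy_to(records, partition_key, order_key, prefix, max_objects_per_file):
    """Map each partition path to its list of per-file record chunks.

    Hypothetical helper: it mirrors PARTITION BY, ORDER BY and
    "max-objects-per-file" from the example statement, nothing more.
    """
    if max_objects_per_file < 1:
        raise ValueError("max_objects_per_file must be at least 1")

    # PARTITION BY: group the records by the value of the partition key.
    partitions = defaultdict(list)
    for record in records:
        partitions[record[partition_key]].append(record)

    plan = {}
    for value, rows in partitions.items():
        # ORDER BY: the ordering is per partition, not global.
        rows.sort(key=lambda r: r[order_key])
        # PATH: the prefix concatenated with the partition value as a string.
        path = prefix + str(value)
        # max-objects-per-file: each chunk holds at most that many records.
        plan[path] = [
            rows[i:i + max_objects_per_file]
            for i in range(0, len(rows), max_objects_per_file)
        ]
    return plan


# Made-up sample records, only to exercise the sketch.
records = [
    {"uid": 1, "name": "delta", "battery_status": 0},
    {"uid": 2, "name": "alpha", "battery_status": 0},
    {"uid": 3, "name": "charlie", "battery_status": 100},
    {"uid": 4, "name": "bravo", "battery_status": 0},
]

plan = plan_copy_to(records, "battery_status", "name", "CopyToResult/", 2)
for path in sorted(plan):
    for chunk in plan[path]:
        print(path, [r["name"] for r in chunk])
```

With a limit of 2, the three records whose battery_status is 0 end up in two chunks under CopyToResult/0, ordered by name, while the single record with battery_status 100 forms one chunk under CopyToResult/100.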