Re: [APE] Support COPY TO

Mike Carey Wed, 25 Oct 2023 22:04:11 -0700

+1 -- Now maybe users will stop trying to retrieve huge results andwondering why the UI is choking! :-) This capability is actually longoverdue.


On 10/24/23 9:53 AM, Wail Alkowaileet wrote:

Currently, AsterixDB does not have a clean way to extract query results or
dump a dataset to a storage device. The only channel provided currently is
the Query Service (i.e., running the query and writing it somehow at the
client side). We need to support a way to write query results (or dump a
dataset) in parallel to a storage device.


To illustrate, say we want to do the following:

USE CopyToDataverse;

COPY ColumnDataset

TO localfs
PATH("localhost:///media/backup/CopyToResult")
WITH {
     "format" : "json"
};

In this example, the data in ColumnDataset will be written in each node at
the provided path localhost:///media/backup/CopyToResult. Simply, each node
will write its own partitions for the data stored in ColumnDataset locally.
The written files will be in raw JSON format.

Another example:

USE CopyToDataverse;
COPY (SELECT cd.uid uid,
                             cd.sensor_info.name name,
                             to_bigint(cd.sensor_info.battery_status)
battery_status
              FROM ColumnDataset cd
) toWrite
TO s3
PATH("CopyToResult/" || to_string(b))
OVER (
    PARTITION BY toWrite.battery_status b
    ORDER BY toWrite.name
)
WITH {
     "format" : "json",
     "compression": "gzip",
     "max-objects-per-file": 100,
     "container": "myBucket",
     "accessKeyId": "<access-key>",
     "secretAccessKey": "<secret-key>",
     "region": "us-west-2"
};

The second example shows how to write the result of a query and also
partition the result so that each partition will be written to a certain
path. In this example, we partition by the battery_status (say an integer
value from 0 to 100). The final result will be written to myBucke in Amazon
S3.

Each partition will have the path CopyToResult/<battery_status>. For
example CopyToResult/0, CopyToResult/1 ..., CopyToResult/99,
CopyToResult/100). This partitioning scheme can be useful if a user wants
to exploit dynamic prefixes (external filters) (see ASTERIXDB-3073
<https://issues.apache.org/jira/browse/ASTERIXDB-3073>).

Additionally, the records in each partition will be ordered by the
sensor_name (toWrite.name). Note that this ordering isn't global but per
partition.

Also, the written files will be compressed using *gzip* and each file
should have at most 100 records max (*max-objects-per-file*).

EPIC: ASTERIXDB-3286<https://issues.apache.org/jira/browse/ASTERIXDB-3286>

Re: [APE] Support COPY TO

Reply via email to