Re: [APE] Support COPY TO

Till Westmann Thu, 26 Oct 2023 10:05:24 -0700

Agreed (+1) - this functionality will significantly reduce the pain of moving 
data in and out of object storage.


> On Oct 25, 2023, at 10:04 PM, Mike Carey <dtab...@gmail.com> wrote:
> 
> +1  -- Now maybe users will stop trying to retrieve huge results and 
> wondering why the UI is choking! :-) This capability is actually long overdue.
> 
> On 10/24/23 9:53 AM, Wail Alkowaileet wrote:
>> Currently, AsterixDB does not have a clean way to extract query results or
>> dump a dataset to a storage device. The only channel provided currently is
>> the Query Service (i.e., running the query and writing it somehow at the
>> client side). We need to support a way to write query results (or dump a
>> dataset) in parallel to a storage device.
>> 
>> To illustrate, say we want to do the following:
>> 
>>> USE CopyToDataverse;
>> COPY ColumnDataset
>>> TO localfs
>>> PATH("localhost:///media/backup/CopyToResult")
>>> WITH {
>>>     "format" : "json"
>>> };
>> In this example, the data in ColumnDataset will be written in each node at
>> the provided path localhost:///media/backup/CopyToResult. Simply, each node
>> will write its own partitions for the data stored in ColumnDataset locally.
>> The written files will be in raw JSON format.
>> 
>> Another example:
>> 
>>> USE CopyToDataverse;
>>> COPY (SELECT cd.uid uid,
>>>                             cd.sensor_info.name name,
>>>                             to_bigint(cd.sensor_info.battery_status)
>>> battery_status
>>>              FROM ColumnDataset cd
>>> ) toWrite
>>> TO s3
>>> PATH("CopyToResult/" || to_string(b))
>>> OVER (
>>>    PARTITION BY toWrite.battery_status b
>>>    ORDER BY toWrite.name
>>> )
>>> WITH {
>>>     "format" : "json",
>>>     "compression": "gzip",
>>>     "max-objects-per-file": 100,
>>>     "container": "myBucket",
>>>     "accessKeyId": "<access-key>",
>>>     "secretAccessKey": "<secret-key>",
>>>     "region": "us-west-2"
>>> };
>> The second example shows how to write the result of a query and also
>> partition the result so that each partition will be written to a certain
>> path. In this example, we partition by the battery_status (say an integer
>> value from 0 to 100). The final result will be written to myBucke in Amazon
>> S3.
>> 
>> Each partition will have the path CopyToResult/<battery_status>. For
>> example CopyToResult/0, CopyToResult/1 ..., CopyToResult/99,
>> CopyToResult/100). This partitioning scheme can be useful if a user wants
>> to exploit dynamic prefixes (external filters) (see ASTERIXDB-3073
>> <https://issues.apache.org/jira/browse/ASTERIXDB-3073>).
>> 
>> Additionally, the records in each partition will be ordered by the
>> sensor_name (toWrite.name). Note that this ordering isn't global but per
>> partition.
>> 
>> Also, the written files will be compressed using *gzip* and each file
>> should have at most 100 records max (*max-objects-per-file*).
>> 
>> EPIC: ASTERIXDB-3286<https://issues.apache.org/jira/browse/ASTERIXDB-3286>

Re: [APE] Support COPY TO

Reply via email to