[jira] [Resolved] (ASTERIXDB-3286) Support COPY TO

Wail Y. Alkowaileet (Jira) Fri, 12 Jan 2024 13:29:07 -0800


     [ 
https://issues.apache.org/jira/browse/ASTERIXDB-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wail Y. Alkowaileet resolved ASTERIXDB-3286.
--------------------------------------------
    Resolution: Fixed

> Support COPY TO
> ---------------
>
>                 Key: ASTERIXDB-3286
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-3286
>             Project: Apache AsterixDB
>          Issue Type: Epic
>          Components: COMP - Compiler, RT - Runtime, SQL - Translator SQL++
>    Affects Versions: 0.9.9
>            Reporter: Wail Y. Alkowaileet
>            Assignee: Wail Y. Alkowaileet
>            Priority: Major
>              Labels: triaged
>             Fix For: 0.9.9
>
>
> Currently, AsterixDB does not have a clean way to extract a query result or
> dump a dataset to a storage device. The only channel currently provided is
> the Query Service (i.e., running the query and writing the result somehow on
> the client side). We need to support a way to write query results (or dump a
> dataset) in parallel to a storage device.
>  
> To illustrate, we want to do the following:
> {noformat}
> USE CopyToDataverse;
> COPY ColumnDataset
> TO localfs
> PATH("localhost:///media/backup/CopyToResult")
> WITH {
>     "format" : "json"
> };{noformat}
> In this example, the data in ColumnDataset will be written on each node at
> the provided path localhost:///media/backup/CopyToResult. Simply put, each
> node will write its own partitions of ColumnDataset locally. The written
> files will be in raw JSON format.
>  
> Another example:
> {noformat}
> USE CopyToDataverse;
> COPY (SELECT cd.uid uid, 
>              cd.sensor_info.name name, 
>              to_bigint(cd.sensor_info.battery_status) battery_status
>       FROM ColumnDataset cd
> ) toWrite
> TO s3 
> PATH("CopyToResult/" || to_string(b))
> OVER (
>    PARTITION BY toWrite.battery_status b
>    ORDER BY toWrite.name
> )
> WITH {
>     "format" : "json",
>     "compression": "gzip",
>     "max-objects-per-file": 100,
>     "container": "myBucket",
>     "accessKeyId": "<access-key>",
>     "secretAccessKey": "<secret-key>",
>     "region": "us-west-2"
> };{noformat}
> The second example shows how to write the result of a query and also
> partition the result so that each partition is written to its own path.
> In this example, we partition by battery_status (say, an integer value from
> 0 to 100). The final result will be written to myBucket in Amazon S3.
> Each partition will have the path CopyToResult/<battery_status> (for example,
> CopyToResult/0, CopyToResult/1, ..., CopyToResult/99, CopyToResult/100). This
> partitioning scheme can be useful if a user wants to exploit dynamic prefixes
> (external data filters) (see ASTERIXDB-3073).
> Additionally, the records in each partition will be ordered by the sensor
> name (toWrite.name). Note that this ordering isn't global but per
> partition.
> Also, the written files will be compressed using *gzip*, and each file should
> have at most 100 records ({*}max-objects-per-file{*}).
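
The semantics described for the second example (partition by a key, order within each partition, cap the number of records per file, gzip the output) can be sketched outside AsterixDB as follows. This is only an illustration in Python, not the AsterixDB implementation: the function name copy_to_partitions, the in-memory dict standing in for the bucket, the one-JSON-record-per-line layout, and the part-NNNNN.json.gz file names are all assumptions made for the sketch.

```python
import gzip
import json
from collections import defaultdict


def copy_to_partitions(records, partition_key, order_key, max_objects_per_file):
    """Group records by partition_key, sort each group by order_key, and
    split each group into gzip-compressed chunks of JSON lines.

    Returns a dict mapping a hypothetical object path to compressed bytes.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[partition_key]].append(rec)

    files = {}
    for key, group in groups.items():
        # Ordering is applied per partition, not globally.
        group.sort(key=lambda r: r[order_key])
        for start in range(0, len(group), max_objects_per_file):
            chunk = group[start:start + max_objects_per_file]
            payload = "\n".join(json.dumps(r) for r in chunk).encode("utf-8")
            # File naming is an assumption; the issue only specifies the
            # per-partition prefix CopyToResult/<battery_status>.
            index = start // max_objects_per_file
            path = f"CopyToResult/{key}/part-{index:05d}.json.gz"
            files[path] = gzip.compress(payload)
    return files


if __name__ == "__main__":
    sample = [
        {"uid": i, "name": f"sensor-{i:03d}", "battery_status": i % 3}
        for i in range(10)
    ]
    for path in sorted(copy_to_partitions(sample, "battery_status", "name", 2)):
        print(path)
```

With max_objects_per_file set to 100, a partition holding 125 records would yield two files under its prefix, one with 100 records and one with 25.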



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
