[
https://issues.apache.org/jira/browse/ARROW-11558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291696#comment-17291696
]
Ian Cook commented on ARROW-11558:
----------------------------------
Thanks [~willjones127]—yep, apparently the S3 Select output serialization
formats are currently limited to CSV and JSON. I followed this chain of links
to confirm this:
# S3 Select user guide:
[https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html]
# SelectObjectContent API reference page:
[https://docs.aws.amazon.com/AmazonS3/latest/API/API_SelectObjectContent.html]
# OutputSerialization API reference page:
[https://docs.aws.amazon.com/AmazonS3/latest/API/API_OutputSerialization.html]
(see only CSV and JSON listed there)
This, combined with the limited set of object data file formats, encodings, and
compression formats that S3 Select supports, certainly makes the practical
applications of S3 Select within Arrow fairly narrow. However, it might still be
worth considering whether S3 Select could improve the speed and cost of
retrieving data from S3 when Arrow is running outside AWS, for example when the
user wants to use Arrow to select very small numbers of records/fields from very
large sets of data files. But it might be that the complexity of implementing
this in Arrow is not warranted given the narrow range of practical applications.
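To make the small-selection case concrete, here is a rough Python sketch of what pushing a projection and a filter down to S3 Select could look like from a client. It is only an illustration, not a proposal for how Arrow's C++ filesystem layer would do it. The helper names (build_select_request, rows_from_events, select_rows) are hypothetical, and the sketch assumes a CSV object with a header row and that boto3 is installed for the actual network call. Note that OutputSerialization is set to CSV, in line with the CSV/JSON limitation described above.

```python
import csv
import io


def build_select_request(bucket, key, columns, predicate=None):
    """Build keyword arguments for boto3's S3 client select_object_content().

    Hypothetical helper: `columns` is the projection, `predicate` is an
    optional SQL WHERE clause body (the selection).
    """
    projection = ", ".join(f's."{c}"' for c in columns) if columns else "*"
    expression = f"SELECT {projection} FROM S3Object s"
    if predicate:
        expression += f" WHERE {predicate}"
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumption: the object is a CSV file whose first line is a header.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        # Only CSV and JSON are available as output formats.
        "OutputSerialization": {"CSV": {}},
    }


def rows_from_events(events):
    """Join the 'Records' chunks of the response event stream and parse CSV.

    A record may be split across chunks, so the payloads are concatenated
    before parsing.
    """
    payload = b"".join(
        event["Records"]["Payload"] for event in events if "Records" in event
    )
    return list(csv.reader(io.StringIO(payload.decode("utf-8"))))


def select_rows(bucket, key, columns, predicate=None):
    """Run the query against S3 (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the helpers above stay stdlib-only

    s3 = boto3.client("s3")
    response = s3.select_object_content(
        **build_select_request(bucket, key, columns, predicate)
    )
    return rows_from_events(response["Payload"])
```

The rows come back as text and would still need to be parsed and converted into Arrow arrays, which is part of the implementation complexity weighed above.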
> [C++] Push down projection and selection to S3 Select
> -----------------------------------------------------
>
> Key: ARROW-11558
> URL: https://issues.apache.org/jira/browse/ARROW-11558
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Ian Cook
> Priority: Major
> Labels: filesystem
>
> Amazon S3 Select [1], an S3 feature generally available since April 2018 [2],
> can improve S3 read performance by allowing S3 clients to use a limited
> subset of SQL to specify projection and selection [3] on data in some formats
> [4]. It would be interesting to try using this in Arrow and to measure its
> effects on S3 read performance under various conditions.
> [1] [https://aws.amazon.com/blogs/aws/s3-glacier-select/]
> [2] [https://aws.amazon.com/about-aws/whats-new/2018/04/amazon-s3-select-is-now-generally-available/]
> [3] [https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html]
> [4] [https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html]