[jira] [Updated] (TAJO-2030) Use list S3 files using AmazonS3Client instead of using S3A

Jaehwa Jung (JIRA) Mon, 04 Jan 2016 00:36:12 -0800

     [ 
https://issues.apache.org/jira/browse/TAJO-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jaehwa Jung updated TAJO-2030:
------------------------------
    Description: 
AWS S3 provides a bulk listing API. It takes the common prefix of all input 
paths as a parameter and returns all the objects whose keys start with that 
common prefix, in blocks of up to 1000.
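
As a rough illustration of the listing pattern described above, the Java sketch below computes the common prefix of the input paths and then pages through the matching keys in blocks of up to 1000. It does not call AWS: the sorted set {{bucket}}, the {{listPage}} helper, and the {{l_shipdate=...}} key layout are assumptions made up for this example, standing in for a real paginated listing call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class BulkListingSketch {
    // A single listing call returns at most this many keys.
    static final int PAGE_SIZE = 1000;

    /** Longest common prefix of all input paths. */
    static String commonPrefix(List<String> paths) {
        if (paths.isEmpty()) return "";
        String prefix = paths.get(0);
        for (String p : paths) {
            int max = Math.min(prefix.length(), p.length());
            int i = 0;
            while (i < max && prefix.charAt(i) == p.charAt(i)) i++;
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }

    /** One page of keys; nextMarker is null when the listing is complete. */
    static final class Page {
        final List<String> keys = new ArrayList<>();
        String nextMarker;
    }

    /** Hypothetical in-memory stand-in for one paginated listing call. */
    static Page listPage(TreeSet<String> bucket, String prefix, String marker) {
        Page page = new Page();
        // Keys are sorted, so all keys sharing the prefix are contiguous.
        NavigableSet<String> tail = (marker == null)
                ? bucket.tailSet(prefix, true)
                : bucket.tailSet(marker, false);
        for (String key : tail) {
            if (!key.startsWith(prefix)) break;
            if (page.keys.size() == PAGE_SIZE) {
                // More keys remain: remember where the next page starts.
                page.nextMarker = page.keys.get(PAGE_SIZE - 1);
                break;
            }
            page.keys.add(key);
        }
        return page;
    }

    /** Lists everything under the common prefix, keeping only keys under an input path. */
    static List<String> listAll(TreeSet<String> bucket, List<String> inputPaths, int[] callCount) {
        String prefix = commonPrefix(inputPaths);
        List<String> matched = new ArrayList<>();
        String marker = null;
        do {
            Page page = listPage(bucket, prefix, marker);
            callCount[0]++;
            for (String key : page.keys) {
                for (String path : inputPaths) {
                    if (key.startsWith(path)) { matched.add(key); break; }
                }
            }
            marker = page.nextMarker;
        } while (marker != null);
        return matched;
    }

    /** Made-up bucket contents: three partitions of 600 files plus unrelated keys. */
    static TreeSet<String> sampleBucket() {
        TreeSet<String> bucket = new TreeSet<>();
        for (int day = 2; day <= 4; day++) {
            for (int i = 0; i < 600; i++) {
                bucket.add("lineitem/l_shipdate=1992-01-0" + day + "/part-" + i);
            }
        }
        bucket.add("customer/part-0");
        bucket.add("orders/part-0");
        return bucket;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList(
                "lineitem/l_shipdate=1992-01-02/",
                "lineitem/l_shipdate=1992-01-03/");
        int[] calls = new int[1];
        List<String> files = listAll(sampleBucket(), inputs, calls);
        System.out.println(files.size() + " files in " + calls[0] + " listing calls");
    }
}
```

In this sketch a single paginated listing covers every requested partition, and keys from partitions that were not requested are filtered out on the client. With the AWS SDK for Java, the same loop would presumably be built on {{AmazonS3Client.listObjects}} and {{listNextBatchOfObjects}}; that wiring is not shown here.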

Using AmazonS3Client to list S3 files instead of S3A should improve 
performance. To test this idea, I swapped in PrestoFileSystem for 
S3AFileSystem. For queries with partition filters, PrestoFileSystem was much 
faster than S3AFileSystem: about 1.9 to 2.5 times faster at 30 or more 
partitions in the results below.

Here are my benchmark results for the following queries:
{code}
-- 1 partition
select count(*) from lineitem where l_shipdate = '1992-01-02';
-- 30 partitions
select count(*) from lineitem
  where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
-- 90 partitions
select count(*) from lineitem
  where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
-- 151 partitions
select count(*) from lineitem
  where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
{code}

||Number of partitions||PrestoFileSystem (ms)||S3AFileSystem (ms)||
|1|677|800|
|30|2753|6977|
|90|6825|13772|
|151|13834|25701|
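
To make the gap in the table above easier to read, the following Java sketch divides each S3AFileSystem time by the corresponding PrestoFileSystem time. The class and method names are made up for this example; the timings are copied from the table as-is.

```java
import java.util.Locale;

class ListingSpeedup {
    // Timings copied from the benchmark table, in milliseconds.
    static final int[] PARTITIONS = {1, 30, 90, 151};
    static final long[] PRESTO_MS = {677, 2753, 6825, 13834};
    static final long[] S3A_MS = {800, 6977, 13772, 25701};

    /** How many times faster PrestoFileSystem was than S3AFileSystem. */
    static double speedup(long s3aMs, long prestoMs) {
        return (double) s3aMs / prestoMs;
    }

    public static void main(String[] args) {
        for (int i = 0; i < PARTITIONS.length; i++) {
            double ratio = speedup(S3A_MS[i], PRESTO_MS[i]);
            System.out.println(String.format(Locale.ROOT,
                    "%3d partitions: %.2fx", PARTITIONS[i], ratio));
        }
    }
}
```

The ratios come out at roughly 1.2x for 1 partition, 2.5x for 30, 2.0x for 90, and 1.9x for 151.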

For reference, I used the TPC-H 1 GB dataset and made the {{l_shipdate}} 
column of the {{lineitem}} table the partition column.

I see two ways to implement this:
- Borrow PrestoFileSystem and the related code from Presto
- Implement the necessary code in S3TableSpace, using Presto as a reference

  was:
AWS S3 provides a bulk listing API. It takes the common prefix of all input 
paths as a parameter and returns all the objects whose keys start with that 
common prefix, in blocks of up to 1000.

Using AmazonS3Client to list S3 files instead of S3A should improve 
performance. To test this idea, I swapped in PrestoFileSystem for 
S3AFileSystem. For queries with partition filters, PrestoFileSystem was much 
faster than S3AFileSystem.

Here are my benchmark results for the following queries:
{code}
-- 1 partition
select count(*) from lineitem where l_shipdate = '1992-01-02';
-- 30 partitions
select count(*) from lineitem
  where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
-- 90 partitions
select count(*) from lineitem
  where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
-- 151 partitions
select count(*) from lineitem
  where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
{code}

||Number of partitions||PrestoFileSystem (ms)||S3AFileSystem (ms)||
|1|677|800|
|30|2753|6977|
|90|6825|13772|
|151|13834|25701|

For reference, I used the TPC-H 1 GB dataset and made the {{l_shipdate}} 
column of the {{lineitem}} table the partition column.

I see two ways to implement this:
- Borrow PrestoFileSystem and the related code from Presto
- Implement the necessary code in S3TableSpace, using Presto as a reference

However, the first approach would introduce a bigger problem than the one we 
are trying to solve, so I would like to take the second approach. What do you 
think?


> Use list S3 files using AmazonS3Client instead of using S3A
> -----------------------------------------------------------
>
>                 Key: TAJO-2030
>                 URL: https://issues.apache.org/jira/browse/TAJO-2030
>             Project: Tajo
>          Issue Type: Sub-task
>          Components: S3
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>             Fix For: 0.12.0
>
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input 
> paths as a parameter and returns all the objects whose keys start with that 
> common prefix, in blocks of up to 1000.
> Using AmazonS3Client to list S3 files instead of S3A should improve 
> performance. To test this idea, I swapped in PrestoFileSystem for 
> S3AFileSystem. For queries with partition filters, PrestoFileSystem was 
> much faster than S3AFileSystem.
> Here are my benchmark results for the following queries:
> {code}
> -- 1 partition
> select count(*) from lineitem where l_shipdate = '1992-01-02';
> -- 30 partitions
> select count(*) from lineitem
>   where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
> -- 90 partitions
> select count(*) from lineitem
>   where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
> -- 151 partitions
> select count(*) from lineitem
>   where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
> {code}
> ||Number of partitions||PrestoFileSystem (ms)||S3AFileSystem (ms)||
> |1|677|800|
> |30|2753|6977|
> |90|6825|13772|
> |151|13834|25701|
> For reference, I used the TPC-H 1 GB dataset and made the {{l_shipdate}} 
> column of the {{lineitem}} table the partition column.
> I see two ways to implement this:
> - Borrow PrestoFileSystem and the related code from Presto
> - Implement the necessary code in S3TableSpace, using Presto as a reference



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
