[jira] [Resolved] (IMPALA-3482) S3: Consider bulk listing of files in the catalog vs individually accessing them

Sailesh Mukil (JIRA) Tue, 10 Apr 2018 09:56:34 -0700

     [ 
https://issues.apache.org/jira/browse/IMPALA-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sailesh Mukil resolved IMPALA-3482.
-----------------------------------
       Resolution: Duplicate
    Fix Version/s: Impala 2.9.0

Fixed by IMPALA-4172

> S3: Consider bulk listing of files in the catalog vs individually accessing 
> them
> --------------------------------------------------------------------------------
>
>                 Key: IMPALA-3482
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3482
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.6.0
>            Reporter: Sailesh Mukil
>            Assignee: Sailesh Mukil
>            Priority: Major
>              Labels: catalog-server, performance, s3
>             Fix For: Impala 2.9.0
>
>         Attachments: invalidate_cs_3.jfr
>
>
> The following query creates 2.4K partitions when using the tpch_300_parquet 
> dataset:
> insert into table tmps3db.orders_part partition(o_orderdate) select 
>     O_ORDERKEY,
>     O_CUSTKEY,
>     O_ORDERSTATUS,
>     O_TOTALPRICE,
>     O_ORDERPRIORITY,
>     O_CLERK,
>     O_SHIPPRIORITY,
>     O_COMMENT,
>     O_ORDERDATE
> from
>     tpch_300_parquet.orders
> If we skip the staging step (see IMPALA-3452), the INSERTs themselves 
> complete in less than 2 minutes. However, the catalog then spends ~21 minutes 
> after the INSERT updating all the partition information. This is because the 
> catalog makes multiple individual requests to S3 per file per partition, and 
> S3 has protection mechanisms that detect and slow down many individual 
> requests coming from a single IP 
> (http://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html).
> We should consider listing the files under the parent directory of the 
> partitions in batches of 1000 (see 
> http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html for why we 
> choose 1000). This minimizes the number of requests to S3, gives steadier 
> latency, and makes better use of bandwidth.
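
The batched listing proposed above can be sketched roughly as follows. This is
only an illustration, not the change made in IMPALA-4172: list_page is a
hypothetical in-memory stand-in for a single S3 list request (it is not a real
S3 or Impala API), the key layout is invented, and the figure of 2 files per
partition is an assumption chosen to make the request count easy to follow.

```python
from collections import defaultdict

PAGE_SIZE = 1000  # batch size proposed in the issue description


def list_page(store, prefix, marker, max_keys=PAGE_SIZE):
    """Hypothetical stand-in for one S3 list request: returns up to max_keys
    keys under prefix that sort after marker, plus a truncation flag."""
    matching = sorted(k for k in store if k.startswith(prefix) and k > marker)
    page = matching[:max_keys]
    return page, len(matching) > max_keys


def bulk_list_partitions(store, table_prefix):
    """List every file under table_prefix page by page, grouped by partition."""
    files_by_partition = defaultdict(list)
    marker, requests = "", 0
    while True:
        page, truncated = list_page(store, table_prefix, marker)
        requests += 1
        for key in page:
            # Everything before the last "/" is the partition directory.
            partition_dir, _, file_name = key.rpartition("/")
            files_by_partition[partition_dir].append(file_name)
        if not truncated:
            break
        marker = page[-1]  # resume after the last key already seen
    return dict(files_by_partition), requests


# 2,400 partitions as in the report; 2 files per partition is an assumption.
store = {
    f"orders_part/o_orderdate={p:04d}/file{f}.parq"
    for p in range(2400)
    for f in range(2)
}
partitions, requests = bulk_list_partitions(store, "orders_part/")
print(len(partitions), "partitions listed in", requests, "requests")
```

With these assumed numbers, the 4,800 files are covered by 5 list requests,
whereas touching each file individually would take at least 4,800 requests.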



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
