[ https://issues.apache.org/jira/browse/IMPALA-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sailesh Mukil resolved IMPALA-3482.
-----------------------------------
       Resolution: Duplicate
    Fix Version/s: Impala 2.9.0

Fixed by IMPALA-4172

> S3: Consider bulk listing of files in the catalog vs individually accessing
> them
> --------------------------------------------------------------------------------
>
>                 Key: IMPALA-3482
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3482
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.6.0
>            Reporter: Sailesh Mukil
>            Assignee: Sailesh Mukil
>            Priority: Major
>              Labels: catalog-server, performance, s3
>             Fix For: Impala 2.9.0
>
>         Attachments: invalidate_cs_3.jfr
>
>
> The following query creates 2.4K partitions when using the tpch_300_parquet
> dataset:
> insert into table tmps3db.orders_part partition(o_orderdate) select
> O_ORDERKEY,
> O_CUSTKEY,
> O_ORDERSTATUS,
> O_TOTALPRICE,
> O_ORDERPRIORITY,
> O_CLERK,
> O_SHIPPRIORITY,
> O_COMMENT,
> O_ORDERDATE
> from
> tpch_300_parquet.orders
> If we skip the staging step (see IMPALA-3452), the INSERTs themselves
> complete in less than 2 minutes. However, ~21 minutes is spent in the catalog
> after the INSERT to update all the partition information. This is because the
> catalog makes multiple individual requests per file per partition to S3. S3
> has protection mechanisms that detect and slow down many individual requests
> coming from a single IP:
> (http://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html)
> We should consider listing files in the parent directory of the partitions in
> batches of 1000 (see
> http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html for why we
> choose 1000) so that the number of requests to S3 is minimized, latency stays
> steady, and we make better use of bandwidth.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
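
The bulk-listing idea from the issue description can be illustrated with a small sketch. This is not Impala's catalog code and it does not call the AWS SDK: FakeObjectStore, head, list_page and list_files_by_partition are hypothetical names, and the store is an in-memory stand-in that only counts requests. It assumes one file per partition and made-up partition names. The only S3 fact relied on is that a single list request returns at most 1000 keys.

```python
from collections import defaultdict
from posixpath import dirname

PAGE_SIZE = 1000  # S3 returns at most 1000 keys per list request


class FakeObjectStore:
    """In-memory stand-in for a bucket that counts requests (hypothetical)."""

    def __init__(self, keys):
        self._sorted_keys = sorted(keys)
        self._key_set = set(keys)
        self.requests = 0

    def head(self, key):
        """One request per object, like an individual per-file lookup."""
        self.requests += 1
        return key in self._key_set

    def list_page(self, prefix, start_after="", max_keys=PAGE_SIZE):
        """One request returning up to max_keys keys under prefix, in key order."""
        self.requests += 1
        matches = [k for k in self._sorted_keys
                   if k.startswith(prefix) and k > start_after]
        return matches[:max_keys]


def list_files_by_partition(store, table_prefix):
    """List the table's parent directory page by page, grouping by partition."""
    files = defaultdict(list)
    last_key = ""
    while True:
        page = store.list_page(table_prefix, start_after=last_key)
        for key in page:
            files[dirname(key)].append(key)
        if len(page) < PAGE_SIZE:
            # A short (or empty) page means there is nothing left to list.
            return dict(files)
        last_key = page[-1]


if __name__ == "__main__":
    # Assumed layout: 2400 partitions, one file each (names are made up).
    keys = [f"orders_part/o_orderdate={i:04d}/data.parq" for i in range(2400)]

    per_file = FakeObjectStore(keys)
    for key in keys:
        per_file.head(key)

    bulk = FakeObjectStore(keys)
    partitions = list_files_by_partition(bulk, "orders_part/")

    print("per-file requests:", per_file.requests)
    print("bulk-list requests:", bulk.requests, "partitions:", len(partitions))
```

With this assumed layout, the per-file path issues one request per object (2400 in total), while the paged listing needs three requests to cover the same 2400 keys. The real catalog makes more than one request per file, so the gap described in the issue would be larger than this sketch suggests.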