[ 
https://issues.apache.org/jira/browse/IMPALA-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-2753.
-----------------------------------
    Resolution: Won't Fix

This isn't really feasible to fix with Hive's traditional partitioning scheme,
but it looks like there is a better solution for Iceberg tables.

> Investigate performance gains for adding random prefix to file name
> -------------------------------------------------------------------
>
>                 Key: IMPALA-2753
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2753
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Perf Investigation
>    Affects Versions: Impala 2.5.0
>            Reporter: Mostafa Mokhtar
>            Priority: Minor
>              Labels: s3
>
> One thing I noticed, which is not directly related to Impala, is that the
> file naming convention HDFS produces is the anti-pattern of what S3
> recommends. If we do a trick with the naming convention, we can one-up Hive
> when running on S3.
> {code}
> examplebucket/2013-26-05-15-00-00/cust1234234/photo1.jpg
> examplebucket/2013-26-05-15-00-00/cust3857422/photo2.jpg
> examplebucket/2013-26-05-15-00-00/cust8474937/photo2.jpg
> examplebucket/2013-26-05-15-00-00/cust1248473/photo3.jpg
> ...
> examplebucket/2013-26-05-15-00-01/cust1248473/photo4.jpg
> examplebucket/2013-26-05-15-00-01/cust1248473/photo5.jpg
> examplebucket/2013-26-05-15-00-01/cust1248473/photo6.jpg
> examplebucket/2013-26-05-15-00-01/cust1248473/photo7.jpg    
> ...
> {code}
> The sequence pattern in the key names introduces a performance problem. To
> understand the issue, let’s look at how Amazon S3 stores key names.
>
> Amazon S3 maintains an index of object key names in each AWS region. Object
> keys are stored lexicographically across multiple partitions in the index.
> That is, Amazon S3 stores key names in alphabetical order. The key name
> dictates which partition the key is stored in. Using a sequential prefix,
> such as timestamp or an alphabetical sequence, increases the likelihood that
> Amazon S3 will target a specific partition for a large number of your keys,
> overwhelming the I/O capacity of the partition. If you introduce some
> randomness in your key name prefixes, the key names, and therefore the I/O
> load, will be distributed across more than one partition.
>
> If you anticipate that your workload will consistently exceed 100 requests
> per second, you should avoid sequential key names. If you must use sequential
> numbers or date and time patterns in key names, add a random prefix to the
> key name. The randomness of the prefix more evenly distributes key names
> across multiple index partitions. Examples of introducing randomness are
> provided later in this topic.
> http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
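
For illustration, the "random prefix" idea from the quoted text can be sketched
as below. This is only a minimal sketch, not how Impala or Hive name files: the
function name add_hash_prefix is hypothetical, and the use of MD5 with a
4-character hex prefix is an assumption chosen for the example. The prefix is
derived from a hash of the key rather than from a random number generator so
that the same key always maps to the same prefixed name.

```python
import hashlib


def add_hash_prefix(key: str, prefix_len: int = 4) -> str:
    """Return `key` with a short hex prefix derived from its MD5 digest.

    Hypothetical helper: the prefix spreads otherwise sequential key names
    across the lexicographic key space, and stays deterministic so the
    prefixed name can be recomputed from the original key.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{key}"


if __name__ == "__main__":
    # Keys taken from the example in the issue description (bucket name omitted).
    keys = [
        "2013-26-05-15-00-00/cust1234234/photo1.jpg",
        "2013-26-05-15-00-00/cust3857422/photo2.jpg",
        "2013-26-05-15-00-01/cust1248473/photo4.jpg",
    ]
    for key in keys:
        print(add_hash_prefix(key))
```

Note that the prefix goes in front of the whole key, so keys that share a
date/time "directory" no longer sort next to each other. That is likely at odds
with a layout that derives partitions from directory paths, which seems
consistent with the resolution above.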



--
This message was sent by Atlassian Jira
(v8.3.4#803005)