[
https://issues.apache.org/jira/browse/IMPALA-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Armstrong resolved IMPALA-2753.
-----------------------------------
Resolution: Won't Fix
This isn't really feasible to fix with Hive's traditional partitioning scheme,
but it looks like there is a better solution for Iceberg tables.
> Investigate performance gains for adding random prefix to file name
> -------------------------------------------------------------------
>
> Key: IMPALA-2753
> URL: https://issues.apache.org/jira/browse/IMPALA-2753
> Project: IMPALA
> Issue Type: Sub-task
> Components: Perf Investigation
> Affects Versions: Impala 2.5.0
> Reporter: Mostafa Mokhtar
> Priority: Minor
> Labels: s3
>
> One thing I noticed, which is not directly related to Impala, is that the file
> naming convention HDFS produces is the anti-pattern of what S3 recommends.
> If we do a trick with the naming convention, we can one-up Hive when running
> on S3.
> {code}
> examplebucket/2013-26-05-15-00-00/cust1234234/photo1.jpg
> examplebucket/2013-26-05-15-00-00/cust3857422/photo2.jpg
> examplebucket/2013-26-05-15-00-00/cust8474937/photo2.jpg
> examplebucket/2013-26-05-15-00-00/cust1248473/photo3.jpg
> ...
> examplebucket/2013-26-05-15-00-01/cust1248473/photo4.jpg
> examplebucket/2013-26-05-15-00-01/cust1248473/photo5.jpg
> examplebucket/2013-26-05-15-00-01/cust1248473/photo6.jpg
> examplebucket/2013-26-05-15-00-01/cust1248473/photo7.jpg
> ...
> {code}
> The sequence pattern in the key names introduces a performance problem. To
> understand the issue, let’s look at how Amazon S3 stores key names.
> Amazon S3 maintains an index of object key names in each AWS region. Object
> keys are stored lexicographically across multiple partitions in the index.
> That is, Amazon S3 stores key names in alphabetical order. The key name
> dictates which partition the key is stored in. Using a sequential prefix,
> such as timestamp or an alphabetical sequence, increases the likelihood that
> Amazon S3 will target a specific partition for a large number of your keys,
> overwhelming the I/O capacity of the partition. If you introduce some
> randomness in your key name prefixes, the key names, and therefore the I/O
> load, will be distributed across more than one partition.
> If you anticipate that your workload will consistently exceed 100 requests
> per second, you should avoid sequential key names. If you must use sequential
> numbers or date and time patterns in key names, add a random prefix to the
> key name. The randomness of the prefix more evenly distributes key names
> across multiple index partitions. Examples of introducing randomness are
> provided later in this topic.
> http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
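
For illustration only, here is a minimal sketch of the prefixing idea from the
quoted S3 guidance. It is not Impala or Hive code. The function name
add_hash_prefix, the 4-character prefix length, and the choice of a
SHA-256-derived prefix instead of a truly random one are all assumptions made
for this example; a hash-derived prefix is used so that the same key always
maps to the same prefixed name.

```python
import hashlib


def add_hash_prefix(key: str, prefix_len: int = 4) -> str:
    """Return `key` with a short hex prefix derived from its own hash.

    Hypothetical helper: the prefix is deterministic, so a reader that
    knows the original key can recompute the prefixed name.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{key}"


# Sequential, timestamp-led keys like the ones in the issue description.
keys = [
    "2013-26-05-15-00-00/cust1234234/photo1.jpg",
    "2013-26-05-15-00-00/cust3857422/photo2.jpg",
    "2013-26-05-15-00-01/cust1248473/photo4.jpg",
    "2013-26-05-15-00-01/cust1248473/photo5.jpg",
]

# The prefixed names no longer share a common leading timestamp, so they
# are not guaranteed to sort next to each other lexicographically.
prefixed = [add_hash_prefix(k) for k in keys]
for name in sorted(prefixed):
    print(name)
```

Note that the prefix changes the leading characters of every key, which is why
this does not fit a layout where the directory path itself encodes the
partition values.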
--
This message was sent by Atlassian Jira
(v8.3.4#803005)