Jaehwa Jung created TAJO-2111:
---------------------------------
Summary: Optimize Partition Table Split Computation for Amazon S3
Key: TAJO-2111
URL: https://issues.apache.org/jira/browse/TAJO-2111
Project: Tajo
Issue Type: Sub-task
Components: S3, Storage
Reporter: Jaehwa Jung
Assignee: Jaehwa Jung
Currently, split computation for a partitioned table proceeds as follows:
* List all partition directories of the specified partitioned table
* List all files in each partition directory
For example, assume a table with 1000 partitions where each partition includes
10 files. In this case, the AWS S3 API will be called 1000 times, which becomes
a huge bottleneck.
To improve the current computation, we should use {{S3::listObjects}} and
implement the following algorithm to efficiently list multiple input locations:
* Given a list of S3 locations, apply prefix listing to a common prefix to get
the metadata of 1000 objects at a time.
* While applying prefix listing in the above step, skip those objects that do
not fall under the input list of S3 locations, to avoid listing a large number
of irrelevant objects in pathogenic cases.
This approach is inspired by Qubole's blog post:
https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)