Jaehwa Jung created TAJO-2111:
---------------------------------

             Summary: Optimize Partition Table Split Computation for Amazon S3
                 Key: TAJO-2111
                 URL: https://issues.apache.org/jira/browse/TAJO-2111
             Project: Tajo
          Issue Type: Sub-task
          Components: S3, Storage
            Reporter: Jaehwa Jung
            Assignee: Jaehwa Jung


Currently, split computation for a partitioned table proceeds as follows:

* Listing all partition directories of the specified partitioned table
* Listing all files in each partition directory

For example, assume a table with 1,000 partitions where each partition includes 
10 files. In this case, the AWS S3 API will be called 1,000 times, which 
becomes a huge bottleneck.
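As a rough illustration of the bottleneck (not actual Tajo code; {{countListCalls}} and the {{lister}} callback are hypothetical stand-ins for the real S3 list API), the current approach issues one request per partition directory, so the request count grows linearly with the partition count regardless of how few files each partition holds:

```java
import java.util.List;
import java.util.function.Function;

public class NaiveSplitListing {

    /**
     * Simulates the current split computation: one listing request
     * per partition directory. Returns the number of S3 API calls issued.
     */
    static int countListCalls(List<String> partitionDirs,
                              Function<String, List<String>> lister) {
        int calls = 0;
        for (String dir : partitionDirs) {
            lister.apply(dir);  // one S3 API round trip per partition
            calls++;
        }
        return calls;
    }
}
```

With 1,000 partition directories this issues 1,000 listing calls, matching the scenario above.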

To improve the current computation, we should use {{S3::listObjects}} and 
implement the following algorithm to efficiently list multiple input locations:
* Given a list of S3 locations, apply prefix listing to their common prefix to 
get the metadata of up to 1,000 objects per request.
* While applying prefix listing in the above step, skip objects that do not 
fall under the input list of S3 locations, to avoid listing a large number of 
irrelevant objects in pathological cases.
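The two steps above can be sketched in plain Java. This is only a sketch under assumptions: {{commonPrefix}} and {{filterRelevant}} are hypothetical helper names, and the real implementation would pass the computed prefix to {{S3::listObjects}} and page through up to 1,000 keys per response, filtering as it goes:

```java
import java.util.ArrayList;
import java.util.List;

public class S3BulkListing {

    /**
     * Computes the longest common prefix of the input S3 locations.
     * This prefix would be handed to S3::listObjects so a single
     * paged listing covers all locations at once.
     */
    static String commonPrefix(List<String> locations) {
        String prefix = locations.get(0);
        for (String loc : locations) {
            int i = 0;
            int max = Math.min(prefix.length(), loc.length());
            while (i < max && prefix.charAt(i) == loc.charAt(i)) i++;
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }

    /**
     * Skips objects returned by the prefix listing that do not fall
     * under any of the input locations, so irrelevant siblings under
     * the common prefix are never materialized as splits.
     */
    static List<String> filterRelevant(List<String> objectKeys,
                                       List<String> locations) {
        List<String> relevant = new ArrayList<>();
        for (String key : objectKeys) {
            for (String loc : locations) {
                if (key.startsWith(loc)) {
                    relevant.add(key);
                    break;
                }
            }
        }
        return relevant;
    }
}
```

For instance, for partitions {{warehouse/t1/key=1/}} and {{warehouse/t1/key=2/}} the common prefix is {{warehouse/t1/key=}}, so one paged listing replaces one call per partition, and keys under unrelated directories are dropped by the filter.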

This approach is inspired by Qubole's blog post: 
https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/.
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
