[
https://issues.apache.org/jira/browse/TAJO-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225601#comment-15225601
]
ASF GitHub Bot commented on TAJO-2111:
--------------------------------------
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/994#issuecomment-205623205
Here are my benchmark results.
# Configuration
* EC2 instance type: c3.xlarge
* Cluster: 1 master, 3 workers
* Dataset: TPC-H (factor = 1)
* Partition table schema
```
CREATE EXTERNAL TABLE lineitem_p (l_orderkey INT8, l_partkey INT8,
l_suppkey INT8, l_linenumber INT8, l_quantity FLOAT8, l_extendedprice FLOAT8,
l_discount FLOAT8, l_tax FLOAT8, l_returnflag TEXT, l_linestatus TEXT,
l_commitdate TEXT, l_receiptdate TEXT, l_shipinstruct TEXT, l_shipmode TEXT,
l_comment TEXT)
USING TEXT WITH ('text.delimiter'='|')
PARTITION BY COLUMN(l_shipdate TEXT)
LOCATION 's3://Xyz';
```
* Number of partitions in the ``lineitem_p`` table: 2526 (each partition
includes just one file)
# Queries
* Q1: `` select * from lineitem_p limit 5; ``
* Q2: `` select count(*) from lineitem_p; ``
* Q3: `` select count(*) from lineitem_p where l_shipdate > '1994-09-25'
and l_shipdate < '1994-10-10'; ``
# Query Execution Time
Query | Not Optimized | Optimized | Improvement
-------------------|----------------------|--------------------------|-------------------
Q1 | 573.425 sec | 4.228 sec | 135.6x
Q2 | 653.175 sec | 33.444 sec | 19.5x
Q3 | 4.099 sec | 2.429 sec | 1.6x
# Split Computation Time
Query | Not Optimized | Optimized | Improvement
-------------------|----------------------|--------------------------|-------------------
Q1 | 572921 ms | 2233 ms | 256.5x
Q2 | 599437 ms | 701 ms | 855.1x
Q3 | 2537 ms | 388 ms | 6.5x
> Optimize Partition Table Split Computation for Amazon S3
> --------------------------------------------------------
>
> Key: TAJO-2111
> URL: https://issues.apache.org/jira/browse/TAJO-2111
> Project: Tajo
> Issue Type: Sub-task
> Components: S3, Storage
> Reporter: Jaehwa Jung
> Assignee: Jaehwa Jung
>
> Currently, split computation for a partitioned table proceeds as follows:
> * Listing all partition directories of the specified partitioned table
> * Listing all files in each partition directory
> For example, assume a table with 1000 partitions where each partition
> includes 10 files. In that case, the AWS S3 API will be called 1000 times,
> which becomes a huge bottleneck.
> To improve current computation, we have to use {{S3::listObjects}} and
> implement the following algorithm to efficiently list multiple input
> locations:
> * Given a list of S3 locations, apply prefix listing to a common prefix to
> get the metadata of 1000 objects at a time.
> * While applying prefix listing in the above step, skip those objects that do
> not fall under the input list of S3 locations, to avoid listing a large
> number of irrelevant objects in pathogenic cases.
> This approach was inspired by Qubole's blog post:
> https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/
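The two steps above (bulk listing under a common prefix, then filtering out irrelevant objects) can be sketched as follows. This is a hypothetical, self-contained illustration rather than Tajo's actual implementation: the S3 `listObjects` call is stubbed with an in-memory key list, and all class and method names are invented for the example.

```java
import java.util.*;
import java.util.stream.*;

public class PrefixListingSketch {

    // Step 1 helper: find the longest common prefix of the partition
    // locations, so a single bulk listing can cover all of them.
    static String commonPrefix(List<String> locations) {
        String prefix = locations.get(0);
        for (String loc : locations) {
            while (!loc.startsWith(prefix)) {
                prefix = prefix.substring(0, prefix.length() - 1);
            }
        }
        return prefix;
    }

    // Step 2: given the keys returned by one bulk listing under the common
    // prefix, keep only keys whose parent directory is one of the requested
    // partition directories, discarding irrelevant objects.
    static List<String> listPartitions(List<String> partitionDirs,
                                       List<String> keysUnderPrefix) {
        Set<String> wanted = new HashSet<>(partitionDirs);
        return keysUnderPrefix.stream()
            .filter(key -> {
                int slash = key.lastIndexOf('/');
                return slash > 0 && wanted.contains(key.substring(0, slash));
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList(
            "lineitem_p/l_shipdate=1994-09-26",
            "lineitem_p/l_shipdate=1994-09-27");
        // Stand-in for the result of a single bulk listing with the common
        // prefix; a real implementation would call the S3 list API here.
        List<String> listed = Arrays.asList(
            "lineitem_p/l_shipdate=1994-09-25/part-0000",
            "lineitem_p/l_shipdate=1994-09-26/part-0000",
            "lineitem_p/l_shipdate=1994-09-27/part-0000");
        System.out.println(commonPrefix(dirs)); // lineitem_p/l_shipdate=1994-09-2
        System.out.println(listPartitions(dirs, listed));
    }
}
```

With this shape, N partition directories cost one (paginated) listing under the shared prefix instead of N per-directory listings, which is the source of the speedups in the tables above.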
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)