[ 
https://issues.apache.org/jira/browse/TAJO-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225601#comment-15225601
 ] 

ASF GitHub Bot commented on TAJO-2111:
--------------------------------------

Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/994#issuecomment-205623205
  
    Here are my benchmark results.
    
    # Configuration
    
    * EC2 instance type : c3.xlarge
    * Cluster: 1 master, 3 workers
    * Dataset: TPC-H (factor = 1)
    * Partition table schema
    ```
    CREATE EXTERNAL TABLE lineitem_p (
      l_orderkey INT8, l_partkey INT8, l_suppkey INT8, l_linenumber INT8,
      l_quantity FLOAT8, l_extendedprice FLOAT8, l_discount FLOAT8, l_tax FLOAT8,
      l_returnflag TEXT, l_linestatus TEXT, l_commitdate text, l_receiptdate text,
      l_shipinstruct TEXT, l_shipmode TEXT, l_comment TEXT)
    USING TEXT WITH ('text.delimiter'='|')
    PARTITION BY COLUMN(l_shipdate text)
    LOCATION 's3://Xyz';
    ```
    * Number of partitions in the ``lineitem`` table: 2526 (each partition contains just one file)
    
    # Queries
    * Q1: `` select * from lineitem_p limit 5; ``
    * Q2: `` select count(*) from lineitem_p; ``
    * Q3: `` select count(*) from lineitem_p where l_shipdate > '1994-09-25' and l_shipdate < '1994-10-10'; ``
    
    # Query Execution Time
    
    Query | Not Optimized | Optimized | Improvement
    ------|---------------|-----------|------------
    Q1 | 573.425 sec | 4.228 sec | 135.6x
    Q2 | 653.175 sec | 33.444 sec | 19.5x
    Q3 | 4.099 sec | 2.429 sec | 1.6x
    
    
    # Split Computation Time
    
    Query | Not Optimized | Optimized | Improvement
    ------|---------------|-----------|------------
    Q1 | 572921 ms | 2233 ms | 256.5x
    Q2 | 599437 ms | 701 ms | 855.1x
    Q3 | 2537 ms | 388 ms | 6.5x
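
The Improvement columns appear to be the unoptimized time divided by the optimized time, truncated (not rounded) to one decimal place; for example, Q3 execution time works out to roughly 1.69 and is listed as 1.6x. A small sketch to reproduce the figures, where the helper name `speedup` is made up for illustration and is not part of the pull request:

```python
import math


def speedup(before, after):
    """Return before/after truncated (not rounded) to one decimal place.

    Truncation is an assumption inferred from the reported figures.
    """
    return math.floor(before / after * 10) / 10


# Query execution time (seconds) and split computation time (milliseconds),
# copied from the tables above.
execution = {"Q1": (573.425, 4.228), "Q2": (653.175, 33.444), "Q3": (4.099, 2.429)}
split = {"Q1": (572921, 2233), "Q2": (599437, 701), "Q3": (2537, 388)}

for name, table in (("execution", execution), ("split", split)):
    for query, (before, after) in table.items():
        print(f"{name} {query}: {speedup(before, after)}x")
```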


> Optimize Partition Table Split Computation for Amazon S3
> --------------------------------------------------------
>
>                 Key: TAJO-2111
>                 URL: https://issues.apache.org/jira/browse/TAJO-2111
>             Project: Tajo
>          Issue Type: Sub-task
>          Components: S3, Storage
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>
> Currently, split computation for a partitioned table proceeds as follows:
> * List all partition directories of the specified partitioned table.
> * List all files in each partition directory.
> For example, assume a table with 1000 partitions, each containing 10 files. 
> In this case, the AWS S3 API will be called 1000 times, which becomes a huge 
> bottleneck.
> To improve the current computation, we should use {{S3::listObjects}} and 
> implement the following algorithm to list multiple input locations 
> efficiently:
> * Given a list of S3 locations, apply prefix listing to their common prefix 
> to get the metadata of 1000 objects at a time.
> * While applying prefix listing in the above step, skip objects that do not 
> fall under the input list of S3 locations, to avoid listing a large number 
> of irrelevant objects in pathological cases.
> I was inspired by the following Qubole blog post: 
> https://www.qubole.com/blog/product/optimizing-s3-bulk-listings-for-performant-hive-queries/.
>  
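
The two steps in the description can be sketched roughly as follows. This is a minimal Python illustration, not Tajo's actual implementation: `FakeBucket`, `list_page`, and `bulk_list` are hypothetical names, and the bucket is an in-memory stand-in that only mimics the prefix, marker, and 1000-keys-per-request behaviour of S3's list-objects call.

```python
import bisect
import os

PAGE_SIZE = 1000  # S3 returns at most 1000 keys per list request


class FakeBucket:
    """In-memory stand-in for an S3 bucket (hypothetical, for illustration)."""

    def __init__(self, keys):
        self.keys = sorted(keys)  # S3 lists keys in sorted order (ASCII keys assumed)
        self.calls = 0            # number of simulated list requests

    def list_page(self, prefix, marker, max_keys=PAGE_SIZE):
        """Return (keys, truncated): keys under prefix that sort after marker."""
        self.calls += 1
        i = max(bisect.bisect_right(self.keys, marker),
                bisect.bisect_left(self.keys, prefix))
        page = []
        while (i < len(self.keys) and self.keys[i].startswith(prefix)
               and len(page) < max_keys):
            page.append(self.keys[i])
            i += 1
        truncated = i < len(self.keys) and self.keys[i].startswith(prefix)
        return page, truncated


def bulk_list(bucket, locations):
    """List all keys under the given locations using common-prefix listing."""
    locs = sorted(loc.rstrip("/") + "/" for loc in locations)
    prefix = os.path.commonprefix(locs)  # character-wise common prefix
    found, marker, idx = [], "", 0
    while idx < len(locs):
        page, truncated = bucket.list_page(prefix, marker)
        jumped = False
        for key in page:
            # Drop locations that sort entirely before the current key.
            while (idx < len(locs) and locs[idx] < key
                   and not key.startswith(locs[idx])):
                idx += 1
            if idx == len(locs):
                break  # every requested location has been passed
            if key.startswith(locs[idx]):
                found.append(key)
            else:
                # Irrelevant key: skip ahead to the next requested location
                # instead of paging through everything in between.
                marker = locs[idx]
                jumped = True
                break
        if not jumped:
            if not truncated:
                break
            marker = page[-1]
    return found
```

For the example in the description (1000 partitions with 10 files each), listing every partition this way takes 10 list requests in this simulation, because each request returns up to 1000 keys. The skip step matters when only a few partitions are requested: the marker jumps directly to the next requested location, so unrelated partitions are not paged through.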



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
