[
https://issues.apache.org/jira/browse/TEZ-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-2496:
----------------------------------
Attachment: (was: TEZ-2496.9.branch-0.7.patch)
> Consider scheduling tasks in ShuffleVertexManager based on the partition
> sizes from the source
> ----------------------------------------------------------------------------------------------
>
> Key: TEZ-2496
> URL: https://issues.apache.org/jira/browse/TEZ-2496
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Fix For: 0.8.0-alpha
>
> Attachments: TEZ-2496.1.patch, TEZ-2496.2.patch, TEZ-2496.3.patch,
> TEZ-2496.4.patch, TEZ-2496.5.patch, TEZ-2496.6.patch, TEZ-2496.7.patch,
> TEZ-2496.8.patch, TEZ-2496.9.addendum.patch,
> TEZ-2496.9.branch-0.7.patch, TEZ-2496.9.patch
>
>
> Consider scheduling tasks in ShuffleVertexManager based on the partition
> sizes from the source. This would be helpful in scenarios where resources
> are limited (or concurrent jobs are running, or there are multiple waves)
> with data skew, and the task that receives a large amount of data gets
> scheduled much later.
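> A minimal sketch of the ordering idea (plain Java, not the actual patch;
> the partitionSizes map is a hypothetical stand-in for whatever
> per-partition byte counts ShuffleVertexManager aggregates from the
> source tasks):
> {noformat}
> import java.util.List;
> import java.util.Map;
> import java.util.stream.Collectors;
>
> public class SkewAwareOrdering {
>
>   /**
>    * Orders pending reduce task (partition) indices so that tasks whose
>    * input partitions are largest get scheduled first. partitionSizes
>    * maps a partition index to the total bytes reported for it by the
>    * upstream (source) tasks.
>    */
>   static List<Integer> orderBySize(Map<Integer, Long> partitionSizes) {
>     return partitionSizes.entrySet().stream()
>         .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
>         .map(Map.Entry::getKey)
>         .collect(Collectors.toList());
>   }
>
>   public static void main(String[] args) {
>     // Partition 3 is heavily skewed; size-based ordering hands it to
>     // the first wave instead of leaving it for the last one.
>     Map<Integer, Long> sizes = Map.of(0, 10L, 1, 12L, 2, 9L, 3, 500L);
>     System.out.println(orderBySize(sizes)); // prints [3, 1, 0, 2]
>   }
> }
> {noformat}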
> e.g. Consider the following Hive query running in a queue with limited
> capacity (42 slots in total) @ 200 GB scale:
> {noformat}
> CREATE TEMPORARY TABLE sampleData AS
> SELECT CASE
>            WHEN ss_sold_time_sk IS NULL THEN 70429
>            ELSE ss_sold_time_sk
>        END AS ss_sold_time_sk,
>        ss_item_sk,
>        ss_customer_sk,
>        ss_cdemo_sk,
>        ss_hdemo_sk,
>        ss_addr_sk,
>        ss_store_sk,
>        ss_promo_sk,
>        ss_ticket_number,
>        ss_quantity,
>        ss_wholesale_cost,
>        ss_list_price,
>        ss_sales_price,
>        ss_ext_discount_amt,
>        ss_ext_sales_price,
>        ss_ext_wholesale_cost,
>        ss_ext_list_price,
>        ss_ext_tax,
>        ss_coupon_amt,
>        ss_net_paid,
>        ss_net_paid_inc_tax,
>        ss_net_profit,
>        ss_sold_date_sk
> FROM store_sales distribute by ss_sold_time_sk;
> {noformat}
> This generated 39 maps and 134 reduce slots (3 reduce waves). When there
> are lots of nulls for ss_sold_time_sk, the data tends to skew towards
> 70429. If the reducer that receives this data were scheduled much earlier
> (i.e. in the first wave itself), the entire job would finish faster.
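> As a rough cost model (wave time t and skewed-task time T are assumed
> here, not measured): with w reduce waves, leaving the skewed task for
> the last wave costs roughly
> {noformat}
> skewed task in last wave:   total ≈ (w - 1) * t + T
> skewed task in first wave:  total ≈ max(w * t, T) ≈ T   (since T >> t)
> {noformat}
> so pulling the large partition into the first wave saves about
> (w - 1) * t of job runtime.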
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)