Hari Sankar Sivarama Subramaniyan created HIVE-7751:
-------------------------------------------------------

             Summary: Mapjoin set in a non-conditional task can fail in MR 
mode because of memory overhead issues
                 Key: HIVE-7751
                 URL: https://issues.apache.org/jira/browse/HIVE-7751
             Project: Hive
          Issue Type: Bug
            Reporter: Hari Sankar Sivarama Subramaniyan
            Assignee: Hari Sankar Sivarama Subramaniyan


select sum(ss_quantity)
from store_sales
  join store on store.s_store_sk = store_sales.ss_store_sk
  join customer_demographics
    on customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk
  join customer_address
    on store_sales.ss_addr_sk = customer_address.ca_address_sk
  join date_dim on store_sales.ss_sold_date_sk = date_dim.d_date_sk
where d_year = 2000
  and ((cd_marital_status = 'M' and cd_education_status = 'Advanced Degree'
        and ss_sales_price between 100.00 and 150.00)
    or (cd_marital_status = 'M' and cd_education_status = 'Advanced Degree'
        and ss_sales_price between 50.00 and 100.00)
    or (cd_marital_status = 'M' and cd_education_status = 'Advanced Degree'
        and ss_sales_price between 150.00 and 200.00))
  and ((ca_country = 'United States' and ca_state in ('TX', 'OH', 'TX')
        and ss_net_profit between 0 and 2000)
    or (ca_country = 'United States' and ca_state in ('OR', 'MN', 'KY')
        and ss_net_profit between 150 and 3000)
    or (ca_country = 'United States' and ca_state in ('VA', 'TX', 'MS')
        and ss_net_profit between 50 and 25000));

The above query, run against data stored in ORC format, can fail because we 
convert the joins into a mapjoin in a non-conditional task, assuming that the 
mapjoin will succeed at runtime. At runtime, however, the query can fail 
because of memory overhead issues. The improvement to prevent such failures 
would be to use table statistics instead of calling 
ql.exec.Utilities.getTotalInputFileSize() inside the CommonJoinTaskDispatcher. 
This would ensure that we make better decisions in MR mode. Tez, on the other 
hand, handles such scenarios better because it actually relies on table stats 
to get the data size.
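
To illustrate why the choice of size estimate matters, here is a minimal 
sketch in Python. It is a toy model, not Hive's actual code: the TableInfo 
class, the fits_in_mapjoin() helper, the threshold, and all byte counts are 
hypothetical. It assumes that the on-disk size of a compressed ORC table is 
smaller than the raw data size that table statistics would report.

```python
from dataclasses import dataclass


@dataclass
class TableInfo:
    """Hypothetical size information for one small-side table of the join."""
    name: str
    on_disk_bytes: int   # compressed file size, as a file-size check would see it
    raw_data_bytes: int  # uncompressed data size, as table stats would report it


def fits_in_mapjoin(tables, threshold_bytes, use_stats):
    """Return True if the combined small-table size is within the threshold.

    use_stats=False models a decision based on input file size;
    use_stats=True models a decision based on table statistics.
    """
    total = sum(
        t.raw_data_bytes if use_stats else t.on_disk_bytes for t in tables
    )
    return total <= threshold_bytes


# Made-up sizes for the four dimension tables in the query above.
small_tables = [
    TableInfo("store", 100_000, 600_000),
    TableInfo("customer_demographics", 3_000_000, 18_000_000),
    TableInfo("customer_address", 2_500_000, 15_000_000),
    TableInfo("date_dim", 400_000, 2_400_000),
]

THRESHOLD_BYTES = 10_000_000  # hypothetical mapjoin size limit

by_file_size = fits_in_mapjoin(small_tables, THRESHOLD_BYTES, use_stats=False)
by_table_stats = fits_in_mapjoin(small_tables, THRESHOLD_BYTES, use_stats=True)
print("file-size estimate allows mapjoin:", by_file_size)
print("table-stats estimate allows mapjoin:", by_table_stats)
```

With these made-up numbers, the file-size estimate (6,000,000 bytes) stays 
under the limit and the mapjoin is chosen, while the statistics-based estimate 
(36,000,000 bytes) exceeds it and the mapjoin would be rejected at compile 
time rather than failing at runtime.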



--
This message was sent by Atlassian JIRA
(v6.2#6252)
