Hari Sankar Sivarama Subramaniyan created HIVE-7751:
-------------------------------------------------------
Summary: Mapjoin set in a non-conditional task can fail in MR
mode because of memory overhead issues
Key: HIVE-7751
URL: https://issues.apache.org/jira/browse/HIVE-7751
Project: Hive
Issue Type: Bug
Reporter: Hari Sankar Sivarama Subramaniyan
Assignee: Hari Sankar Sivarama Subramaniyan
select sum(ss_quantity)
from store_sales
join store on store.s_store_sk = store_sales.ss_store_sk
join customer_demographics
  on customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk
join customer_address
  on store_sales.ss_addr_sk = customer_address.ca_address_sk
join date_dim on store_sales.ss_sold_date_sk = date_dim.d_date_sk
where d_year = 2000
  and ((cd_marital_status = 'M' and cd_education_status = 'Advanced Degree'
        and ss_sales_price between 100.00 and 150.00)
    or (cd_marital_status = 'M' and cd_education_status = 'Advanced Degree'
        and ss_sales_price between 50.00 and 100.00)
    or (cd_marital_status = 'M' and cd_education_status = 'Advanced Degree'
        and ss_sales_price between 150.00 and 200.00))
  and ((ca_country = 'United States' and ca_state in ('TX', 'OH', 'TX')
        and ss_net_profit between 0 and 2000)
    or (ca_country = 'United States' and ca_state in ('OR', 'MN', 'KY')
        and ss_net_profit between 150 and 3000)
    or (ca_country = 'United States' and ca_state in ('VA', 'TX', 'MS')
        and ss_net_profit between 50 and 25000));
The above query, with the data stored in ORC format, can fail because we
convert the join to a non-conditional task on the assumption that the mapjoin
will succeed at runtime. At runtime, however, the query can fail due to memory
overhead issues, likely because the on-disk size of the ORC files understates
the in-memory size of the data. An improvement that would prevent such failures
is to use table statistics instead of calling
ql.exec.Utilities.getTotalInputFileSize() inside the CommonJoinTaskDispatcher.
This would lead to better decisions in MR mode. Tez, on the other hand, handles
such scenarios better because it actually relies on table stats to get the data
size.
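To illustrate the difference between the two size estimates, here is a minimal sketch in Python. It is not Hive's code: the function choose_join_plan, the threshold, and all byte counts are hypothetical, chosen only to show how a decision based on on-disk file size can pick a mapjoin that a decision based on table statistics would reject.

```python
# Hypothetical sketch; this does not reproduce CommonJoinTaskDispatcher.
# All sizes and the threshold below are invented for illustration.

def choose_join_plan(small_table_sizes, threshold_bytes):
    """Pick 'mapjoin' if the combined size estimate of the small tables
    is within the threshold, otherwise fall back to 'common join'."""
    total = sum(small_table_sizes.values())
    return "mapjoin" if total <= threshold_bytes else "common join"

# Assumed threshold for converting to a non-conditional mapjoin.
THRESHOLD_BYTES = 10_000_000

# Estimate 1: on-disk file sizes (compressed columnar files can be small).
on_disk_sizes = {
    "store": 20_000,
    "customer_demographics": 3_000_000,
    "customer_address": 2_500_000,
    "date_dim": 500_000,
}

# Estimate 2: uncompressed data sizes as table statistics might report them.
stats_sizes = {
    "store": 100_000,
    "customer_demographics": 80_000_000,
    "customer_address": 55_000_000,
    "date_dim": 10_000_000,
}

plan_from_files = choose_join_plan(on_disk_sizes, THRESHOLD_BYTES)
plan_from_stats = choose_join_plan(stats_sizes, THRESHOLD_BYTES)

print("decision from file sizes:", plan_from_files)
print("decision from table stats:", plan_from_stats)
```

With these made-up numbers the file-size estimate stays under the threshold while the statistics-based estimate does not, which is the kind of mismatch described above.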