[ 
https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121023#comment-16121023
 ] 

liyunzhang_intel edited comment on HIVE-17287 at 8/10/17 4:02 AM:
------------------------------------------------------------------

[~gopalv]:  thanks for your comments
{quote}
As little as I know about Hive-on-Spark, Query67 does not read any row from the 
default partition in MR or Tez.

{quote}
the default_partition stores the data(25.7G) which ss_sold_date_sk is null. In 
MR/Tez, these part of data will not be load because there is no match data in 
join? 

{quote}
I suspect HoS is loading each partition as an independent RDD, which removes 
the effect of SemanticAnalyzer::genNotNullFilterForJoinSourcePlan()?
{quote}
I also see the filter to filter null data in the explain, although HOS load 
partition as an independent RDD, i think the filter should work after loading 
data in theory.
part of explain of query67
{code}
 Map 1 
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_store_sk is not null and ss_item_sk is not 
null) (type: boolean)
                  Statistics: Num rows: 8251124389 Data size: 181524736558 
Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (ss_store_sk is not null and ss_item_sk is not 
null) (type: boolean)
                    Statistics: Num rows: 8251124389 Data size: 181524736558 
Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ss_item_sk (type: bigint), ss_store_sk 
(type: bigint), ss_quantity (type: int), ss_sales_price (type: double), 
ss_sold_date_sk (type: bigint) 
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4
                      Statistics: Num rows: 8251124389 Data size: 181524736558 
Basic stats: COMPLETE Column stats: NONE
                      Map Join Operator
{code}


was (Author: kellyzly):
[~gopalv]:  thanks for your comments
{quote}
As little as I know about Hive-on-Spark, Query67 does not read any row from the 
default partition in MR or Tez.

{quote}
the default_partition stores the data(25.7G) which ss_sold_date_sk is null. In 
MR/Tez, these part of data will not be load because there is no match data in 
join? 

{quote}
I suspect HoS is loading each partition as an independent RDD, which removes 
the effect of SemanticAnalyzer::genNotNullFilterForJoinSourcePlan()?
{quote}
I also see the filter to filter null data in the explain, although hos load 
partition as an independent RDD, i think the filter should work in theory. 
part of explain of query67
{code}
 Map 1 
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_store_sk is not null and ss_item_sk is not 
null) (type: boolean)
                  Statistics: Num rows: 8251124389 Data size: 181524736558 
Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (ss_store_sk is not null and ss_item_sk is not 
null) (type: boolean)
                    Statistics: Num rows: 8251124389 Data size: 181524736558 
Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ss_item_sk (type: bigint), ss_store_sk 
(type: bigint), ss_quantity (type: int), ss_sales_price (type: double), 
ss_sold_date_sk (type: bigint) 
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4
                      Statistics: Num rows: 8251124389 Data size: 181524736558 
Basic stats: COMPLETE Column stats: NONE
                      Map Join Operator
{code}

> HoS can not deal with skewed data group by
> ------------------------------------------
>
>                 Key: HIVE-17287
>                 URL: https://issues.apache.org/jira/browse/HIVE-17287
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>
> In 
> [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql],
>  fact table {{store_sales}} joins with small tables {{date_dim}}, 
> {{item}},{{store}}. After join, groupby the intermediate data.
> Here the data of {{store_sales}} on 3TB tpcds is skewed:  there are 1824 
> partitions. The biggest partition is 25.7G and others are 715M.
> {code}
> hadoop fs -du -h 
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales
> ....
> 715.0 M  
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639
> 713.9 M  
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640
> 714.1 M  
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641
> 712.9 M  
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642
> 25.7 G   
> /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
> {code}
> The skewed table {{store_sales}} caused the failed job. Is there any way to 
> solve the groupby problem of skewed table?  I tried to enable 
> {{hive.groupby.skewindata}} to first divide the data more evenly then start 
> do group by. But the job still hangs. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to