[
https://issues.apache.org/jira/browse/HIVE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145206#comment-14145206
]
Mostafa Mokhtar commented on HIVE-8196:
---------------------------------------
Explain after applying patch
{code}
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Map 1 <- Map 2 (BROADCAST_EDGE)
DagName: mmokhtar_20140923143232_119110fe-94e2-4c06-8aae-18ab93967a84:1
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: store_sales
filterExpr: ss_sold_date_sk is not null (type: boolean)
Statistics: Num rows: 550076554 Data size: 24008004411 Basic
stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: ss_sold_date_sk is not null (type: boolean)
Statistics: Num rows: 550076554 Data size: 2200306216 Basic
stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {ss_sold_date_sk}
1 {d_date_sk} {d_date}
keys:
0 ss_sold_date_sk (type: int)
1 d_date_sk (type: int)
outputColumnNames: _col0, _col27, _col29
input vertices:
1 Map 2
Statistics: Num rows: 614126563 Data size: 62640909426
Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (_col0 = _col27) (type: boolean)
Statistics: Num rows: 307063281 Data size: 31320454662
Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col29 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 307063281 Data size:
28863948414 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 307063281 Data size:
28863948414 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format:
org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Map 2
Map Operator Tree:
TableScan
alias: date_dim
filterExpr: (d_date_sk is not null and (d_year = 1998))
(type: boolean)
Statistics: Num rows: 73049 Data size: 81741831 Basic stats:
COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (d_date_sk is not null and (d_year = 1998))
(type: boolean)
Statistics: Num rows: 652 Data size: 66504 Basic stats:
COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: d_date_sk (type: int)
sort order: +
Map-reduce partition columns: d_date_sk (type: int)
Statistics: Num rows: 652 Data size: 66504 Basic stats:
COMPLETE Column stats: COMPLETE
value expressions: d_date (type: string)
Execution mode: vectorized
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
> Joining on partition columns with fetch column stats enabled results it very
> small CE which negatively affects query performance
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-8196
> URL: https://issues.apache.org/jira/browse/HIVE-8196
> Project: Hive
> Issue Type: Bug
> Components: Physical Optimizer
> Affects Versions: 0.14.0
> Reporter: Mostafa Mokhtar
> Assignee: Prasanth J
> Priority: Critical
> Labels: performance
> Fix For: 0.14.0
>
> Attachments: HIVE-8196.1.patch
>
>
> To make the best out of dynamic partition pruning joins should be on the
> partitioning columns which results in dynamically pruning the partitions from
> the fact table based on the qualifying column keys from the dimension table,
> this type of joins negatively effects on cardinality estimates with fetch
> column stats enabled.
> Currently we don't have statistics for partition columns and as a result NDV
> is set to row count, doing that negatively affects the estimated join
> selectivity from the join.
> Workaround is to capture statistics for partition columns or use number of
> partitions incase dynamic partitioning is used.
> In StatsUtils.getColStatisticsFromExpression is where count distincts gets
> set to row count
> {code}
> if (encd.getIsPartitionColOrVirtualCol()) {
> // vitual columns
> colType = encd.getTypeInfo().getTypeName();
> countDistincts = numRows;
> oi = encd.getWritableObjectInspector();
> {code}
> Query used to repro the issue :
> {code}
> set hive.stats.fetch.column.stats=true;
> set hive.tez.dynamic.partition.pruning=true;
> explain select d_date
> from store_sales, date_dim
> where
> store_sales.ss_sold_date_sk = date_dim.d_date_sk and
> date_dim.d_year = 1998;
> {code}
> Plan
> {code}
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> Edges:
> Map 1 <- Map 2 (BROADCAST_EDGE)
> DagName: mmokhtar_20140919180404_945d29f5-d041-4420-9666-1c5d64fa6540:8
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> filterExpr: ss_sold_date_sk is not null (type: boolean)
> Statistics: Num rows: 550076554 Data size: 47370018816
> Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {ss_sold_date_sk}
> 1 {d_date_sk} {d_date}
> keys:
> 0 ss_sold_date_sk (type: int)
> 1 d_date_sk (type: int)
> outputColumnNames: _col22, _col26, _col28
> input vertices:
> 1 Map 2
> Statistics: Num rows: 652 Data size: 66504 Basic stats:
> COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: (_col22 = _col26) (type: boolean)
> Statistics: Num rows: 326 Data size: 33252 Basic stats:
> COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: _col28 (type: string)
> outputColumnNames: _col0
> Statistics: Num rows: 326 Data size: 30644 Basic
> stats: COMPLETE Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 326 Data size: 30644 Basic
> stats: COMPLETE Column stats: COMPLETE
> table:
> input format:
> org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Execution mode: vectorized
> Map 2
> Map Operator Tree:
> TableScan
> alias: date_dim
> filterExpr: (d_date_sk is not null and (d_year = 1998))
> (type: boolean)
> Statistics: Num rows: 73049 Data size: 81741831 Basic
> stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: (d_date_sk is not null and (d_year = 1998))
> (type: boolean)
> Statistics: Num rows: 652 Data size: 66504 Basic stats:
> COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: d_date_sk (type: int)
> sort order: +
> Map-reduce partition columns: d_date_sk (type: int)
> Statistics: Num rows: 652 Data size: 66504 Basic stats:
> COMPLETE Column stats: COMPLETE
> value expressions: d_date (type: string)
> Select Operator
> expressions: d_date_sk (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 652 Data size: 0 Basic stats:
> PARTIAL Column stats: COMPLETE
> Group By Operator
> keys: _col0 (type: int)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 652 Data size: 0 Basic stats:
> PARTIAL Column stats: COMPLETE
> Dynamic Partitioning Event Operator
> Target Input: store_sales
> Partition key expr: ss_sold_date_sk
> Statistics: Num rows: 652 Data size: 0 Basic stats:
> PARTIAL Column stats: COMPLETE
> Target column: ss_sold_date_sk
> Target Vertex: Map 1
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)