[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-11-01 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-8671:
-
   Resolution: Fixed
Fix Version/s: 0.15.0
   Status: Resolved  (was: Patch Available)

Patch committed to trunk and branch-0.14.

 Overflow in estimate row count and data size with fetch column stats
 

 Key: HIVE-8671
 URL: https://issues.apache.org/jira/browse/HIVE-8671
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Priority: Critical
 Fix For: 0.14.0, 0.15.0

 Attachments: HIVE-8671.1.patch, HIVE-8671.2.patch, HIVE-8671.3.patch, 
 HIVE-8671.4.patch, HIVE-8671.5.patch


 Overflow in row counts and data size for several TPC-DS queries.
 Interestingly the operators which have overflow end up running with a small 
 parallelism.
 For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
 {code}
Reducer 2 
 Reduce Operator Tree:
   Group By Operator
 aggregations: sum(VALUE._col0)
 keys: KEY._col0 (type: string), KEY._col1 (type: string), 
 KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
 mode: mergepartial
 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
 Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col3 (type: string), _col3 (type: string)
   sort order: ++
   Map-reduce partition columns: _col3 (type: string)
   Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
   value expressions: _col0 (type: string), _col1 (type: 
 string), _col2 (type: string), _col3 (type: string), _col4 (type: float), 
 _col5 (type: double)
 Execution mode: vectorized
 {code}
 {code}
 VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS 
 INPUT_RECORDS   OUTPUT_RECORDS 
 Map 1 62   26.41   1,779,510   
 211,978,502   60,628,390
 Map 5  14.28   6,950   
 138,098  138,098
 Map 6  12.44   3,910
 31   31
 Reducer 2  2   22.69  61,320
 60,628,390   69,182
 Reducer 3  12.63   3,910
 69,182  100
 Reducer 4  11.01   1,180   
 100  100
 {code}
 Query
 {code}
 explain  
 select  i_item_desc 
   ,i_category 
   ,i_class 
   ,i_current_price
   ,i_item_id
   ,sum(ws_ext_sales_price) as itemrevenue 
   ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
   (partition by i_class) as revenueratio
 from  
   web_sales
   ,item 
   ,date_dim
 where 
   web_sales.ws_item_sk = item.i_item_sk 
   and item.i_category in ('Jewelry', 'Sports', 'Books')
   and web_sales.ws_sold_date_sk = date_dim.d_date_sk
   and date_dim.d_date between '2001-01-12' and '2001-02-11'
 group by 
   i_item_id
 ,i_item_desc 
 ,i_category
 ,i_class
 ,i_current_price
 order by 
   i_category
 ,i_class
 ,i_item_id
 ,i_item_desc
 ,revenueratio
 limit 100
 {code}
 Explain 
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Tez
   Edges:
 Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
 Reducer 2 - Map 1 (SIMPLE_EDGE)
 Reducer 3 - Reducer 2 (SIMPLE_EDGE)
 Reducer 4 - Reducer 3 (SIMPLE_EDGE)
   DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: web_sales
   filterExpr: ws_item_sk is not null (type: boolean)
   Statistics: Num rows: 21594638446 Data size: 2850189889652 
 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
 predicate: ws_item_sk is not null (type: boolean)
 Statistics: Num rows: 21594638446 Data size: 172746300152 
 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: ws_item_sk (type: int), ws_ext_sales_price 
 (type: float), ws_sold_date_sk (type: int)
   outputColumnNames: _col0, _col1, _col2
  

[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-11-01 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated HIVE-8671:
-
Fix Version/s: (was: 0.15.0)

 Overflow in estimate row count and data size with fetch column stats
 

 Key: HIVE-8671
 URL: https://issues.apache.org/jira/browse/HIVE-8671
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Priority: Critical
 Fix For: 0.14.0

 Attachments: HIVE-8671.1.patch, HIVE-8671.2.patch, HIVE-8671.3.patch, 
 HIVE-8671.4.patch, HIVE-8671.5.patch


 Overflow in row counts and data size for several TPC-DS queries.
 Interestingly the operators which have overflow end up running with a small 
 parallelism.
 For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
 {code}
Reducer 2 
 Reduce Operator Tree:
   Group By Operator
 aggregations: sum(VALUE._col0)
 keys: KEY._col0 (type: string), KEY._col1 (type: string), 
 KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
 mode: mergepartial
 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
 Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col3 (type: string), _col3 (type: string)
   sort order: ++
   Map-reduce partition columns: _col3 (type: string)
   Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
   value expressions: _col0 (type: string), _col1 (type: 
 string), _col2 (type: string), _col3 (type: string), _col4 (type: float), 
 _col5 (type: double)
 Execution mode: vectorized
 {code}
 {code}
 VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS 
 INPUT_RECORDS   OUTPUT_RECORDS 
 Map 1 62   26.41   1,779,510   
 211,978,502   60,628,390
 Map 5  14.28   6,950   
 138,098  138,098
 Map 6  12.44   3,910
 31   31
 Reducer 2  2   22.69  61,320
 60,628,390   69,182
 Reducer 3  12.63   3,910
 69,182  100
 Reducer 4  11.01   1,180   
 100  100
 {code}
 Query
 {code}
 explain  
 select  i_item_desc 
   ,i_category 
   ,i_class 
   ,i_current_price
   ,i_item_id
   ,sum(ws_ext_sales_price) as itemrevenue 
   ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
   (partition by i_class) as revenueratio
 from  
   web_sales
   ,item 
   ,date_dim
 where 
   web_sales.ws_item_sk = item.i_item_sk 
   and item.i_category in ('Jewelry', 'Sports', 'Books')
   and web_sales.ws_sold_date_sk = date_dim.d_date_sk
   and date_dim.d_date between '2001-01-12' and '2001-02-11'
 group by 
   i_item_id
 ,i_item_desc 
 ,i_category
 ,i_class
 ,i_current_price
 order by 
   i_category
 ,i_class
 ,i_item_id
 ,i_item_desc
 ,revenueratio
 limit 100
 {code}
 Explain 
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Tez
   Edges:
 Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
 Reducer 2 - Map 1 (SIMPLE_EDGE)
 Reducer 3 - Reducer 2 (SIMPLE_EDGE)
 Reducer 4 - Reducer 3 (SIMPLE_EDGE)
   DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: web_sales
   filterExpr: ws_item_sk is not null (type: boolean)
   Statistics: Num rows: 21594638446 Data size: 2850189889652 
 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
 predicate: ws_item_sk is not null (type: boolean)
 Statistics: Num rows: 21594638446 Data size: 172746300152 
 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: ws_item_sk (type: int), ws_ext_sales_price 
 (type: float), ws_sold_date_sk (type: int)
   outputColumnNames: _col0, _col1, _col2
   Statistics: Num rows: 21594638446 Data size: 
 172746300152 Basic stats: 

[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-10-31 Thread Mostafa Mokhtar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mostafa Mokhtar updated HIVE-8671:
--
Attachment: HIVE-8671.1.patch

 Overflow in estimate row count and data size with fetch column stats
 

 Key: HIVE-8671
 URL: https://issues.apache.org/jira/browse/HIVE-8671
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Priority: Critical
 Fix For: 0.14.0

 Attachments: HIVE-8671.1.patch


 Overflow in row counts and data size for several TPC-DS queries.
 Interestingly the operators which have overflow end up running with a small 
 parallelism.
 For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
 {code}
Reducer 2 
 Reduce Operator Tree:
   Group By Operator
 aggregations: sum(VALUE._col0)
 keys: KEY._col0 (type: string), KEY._col1 (type: string), 
 KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
 mode: mergepartial
 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
 Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col3 (type: string), _col3 (type: string)
   sort order: ++
   Map-reduce partition columns: _col3 (type: string)
   Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
   value expressions: _col0 (type: string), _col1 (type: 
 string), _col2 (type: string), _col3 (type: string), _col4 (type: float), 
 _col5 (type: double)
 Execution mode: vectorized
 {code}
 {code}
 VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS 
 INPUT_RECORDS   OUTPUT_RECORDS 
 Map 1 62   26.41   1,779,510   
 211,978,502   60,628,390
 Map 5  14.28   6,950   
 138,098  138,098
 Map 6  12.44   3,910
 31   31
 Reducer 2  2   22.69  61,320
 60,628,390   69,182
 Reducer 3  12.63   3,910
 69,182  100
 Reducer 4  11.01   1,180   
 100  100
 {code}
 Query
 {code}
 explain  
 select  i_item_desc 
   ,i_category 
   ,i_class 
   ,i_current_price
   ,i_item_id
   ,sum(ws_ext_sales_price) as itemrevenue 
   ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
   (partition by i_class) as revenueratio
 from  
   web_sales
   ,item 
   ,date_dim
 where 
   web_sales.ws_item_sk = item.i_item_sk 
   and item.i_category in ('Jewelry', 'Sports', 'Books')
   and web_sales.ws_sold_date_sk = date_dim.d_date_sk
   and date_dim.d_date between '2001-01-12' and '2001-02-11'
 group by 
   i_item_id
 ,i_item_desc 
 ,i_category
 ,i_class
 ,i_current_price
 order by 
   i_category
 ,i_class
 ,i_item_id
 ,i_item_desc
 ,revenueratio
 limit 100
 {code}
 Explain 
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Tez
   Edges:
 Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
 Reducer 2 - Map 1 (SIMPLE_EDGE)
 Reducer 3 - Reducer 2 (SIMPLE_EDGE)
 Reducer 4 - Reducer 3 (SIMPLE_EDGE)
   DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: web_sales
   filterExpr: ws_item_sk is not null (type: boolean)
   Statistics: Num rows: 21594638446 Data size: 2850189889652 
 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
 predicate: ws_item_sk is not null (type: boolean)
 Statistics: Num rows: 21594638446 Data size: 172746300152 
 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: ws_item_sk (type: int), ws_ext_sales_price 
 (type: float), ws_sold_date_sk (type: int)
   outputColumnNames: _col0, _col1, _col2
   Statistics: Num rows: 21594638446 Data size: 
 172746300152 Basic stats: COMPLETE Column stats: COMPLETE
   Map Join Operator
 

[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-10-31 Thread Mostafa Mokhtar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mostafa Mokhtar updated HIVE-8671:
--
Attachment: HIVE-8671.2.patch

[~sershe]

 Overflow in estimate row count and data size with fetch column stats
 

 Key: HIVE-8671
 URL: https://issues.apache.org/jira/browse/HIVE-8671
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Priority: Critical
 Fix For: 0.14.0

 Attachments: HIVE-8671.1.patch, HIVE-8671.2.patch


 Overflow in row counts and data size for several TPC-DS queries.
 Interestingly the operators which have overflow end up running with a small 
 parallelism.
 For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
 {code}
Reducer 2 
 Reduce Operator Tree:
   Group By Operator
 aggregations: sum(VALUE._col0)
 keys: KEY._col0 (type: string), KEY._col1 (type: string), 
 KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
 mode: mergepartial
 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
 Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col3 (type: string), _col3 (type: string)
   sort order: ++
   Map-reduce partition columns: _col3 (type: string)
   Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
   value expressions: _col0 (type: string), _col1 (type: 
 string), _col2 (type: string), _col3 (type: string), _col4 (type: float), 
 _col5 (type: double)
 Execution mode: vectorized
 {code}
 {code}
 VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS 
 INPUT_RECORDS   OUTPUT_RECORDS 
 Map 1 62   26.41   1,779,510   
 211,978,502   60,628,390
 Map 5  14.28   6,950   
 138,098  138,098
 Map 6  12.44   3,910
 31   31
 Reducer 2  2   22.69  61,320
 60,628,390   69,182
 Reducer 3  12.63   3,910
 69,182  100
 Reducer 4  11.01   1,180   
 100  100
 {code}
 Query
 {code}
 explain  
 select  i_item_desc 
   ,i_category 
   ,i_class 
   ,i_current_price
   ,i_item_id
   ,sum(ws_ext_sales_price) as itemrevenue 
   ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
   (partition by i_class) as revenueratio
 from  
   web_sales
   ,item 
   ,date_dim
 where 
   web_sales.ws_item_sk = item.i_item_sk 
   and item.i_category in ('Jewelry', 'Sports', 'Books')
   and web_sales.ws_sold_date_sk = date_dim.d_date_sk
   and date_dim.d_date between '2001-01-12' and '2001-02-11'
 group by 
   i_item_id
 ,i_item_desc 
 ,i_category
 ,i_class
 ,i_current_price
 order by 
   i_category
 ,i_class
 ,i_item_id
 ,i_item_desc
 ,revenueratio
 limit 100
 {code}
 Explain 
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Tez
   Edges:
 Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
 Reducer 2 - Map 1 (SIMPLE_EDGE)
 Reducer 3 - Reducer 2 (SIMPLE_EDGE)
 Reducer 4 - Reducer 3 (SIMPLE_EDGE)
   DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: web_sales
   filterExpr: ws_item_sk is not null (type: boolean)
   Statistics: Num rows: 21594638446 Data size: 2850189889652 
 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
 predicate: ws_item_sk is not null (type: boolean)
 Statistics: Num rows: 21594638446 Data size: 172746300152 
 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: ws_item_sk (type: int), ws_ext_sales_price 
 (type: float), ws_sold_date_sk (type: int)
   outputColumnNames: _col0, _col1, _col2
   Statistics: Num rows: 21594638446 Data size: 
 172746300152 Basic stats: COMPLETE Column stats: COMPLETE
   Map Join 

[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-10-31 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-8671:
-
Attachment: HIVE-8671.3.patch

Added fix for overflow issue. Merged with changes from [~mmokhtar]'s patch. 
Tested the query on TPCDS 3 scale and the stats looks good without overflow.

 Overflow in estimate row count and data size with fetch column stats
 

 Key: HIVE-8671
 URL: https://issues.apache.org/jira/browse/HIVE-8671
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Priority: Critical
 Fix For: 0.14.0

 Attachments: HIVE-8671.1.patch, HIVE-8671.2.patch, HIVE-8671.3.patch


 Overflow in row counts and data size for several TPC-DS queries.
 Interestingly the operators which have overflow end up running with a small 
 parallelism.
 For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
 {code}
Reducer 2 
 Reduce Operator Tree:
   Group By Operator
 aggregations: sum(VALUE._col0)
 keys: KEY._col0 (type: string), KEY._col1 (type: string), 
 KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
 mode: mergepartial
 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
 Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col3 (type: string), _col3 (type: string)
   sort order: ++
   Map-reduce partition columns: _col3 (type: string)
   Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
   value expressions: _col0 (type: string), _col1 (type: 
 string), _col2 (type: string), _col3 (type: string), _col4 (type: float), 
 _col5 (type: double)
 Execution mode: vectorized
 {code}
 {code}
 VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS 
 INPUT_RECORDS   OUTPUT_RECORDS 
 Map 1 62   26.41   1,779,510   
 211,978,502   60,628,390
 Map 5  14.28   6,950   
 138,098  138,098
 Map 6  12.44   3,910
 31   31
 Reducer 2  2   22.69  61,320
 60,628,390   69,182
 Reducer 3  12.63   3,910
 69,182  100
 Reducer 4  11.01   1,180   
 100  100
 {code}
 Query
 {code}
 explain  
 select  i_item_desc 
   ,i_category 
   ,i_class 
   ,i_current_price
   ,i_item_id
   ,sum(ws_ext_sales_price) as itemrevenue 
   ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
   (partition by i_class) as revenueratio
 from  
   web_sales
   ,item 
   ,date_dim
 where 
   web_sales.ws_item_sk = item.i_item_sk 
   and item.i_category in ('Jewelry', 'Sports', 'Books')
   and web_sales.ws_sold_date_sk = date_dim.d_date_sk
   and date_dim.d_date between '2001-01-12' and '2001-02-11'
 group by 
   i_item_id
 ,i_item_desc 
 ,i_category
 ,i_class
 ,i_current_price
 order by 
   i_category
 ,i_class
 ,i_item_id
 ,i_item_desc
 ,revenueratio
 limit 100
 {code}
 Explain 
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Tez
   Edges:
 Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
 Reducer 2 - Map 1 (SIMPLE_EDGE)
 Reducer 3 - Reducer 2 (SIMPLE_EDGE)
 Reducer 4 - Reducer 3 (SIMPLE_EDGE)
   DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: web_sales
   filterExpr: ws_item_sk is not null (type: boolean)
   Statistics: Num rows: 21594638446 Data size: 2850189889652 
 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
 predicate: ws_item_sk is not null (type: boolean)
 Statistics: Num rows: 21594638446 Data size: 172746300152 
 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: ws_item_sk (type: int), ws_ext_sales_price 
 (type: float), ws_sold_date_sk (type: int)
   outputColumnNames: _col0, _col1, _col2
   

[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-10-31 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-8671:
-
Status: Patch Available  (was: Open)

 Overflow in estimate row count and data size with fetch column stats
 

 Key: HIVE-8671
 URL: https://issues.apache.org/jira/browse/HIVE-8671
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Priority: Critical
 Fix For: 0.14.0

 Attachments: HIVE-8671.1.patch, HIVE-8671.2.patch, HIVE-8671.3.patch


 Overflow in row counts and data size for several TPC-DS queries.
 Interestingly the operators which have overflow end up running with a small 
 parallelism.
 For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
 {code}
Reducer 2 
 Reduce Operator Tree:
   Group By Operator
 aggregations: sum(VALUE._col0)
 keys: KEY._col0 (type: string), KEY._col1 (type: string), 
 KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
 mode: mergepartial
 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
 Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col3 (type: string), _col3 (type: string)
   sort order: ++
   Map-reduce partition columns: _col3 (type: string)
   Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
   value expressions: _col0 (type: string), _col1 (type: 
 string), _col2 (type: string), _col3 (type: string), _col4 (type: float), 
 _col5 (type: double)
 Execution mode: vectorized
 {code}
 {code}
 VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS 
 INPUT_RECORDS   OUTPUT_RECORDS 
 Map 1 62   26.41   1,779,510   
 211,978,502   60,628,390
 Map 5  14.28   6,950   
 138,098  138,098
 Map 6  12.44   3,910
 31   31
 Reducer 2  2   22.69  61,320
 60,628,390   69,182
 Reducer 3  12.63   3,910
 69,182  100
 Reducer 4  11.01   1,180   
 100  100
 {code}
 Query
 {code}
 explain  
 select  i_item_desc 
   ,i_category 
   ,i_class 
   ,i_current_price
   ,i_item_id
   ,sum(ws_ext_sales_price) as itemrevenue 
   ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
   (partition by i_class) as revenueratio
 from  
   web_sales
   ,item 
   ,date_dim
 where 
   web_sales.ws_item_sk = item.i_item_sk 
   and item.i_category in ('Jewelry', 'Sports', 'Books')
   and web_sales.ws_sold_date_sk = date_dim.d_date_sk
   and date_dim.d_date between '2001-01-12' and '2001-02-11'
 group by 
   i_item_id
 ,i_item_desc 
 ,i_category
 ,i_class
 ,i_current_price
 order by 
   i_category
 ,i_class
 ,i_item_id
 ,i_item_desc
 ,revenueratio
 limit 100
 {code}
 Explain 
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Tez
   Edges:
 Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
 Reducer 2 - Map 1 (SIMPLE_EDGE)
 Reducer 3 - Reducer 2 (SIMPLE_EDGE)
 Reducer 4 - Reducer 3 (SIMPLE_EDGE)
   DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: web_sales
   filterExpr: ws_item_sk is not null (type: boolean)
   Statistics: Num rows: 21594638446 Data size: 2850189889652 
 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
 predicate: ws_item_sk is not null (type: boolean)
 Statistics: Num rows: 21594638446 Data size: 172746300152 
 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: ws_item_sk (type: int), ws_ext_sales_price 
 (type: float), ws_sold_date_sk (type: int)
   outputColumnNames: _col0, _col1, _col2
   Statistics: Num rows: 21594638446 Data size: 
 172746300152 Basic stats: COMPLETE Column stats: COMPLETE
   Map 

[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-10-31 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-8671:
-
Status: Open  (was: Patch Available)

 Overflow in estimate row count and data size with fetch column stats
 

 Key: HIVE-8671
 URL: https://issues.apache.org/jira/browse/HIVE-8671
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Priority: Critical
 Fix For: 0.14.0

 Attachments: HIVE-8671.1.patch, HIVE-8671.2.patch, HIVE-8671.3.patch


 Overflow in row counts and data size for several TPC-DS queries.
 Interestingly the operators which have overflow end up running with a small 
 parallelism.
 For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
 {code}
Reducer 2 
 Reduce Operator Tree:
   Group By Operator
 aggregations: sum(VALUE._col0)
 keys: KEY._col0 (type: string), KEY._col1 (type: string), 
 KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
 mode: mergepartial
 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
 Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col3 (type: string), _col3 (type: string)
   sort order: ++
   Map-reduce partition columns: _col3 (type: string)
   Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
   value expressions: _col0 (type: string), _col1 (type: 
 string), _col2 (type: string), _col3 (type: string), _col4 (type: float), 
 _col5 (type: double)
 Execution mode: vectorized
 {code}
 {code}
 VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS 
 INPUT_RECORDS   OUTPUT_RECORDS 
 Map 1 62   26.41   1,779,510   
 211,978,502   60,628,390
 Map 5  14.28   6,950   
 138,098  138,098
 Map 6  12.44   3,910
 31   31
 Reducer 2  2   22.69  61,320
 60,628,390   69,182
 Reducer 3  12.63   3,910
 69,182  100
 Reducer 4  11.01   1,180   
 100  100
 {code}
 Query
 {code}
 explain  
 select  i_item_desc 
   ,i_category 
   ,i_class 
   ,i_current_price
   ,i_item_id
   ,sum(ws_ext_sales_price) as itemrevenue 
   ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
   (partition by i_class) as revenueratio
 from  
   web_sales
   ,item 
   ,date_dim
 where 
   web_sales.ws_item_sk = item.i_item_sk 
   and item.i_category in ('Jewelry', 'Sports', 'Books')
   and web_sales.ws_sold_date_sk = date_dim.d_date_sk
   and date_dim.d_date between '2001-01-12' and '2001-02-11'
 group by 
   i_item_id
 ,i_item_desc 
 ,i_category
 ,i_class
 ,i_current_price
 order by 
   i_category
 ,i_class
 ,i_item_id
 ,i_item_desc
 ,revenueratio
 limit 100
 {code}
 Explain 
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Tez
   Edges:
 Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
 Reducer 2 - Map 1 (SIMPLE_EDGE)
 Reducer 3 - Reducer 2 (SIMPLE_EDGE)
 Reducer 4 - Reducer 3 (SIMPLE_EDGE)
   DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: web_sales
   filterExpr: ws_item_sk is not null (type: boolean)
   Statistics: Num rows: 21594638446 Data size: 2850189889652 
 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
 predicate: ws_item_sk is not null (type: boolean)
 Statistics: Num rows: 21594638446 Data size: 172746300152 
 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: ws_item_sk (type: int), ws_ext_sales_price 
 (type: float), ws_sold_date_sk (type: int)
   outputColumnNames: _col0, _col1, _col2
   Statistics: Num rows: 21594638446 Data size: 
 172746300152 Basic stats: COMPLETE Column stats: COMPLETE
   Map 

[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-10-31 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-8671:
-
Attachment: HIVE-8671.4.patch

Addressed [~sershe]'s code review changes.

 Overflow in estimate row count and data size with fetch column stats
 

 Key: HIVE-8671
 URL: https://issues.apache.org/jira/browse/HIVE-8671
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Priority: Critical
 Fix For: 0.14.0

 Attachments: HIVE-8671.1.patch, HIVE-8671.2.patch, HIVE-8671.3.patch, 
 HIVE-8671.4.patch


 Overflow in row counts and data size for several TPC-DS queries.
 Interestingly the operators which have overflow end up running with a small 
 parallelism.
 For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
 {code}
Reducer 2 
 Reduce Operator Tree:
   Group By Operator
 aggregations: sum(VALUE._col0)
 keys: KEY._col0 (type: string), KEY._col1 (type: string), 
 KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
 mode: mergepartial
 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
 Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col3 (type: string), _col3 (type: string)
   sort order: ++
   Map-reduce partition columns: _col3 (type: string)
   Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
   value expressions: _col0 (type: string), _col1 (type: 
 string), _col2 (type: string), _col3 (type: string), _col4 (type: float), 
 _col5 (type: double)
 Execution mode: vectorized
 {code}
 {code}
 VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS 
 INPUT_RECORDS   OUTPUT_RECORDS 
 Map 1 62   26.41   1,779,510   
 211,978,502   60,628,390
 Map 5  14.28   6,950   
 138,098  138,098
 Map 6  12.44   3,910
 31   31
 Reducer 2  2   22.69  61,320
 60,628,390   69,182
 Reducer 3  12.63   3,910
 69,182  100
 Reducer 4  11.01   1,180   
 100  100
 {code}
 Query
 {code}
 explain  
 select  i_item_desc 
   ,i_category 
   ,i_class 
   ,i_current_price
   ,i_item_id
   ,sum(ws_ext_sales_price) as itemrevenue 
   ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
   (partition by i_class) as revenueratio
 from  
   web_sales
   ,item 
   ,date_dim
 where 
   web_sales.ws_item_sk = item.i_item_sk 
   and item.i_category in ('Jewelry', 'Sports', 'Books')
   and web_sales.ws_sold_date_sk = date_dim.d_date_sk
   and date_dim.d_date between '2001-01-12' and '2001-02-11'
 group by 
   i_item_id
 ,i_item_desc 
 ,i_category
 ,i_class
 ,i_current_price
 order by 
   i_category
 ,i_class
 ,i_item_id
 ,i_item_desc
 ,revenueratio
 limit 100
 {code}
 Explain 
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Tez
   Edges:
 Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
 Reducer 2 - Map 1 (SIMPLE_EDGE)
 Reducer 3 - Reducer 2 (SIMPLE_EDGE)
 Reducer 4 - Reducer 3 (SIMPLE_EDGE)
   DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: web_sales
   filterExpr: ws_item_sk is not null (type: boolean)
   Statistics: Num rows: 21594638446 Data size: 2850189889652 
 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
 predicate: ws_item_sk is not null (type: boolean)
 Statistics: Num rows: 21594638446 Data size: 172746300152 
 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: ws_item_sk (type: int), ws_ext_sales_price 
 (type: float), ws_sold_date_sk (type: int)
   outputColumnNames: _col0, _col1, _col2
   Statistics: Num rows: 21594638446 Data size: 
 172746300152 Basic stats: 

[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-10-31 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-8671:
-
Status: Patch Available  (was: Open)

 Overflow in estimate row count and data size with fetch column stats
 

 Key: HIVE-8671
 URL: https://issues.apache.org/jira/browse/HIVE-8671
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Priority: Critical
 Fix For: 0.14.0

 Attachments: HIVE-8671.1.patch, HIVE-8671.2.patch, HIVE-8671.3.patch, 
 HIVE-8671.4.patch


 Overflow in row counts and data size for several TPC-DS queries.
 Interestingly the operators which have overflow end up running with a small 
 parallelism.
 For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
 {code}
Reducer 2 
 Reduce Operator Tree:
   Group By Operator
 aggregations: sum(VALUE._col0)
 keys: KEY._col0 (type: string), KEY._col1 (type: string), 
 KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
 mode: mergepartial
 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
 Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col3 (type: string), _col3 (type: string)
   sort order: ++
   Map-reduce partition columns: _col3 (type: string)
   Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
   value expressions: _col0 (type: string), _col1 (type: 
 string), _col2 (type: string), _col3 (type: string), _col4 (type: float), 
 _col5 (type: double)
 Execution mode: vectorized
 {code}
 {code}
 VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS 
 INPUT_RECORDS   OUTPUT_RECORDS 
 Map 1 62   26.41   1,779,510   
 211,978,502   60,628,390
 Map 5  14.28   6,950   
 138,098  138,098
 Map 6  12.44   3,910
 31   31
 Reducer 2  2   22.69  61,320
 60,628,390   69,182
 Reducer 3  12.63   3,910
 69,182  100
 Reducer 4  11.01   1,180   
 100  100
 {code}
 Query
 {code}
 explain  
 select  i_item_desc 
   ,i_category 
   ,i_class 
   ,i_current_price
   ,i_item_id
   ,sum(ws_ext_sales_price) as itemrevenue 
   ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
   (partition by i_class) as revenueratio
 from  
   web_sales
   ,item 
   ,date_dim
 where 
   web_sales.ws_item_sk = item.i_item_sk 
   and item.i_category in ('Jewelry', 'Sports', 'Books')
   and web_sales.ws_sold_date_sk = date_dim.d_date_sk
   and date_dim.d_date between '2001-01-12' and '2001-02-11'
 group by 
   i_item_id
 ,i_item_desc 
 ,i_category
 ,i_class
 ,i_current_price
 order by 
   i_category
 ,i_class
 ,i_item_id
 ,i_item_desc
 ,revenueratio
 limit 100
 {code}
 Explain 
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Tez
   Edges:
 Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
 Reducer 2 - Map 1 (SIMPLE_EDGE)
 Reducer 3 - Reducer 2 (SIMPLE_EDGE)
 Reducer 4 - Reducer 3 (SIMPLE_EDGE)
   DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: web_sales
   filterExpr: ws_item_sk is not null (type: boolean)
   Statistics: Num rows: 21594638446 Data size: 2850189889652 
 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
 predicate: ws_item_sk is not null (type: boolean)
 Statistics: Num rows: 21594638446 Data size: 172746300152 
 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: ws_item_sk (type: int), ws_ext_sales_price 
 (type: float), ws_sold_date_sk (type: int)
   outputColumnNames: _col0, _col1, _col2
   Statistics: Num rows: 21594638446 Data size: 
 172746300152 Basic stats: COMPLETE Column stats: COMPLETE
  

[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-10-31 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-8671:
-
Attachment: HIVE-8671.5.patch

Fixes test failures.

 Overflow in estimate row count and data size with fetch column stats
 

 Key: HIVE-8671
 URL: https://issues.apache.org/jira/browse/HIVE-8671
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Priority: Critical
 Fix For: 0.14.0

 Attachments: HIVE-8671.1.patch, HIVE-8671.2.patch, HIVE-8671.3.patch, 
 HIVE-8671.4.patch, HIVE-8671.5.patch


 Overflow in row counts and data size for several TPC-DS queries.
 Interestingly the operators which have overflow end up running with a small 
 parallelism.
 For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
 {code}
Reducer 2 
 Reduce Operator Tree:
   Group By Operator
 aggregations: sum(VALUE._col0)
 keys: KEY._col0 (type: string), KEY._col1 (type: string), 
 KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
 mode: mergepartial
 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
 Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
 Reduce Output Operator
   key expressions: _col3 (type: string), _col3 (type: string)
   sort order: ++
   Map-reduce partition columns: _col3 (type: string)
   Statistics: Num rows: 9223372036854775807 Data size: 
 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
   value expressions: _col0 (type: string), _col1 (type: 
 string), _col2 (type: string), _col3 (type: string), _col4 (type: float), 
 _col5 (type: double)
 Execution mode: vectorized
 {code}
 {code}
 VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS 
 INPUT_RECORDS   OUTPUT_RECORDS 
 Map 1 62   26.41   1,779,510   
 211,978,502   60,628,390
 Map 5  14.28   6,950   
 138,098  138,098
 Map 6  12.44   3,910
 31   31
 Reducer 2  2   22.69  61,320
 60,628,390   69,182
 Reducer 3  12.63   3,910
 69,182  100
 Reducer 4  11.01   1,180   
 100  100
 {code}
 Query
 {code}
 explain  
 select  i_item_desc 
   ,i_category 
   ,i_class 
   ,i_current_price
   ,i_item_id
   ,sum(ws_ext_sales_price) as itemrevenue 
   ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
   (partition by i_class) as revenueratio
 from  
   web_sales
   ,item 
   ,date_dim
 where 
   web_sales.ws_item_sk = item.i_item_sk 
   and item.i_category in ('Jewelry', 'Sports', 'Books')
   and web_sales.ws_sold_date_sk = date_dim.d_date_sk
   and date_dim.d_date between '2001-01-12' and '2001-02-11'
 group by 
   i_item_id
 ,i_item_desc 
 ,i_category
 ,i_class
 ,i_current_price
 order by 
   i_category
 ,i_class
 ,i_item_id
 ,i_item_desc
 ,revenueratio
 limit 100
 {code}
 Explain 
 {code}
 STAGE PLANS:
   Stage: Stage-1
 Tez
   Edges:
 Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
 Reducer 2 - Map 1 (SIMPLE_EDGE)
 Reducer 3 - Reducer 2 (SIMPLE_EDGE)
 Reducer 4 - Reducer 3 (SIMPLE_EDGE)
   DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
   Vertices:
 Map 1 
 Map Operator Tree:
 TableScan
   alias: web_sales
   filterExpr: ws_item_sk is not null (type: boolean)
   Statistics: Num rows: 21594638446 Data size: 2850189889652 
 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
 predicate: ws_item_sk is not null (type: boolean)
 Statistics: Num rows: 21594638446 Data size: 172746300152 
 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: ws_item_sk (type: int), ws_ext_sales_price 
 (type: float), ws_sold_date_sk (type: int)
   outputColumnNames: _col0, _col1, _col2
   Statistics: Num rows: 21594638446 Data size: 
 172746300152 Basic stats: 

[jira] [Updated] (HIVE-8671) Overflow in estimate row count and data size with fetch column stats

2014-10-30 Thread Mostafa Mokhtar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mostafa Mokhtar updated HIVE-8671:
--
Description: 
Overflow in row counts and data size for several TPC-DS queries.
Interestingly the operators which have overflow end up running with a small 
parallelism.

For instance Reducer 2 has an overflow but it only runs with parallelism of 2.
{code}
   Reducer 2 
Reduce Operator Tree:
  Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: string), KEY._col1 (type: string), 
KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 9223372036854775807 Data size: 
9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
  key expressions: _col3 (type: string), _col3 (type: string)
  sort order: ++
  Map-reduce partition columns: _col3 (type: string)
  Statistics: Num rows: 9223372036854775807 Data size: 
9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE
  value expressions: _col0 (type: string), _col1 (type: 
string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 
(type: double)
Execution mode: vectorized
{code}

{code}
VERTEX   TOTAL_TASKSDURATION_SECONDS CPU_TIME_MILLIS INPUT_RECORDS  
 OUTPUT_RECORDS 
Map 1 62   26.41   1,779,510   211,978,502  
 60,628,390
Map 5  14.28   6,950   138,098  
138,098
Map 6  12.44   3,91031  
 31
Reducer 2  2   22.69  61,32060,628,390  
 69,182
Reducer 3  12.63   3,91069,182  
100
Reducer 4  11.01   1,180   100  
100
{code}

Query
{code}
explain  
select  i_item_desc 
  ,i_category 
  ,i_class 
  ,i_current_price
  ,i_item_id
  ,sum(ws_ext_sales_price) as itemrevenue 
  ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
  (partition by i_class) as revenueratio
from
web_sales
,item 
,date_dim
where 
web_sales.ws_item_sk = item.i_item_sk 
and item.i_category in ('Jewelry', 'Sports', 'Books')
and web_sales.ws_sold_date_sk = date_dim.d_date_sk
and date_dim.d_date between '2001-01-12' and '2001-02-11'
group by 
i_item_id
,i_item_desc 
,i_category
,i_class
,i_current_price
order by 
i_category
,i_class
,i_item_id
,i_item_desc
,revenueratio
limit 100
{code}

Explain 
{code}
STAGE PLANS:
  Stage: Stage-1
Tez
  Edges:
Map 1 - Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE)
Reducer 2 - Map 1 (SIMPLE_EDGE)
Reducer 3 - Reducer 2 (SIMPLE_EDGE)
Reducer 4 - Reducer 3 (SIMPLE_EDGE)
  DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: web_sales
  filterExpr: ws_item_sk is not null (type: boolean)
  Statistics: Num rows: 21594638446 Data size: 2850189889652 
Basic stats: COMPLETE Column stats: COMPLETE
  Filter Operator
predicate: ws_item_sk is not null (type: boolean)
Statistics: Num rows: 21594638446 Data size: 172746300152 
Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
  expressions: ws_item_sk (type: int), ws_ext_sales_price 
(type: float), ws_sold_date_sk (type: int)
  outputColumnNames: _col0, _col1, _col2
  Statistics: Num rows: 21594638446 Data size: 172746300152 
Basic stats: COMPLETE Column stats: COMPLETE
  Map Join Operator
condition map:
 Inner Join 0 to 1
condition expressions:
  0 {_col0} {_col1}
  1 
keys:
  0 _col2 (type: int)
  1 _col0 (type: int)
outputColumnNames: _col0, _col1
input vertices:
  1 Map 6
Statistics: Num rows: 24145061366 Data size: 
193160490928 Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator