[
https://issues.apache.org/jira/browse/HIVE-14082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346777#comment-15346777
]
Sahil Takiar commented on HIVE-14082:
-------------------------------------
After a lot of investigation into these two errors, they appear to be related.
The TL;DR is that I think there is a bug in the {{ReduceSinkOperator}} class,
but I am not 100% sure.
I started debugging Exception 1 first. It seems that
{{GenericUDFOPEqualOrGreaterThan}} tries to compare a {{Text}} object with
value {{value_1_3}} to a {{DateWritable}} object with value {{2015-06-22}}.
This happens inside the {{ExecReducer.reduce(...)}} method and causes the query
to fail with a {{ClassCastException}}.
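To make the failure mode concrete, here is a minimal standalone sketch of the failing cast (hypothetical repro code, not Hive source; only the two {{Writable}} classes come from the stack trace). The reduce-side object inspector believes the column is a {{DATE}}, but the row actually carries a {{Text}} value:
{code}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.hive.serde2.io.DateWritable;

public class CastRepro {
  public static void main(String[] args) {
    // The row field actually holds the string column's value...
    Object columnValue = new Text("value_1_3");
    // ...but WritableDateObjectInspector.getPrimitiveWritableObject(...)
    // casts it straight to DateWritable, producing:
    // java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
    // cast to org.apache.hadoop.hive.serde2.io.DateWritable
    DateWritable date = (DateWritable) columnValue;
  }
}
{code}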
I compared the query plans for Query 1 vs. Query 3 (the output of {{EXPLAIN
...}}). Both plans require a single Map-Reduce job. The major difference is
that Query 3 runs a Filter Operator in both the Map Task and the Reduce Task,
while Query 1 only runs a Filter Operator in the Map Task. I suspect the Filter
Operator in the Reduce Task of Query 3 is where the failure occurs.
Here is the {{EXPLAIN}} output for Query 3:
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0
  Stage-1 depends on stages: Stage-2
  Stage-4 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: multi_table_insert_source
            Statistics: Num rows: 5 Data size: 250 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((date_column >= 2013-06-21) or (date_column >= 2015-06-22)) (type: boolean)
              Statistics: Num rows: 2 Data size: 100 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: column_1 (type: string), column_2 (type: string), column_3 (type: string), date_column (type: date)
                outputColumnNames: column_1, column_2, column_3, date_column
                Statistics: Num rows: 2 Data size: 100 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: column_1 (type: string), column_2 (type: string), column_3 (type: string), date_column (type: date)
                  sort order: ++++
                  Map-reduce partition columns: column_1 (type: string), column_2 (type: string)
                  Statistics: Num rows: 2 Data size: 100 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Forward
          Statistics: Num rows: 2 Data size: 100 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (KEY._col2:1._col0 >= 2015-06-22) (type: boolean)
            Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
            Group By Operator
              aggregations: count(), count(DISTINCT KEY._col2:0._col0), count(DISTINCT KEY._col2:1._col0, KEY._col2:1._col1)
              keys: KEY._col0 (type: string), KEY._col1 (type: string)
              mode: complete
              outputColumnNames: _col0, _col1, _col2, _col3, _col4
              Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: string), _col1 (type: string), UDFToInteger(_col2) (type: int), UDFToInteger(_col3) (type: int), UDFToInteger(_col4) (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4
                Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      name: multi_table_insert_bug.multi_table_insert_test
          Filter Operator
            predicate: (KEY._col2:1._col0 >= 2013-06-21) (type: boolean)
            Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
            Group By Operator
              aggregations: count(), count(DISTINCT KEY._col2:0._col0), count(DISTINCT KEY._col2:1._col0, KEY._col2:1._col1)
              keys: KEY._col0 (type: string), KEY._col1 (type: string)
              mode: complete
              outputColumnNames: _col0, _col1, _col2, _col3, _col4
              Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: string), _col1 (type: string), UDFToInteger(_col2) (type: int), UDFToInteger(_col3) (type: int), UDFToInteger(_col4) (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4
                Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      name: multi_table_insert_bug.multi_table_insert_test

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            partition_column 365
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: multi_table_insert_bug.multi_table_insert_test

  Stage: Stage-3
    Stats-Aggr Operator

  Stage: Stage-1
    Move Operator
      tables:
          partition:
            partition_column 1096
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: multi_table_insert_bug.multi_table_insert_test

  Stage: Stage-4
    Stats-Aggr Operator
{code}
The {{ReduceSinkOperator}} class is emitting the following record:
*Record 1:*
{code}
{"key":{"_col0":"value_1_1","_col1":"value_1_2","_col2":{0:{"_col0":"value_1_3"}}},"value":null}
{code}
This record does not contain the {{date_column}} column, which causes the
{{FilterOperator}} in the {{ExecReducer}} to fail: the union in {{_col2}}
carries tag {{0}} (the {{COUNT(DISTINCT column_3)}} branch), so when the filter
evaluates {{KEY._col2:1._col0 >= 2015-06-22}} against it, the evaluator appears
to read the {{Text}} value {{value_1_3}} where the object inspector expects a
{{DateWritable}}, which is exactly the {{ClassCastException}} in Exception 1.
Once I determined that {{ReduceSinkOperator}} is emitting this record, I did my
best to trace through the logic of {{ReduceSinkOperator}} to figure out why it
emits Record 1. The {{ReduceSinkOperator}} seems to emit one record per
{{DISTINCT}} clause in the query, but each record carries only a subset of the
original columns: the columns in the {{DISTINCT}} clause itself, plus the
columns in the {{GROUP BY}}. For example, for {{COUNT(DISTINCT column_3)}} it
emits {{column_1}}, {{column_2}}, and {{column_3}}, but not {{date_column}}.
I'm not sure the {{ReduceSinkOperator}} accounts for the case where a
{{FilterOperator}} runs on the reduce side of a Hive query over a column that
appears in neither the {{DISTINCT}} nor the {{GROUP BY}} clause.
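To make that concrete, here is a hypothetical, heavily simplified model of this emission behavior in plain Java (it is not the actual {{ReduceSinkOperator}} code; the column/tag naming just mimics the {{KEY._colN:tag}} convention from the plan above). The record it builds for tag 0 has the same shape as Record 1 and drops {{date_column}} entirely:
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctEmissionSketch {

  // Emit one key record per DISTINCT clause: the GROUP BY columns plus a
  // union tagged with the index of that DISTINCT clause. Any column outside
  // the clause and the GROUP BY (e.g. date_column for tag 0) is simply
  // absent from the emitted record.
  static List<Map<String, Object>> emit(Map<String, Object> row,
                                        List<String> groupByCols,
                                        List<List<String>> distinctClauses) {
    List<Map<String, Object>> records = new ArrayList<>();
    for (int tag = 0; tag < distinctClauses.size(); tag++) {
      Map<String, Object> key = new LinkedHashMap<>();
      int i = 0;
      for (String col : groupByCols) {
        key.put("_col" + (i++), row.get(col));
      }
      Map<String, Object> union = new LinkedHashMap<>();
      int j = 0;
      for (String col : distinctClauses.get(tag)) {
        union.put("_col" + (j++), row.get(col));
      }
      key.put("_col" + i + ":" + tag, union);
      records.add(key);
    }
    return records;
  }

  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("column_1", "value_1_1");
    row.put("column_2", "value_1_2");
    row.put("column_3", "value_1_3");
    row.put("date_column", "2016-01-22");
    // COUNT(DISTINCT column_3) is tag 0,
    // COUNT(DISTINCT date_column, column_3) is tag 1.
    for (Map<String, Object> record : emit(row,
        Arrays.asList("column_1", "column_2"),
        Arrays.asList(Arrays.asList("column_3"),
                      Arrays.asList("date_column", "column_3")))) {
      System.out.println(record);
    }
    // The tag-0 record contains no date_column anywhere, yet the reduce-side
    // FilterOperator still evaluates KEY._col2:1._col0 >= 2015-06-22 on it.
  }
}
{code}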
Exception 2 seems to be related to Exception 1. It is thrown in the
{{OpProcFactory.ReduceSinkLineage.process(...)}} method, which performs lineage
processing on the {{ReduceSinkOperator}} (the same class implicated in
Exception 1 above), so my guess is that the two failures share a root cause.
> Multi-Insert Query Fails with GROUP BY, DISTINCT, and WHERE clauses
> -------------------------------------------------------------------
>
> Key: HIVE-14082
> URL: https://issues.apache.org/jira/browse/HIVE-14082
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.1.0, 2.1.0
> Reporter: Sahil Takiar
>
> The following multi-insert query fails in Hive. I've listed the query
> required to reproduce this failure, as well as a few similar queries that
> work properly.
> Setup Queries:
> {code}
> DROP SCHEMA IF EXISTS multi_table_insert_bug CASCADE;
> CREATE SCHEMA multi_table_insert_bug;
> USE multi_table_insert_bug;
> DROP TABLE IF EXISTS multi_table_insert_source;
> DROP TABLE IF EXISTS multi_table_insert_test;
> CREATE TABLE multi_table_insert_source (
> date_column DATE,
> column_1 STRING,
> column_2 STRING,
> column_3 STRING,
> column_4 STRING
> );
> CREATE TABLE multi_table_insert_test (
> column_1 STRING,
> column_2 STRING,
> line_count INT,
> distinct_count_by_1_column INT,
> distinct_count_by_2_columns INT
> )
> PARTITIONED BY (partition_column INT);
> INSERT OVERWRITE TABLE multi_table_insert_source VALUES
> ('2016-01-22', 'value_1_1', 'value_1_2', 'value_1_3', 'value_1_4'),
> ('2016-01-22', 'value_2_1', 'value_2_2', 'value_2_3', 'value_2_4'),
> ('2016-01-22', 'value_3_1', 'value_3_2', 'value_3_3', 'value_3_4'),
> ('2016-01-22', 'value_4_1', 'value_4_2', 'value_4_3', 'value_4_4'),
> ('2016-01-22', 'value_5_1', 'value_5_2', 'value_5_3', 'value_5_4');
> {code}
> The following queries run successfully:
> *Query 1:*
> {code}
> FROM multi_table_insert_source
> INSERT OVERWRITE TABLE multi_table_insert_test PARTITION (partition_column = 365)
> SELECT
> column_1,
> column_2,
> COUNT(*) AS line_count,
> COUNT(DISTINCT column_3) AS distinct_count_by_1_column,
> COUNT(DISTINCT date_column, column_3) AS distinct_count_by_2_columns
> WHERE date_column >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()), 365)
> GROUP BY
> column_1,
> column_2;
> {code}
> *Query 2:*
> {code}
> FROM multi_table_insert_source
> INSERT OVERWRITE TABLE multi_table_insert_test PARTITION (partition_column = 365)
> SELECT
> column_1,
> column_2,
> COUNT(*) AS line_count,
> COUNT(DISTINCT column_3) AS distinct_count_by_1_column,
> COUNT(DISTINCT date_column, column_3) AS distinct_count_by_2_columns
> -- WHERE date_column >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()), 365)
> GROUP BY
> column_1,
> column_2
> INSERT OVERWRITE TABLE multi_table_insert_test PARTITION (partition_column = 1096)
> SELECT
> column_1,
> column_2,
> COUNT(*) AS line_count,
> COUNT(DISTINCT column_3) AS distinct_count_by_1_column,
> COUNT(DISTINCT date_column, column_3) AS distinct_count_by_2_columns
> -- WHERE date_column >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()), 1096)
> GROUP BY
> column_1,
> column_2;
> {code}
> The following query fails with a {{ClassCastException}}:
> *Query 3:*
> {code}
> FROM multi_table_insert_source
> INSERT OVERWRITE TABLE multi_table_insert_test PARTITION (partition_column = 365)
> SELECT
> column_1,
> column_2,
> COUNT(*) AS line_count,
> COUNT(DISTINCT column_3) AS distinct_count_by_1_column,
> COUNT(DISTINCT date_column, column_3) AS distinct_count_by_2_columns
> WHERE date_column >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()), 365)
> GROUP BY
> column_1,
> column_2
> INSERT OVERWRITE TABLE multi_table_insert_test PARTITION (partition_column = 1096)
> SELECT
> column_1,
> column_2,
> COUNT(*) AS line_count,
> COUNT(DISTINCT column_3) AS distinct_count_by_1_column,
> COUNT(DISTINCT date_column, column_3) AS distinct_count_by_2_columns
> WHERE date_column >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()), 1096)
> GROUP BY
> column_1,
> column_2;
> {code}
> Here is the full stack trace of the exception:
> *Exception 1:*
> {code}
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"_col0":"value_1_1","_col1":"value_1_2","_col2":{0:{"_col0":"value_1_3"}}},"value":null}
>     at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:257)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:506)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:449)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"_col0":"value_1_1","_col1":"value_1_2","_col2":{0:{"_col0":"value_1_3"}}},"value":null}
>     at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:245)
>     ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritable
>     at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableDateObjectInspector.getPrimitiveWritableObject(WritableDateObjectInspector.java:38)
>     at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:938)
>     at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:818)
>     at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.compare(ObjectInspectorUtils.java:809)
>     at org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqualOrGreaterThan.evaluate(GenericUDFOPEqualOrGreaterThan.java:141)
>     at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
>     at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
>     at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
>     at org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:112)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:878)
>     at org.apache.hadoop.hive.ql.exec.ForwardOperator.process(ForwardOperator.java:38)
>     at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:236)
>     ... 3 more
> {code}