[jira] [Updated] (HIVE-12412) Multi insert queries fail to run properly in hive 1.1.x or later.

Carl Steinbach (JIRA) Wed, 19 Oct 2016 10:54:21 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-12412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Carl Steinbach updated HIVE-12412:
----------------------------------
    Labels: Correctness CorrectnessBug  (was: )

> Multi insert queries fail to run properly in hive 1.1.x or later.
> -----------------------------------------------------------------
>
>                 Key: HIVE-12412
>                 URL: https://issues.apache.org/jira/browse/HIVE-12412
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.0
>            Reporter: John P. Petrakis
>              Labels: Correctness, CorrectnessBug
>
> We use multi insert queries to take data in one table and manipulate it by 
> inserting it into a results table.  Queries are of this form:
> from (select * from data_table lateral view explode(data_table.f2) f2 as 
> explode_f2) as explode_data_table  
>        insert overwrite table results_table partition (q_id='C.P1',rl='1') 
>        select 
>        array(cast(if(explode_data_table.f1 is null or 
> explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1) as 
> String),cast(explode_f2.s1 as String)) as dimensions, 
>        ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics, 
>        null as rownm 
>        where (explode_data_table.date_id between 20151016 and 20151016)
>        group by 
>        if(explode_data_table.f1 is null or explode_data_table.f1='', 
> 'UNKNOWN',explode_data_table.f1),
>        explode_f2.s1         
>        INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0') 
>        SELECT ARRAY(CAST('Total' as String),CAST('Total' as String)) AS 
> dimensions, 
>        ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
>        null AS rownm 
>        WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016) 
>        INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0') 
>        SELECT 
>        ARRAY(CAST('Total' as String)) AS dimensions, 
>        ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
>        null AS rownm 
>        WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
> This query is meant to total a given field of a struct that is potentially a 
> list of structs.  For our test data set, which consists of a single row, the 
> summation yields "Null",  with messages in the hive log of the nature:
> Missing fields! Expected 2 fields but only got 1! Ignoring similar problems.
> or "Extra fields detected..."
> For significantly more data, this query will eventually cause a run time 
> error while processing a column (caused by array index out of bounds 
> exception in one of the lazy binary classes such as LazyBinaryString or 
> LazyBinaryStruct).
> Using the query above from the hive command line, the following data was used:
> (note there are tabs in the data below)
> string one    one:1.0:1.00:10.0,eon:1.0:1.00:100.0
> string two    two:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0
> string thr    three:3.0:3.00:30.0
> string fou    four:4.0:4.00:40.0
> There are two fields, a string, (eg. 'string one') and a list of structs.  
> The following is used to create the table:
> create table if not exists t1 (
>                          f1 string, 
>                           f2 
> array<struct<s1:string,d1:double,d2:double,d3:double>>
>                          )
>                           partitioned by (clid string, date_id string) 
>                           row format delimited fields 
>                          terminated by '09' 
>                          collection items terminated by ',' 
>                          map keys terminated by ':'
>                          lines terminated by '10' 
>                          location '/user/hive/warehouse/t1';
> And the following is used to load the data:
> load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE  into 
> table t1  partition(client_id='987654321',date_id='20151016');
> The resulting table should yield the following:
> ["string fou","four"] [4.0]   null    C.P1    1       
> ["string one","eon"]  [1.0]   null    C.P1    1       
> ["string one","one"]  [1.0]   null    C.P1    1       
> ["string thr","three"]        [3.0]   null    C.P1    1       
> ["string two","otw"]  [2.0]   null    C.P1    1       
> ["string two","two"]  [2.0]   null    C.P1    1       
> ["string two","wott"] [2.0]   null    C.P1    1       
> ["Total","Total"]     [15.0]  null    C.P2    0       
> ["Total"]     [15.0]  null    C.P5    0       
> However what we get is:
> Hive Runtime Error while processing row 
> {"_col2":2.5306499719322744E-258,"_col3":""} (ultimately due to an array 
> index out of bounds exception)
> If we reduce the above data to a SINGLE row, the we don't get an exception 
> but the total fields come out as NULL.
> The ONLY way this query would work is 
> 1) if I added a group by (date_id) or even group by ('') as the last line in 
> the query... or removed the last where clause for the final insert.  (The 
> reason why we have a where clause on the initial select is due to push down 
> predicates not working... and causing a complete scan of the entire table).
> or 
> 2) removing the group by clauses entirely 
> or using
> 3) hive.multigroupby.singlereducer = false
> Once 3) is used no modification of the query is needed... it all runs.  This 
> is very similar in nature to the following:
> https://issues.apache.org/jira/browse/HIVE-2750
> which is supposedly fixed and closed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (HIVE-12412) Multi insert queries fail to run properly in hive 1.1.x or later.

Reply via email to