[ https://issues.apache.org/jira/browse/HIVE-12412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carl Steinbach updated HIVE-12412: ---------------------------------- Labels: Correctness CorrectnessBug (was: ) > Multi insert queries fail to run properly in hive 1.1.x or later. > ----------------------------------------------------------------- > > Key: HIVE-12412 > URL: https://issues.apache.org/jira/browse/HIVE-12412 > Project: Hive > Issue Type: Bug > Affects Versions: 1.2.0, 1.1.0 > Reporter: John P. Petrakis > Labels: Correctness, CorrectnessBug > > We use multi insert queries to take data in one table and manipulate it by > inserting it into a results table. Queries are of this form: > from (select * from data_table lateral view explode(data_table.f2) f2 as > explode_f2) as explode_data_table > insert overwrite table results_table partition (q_id='C.P1',rl='1') > select > array(cast(if(explode_data_table.f1 is null or > explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1) as > String),cast(explode_f2.s1 as String)) as dimensions, > ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics, > null as rownm > where (explode_data_table.date_id between 20151016 and 20151016) > group by > if(explode_data_table.f1 is null or explode_data_table.f1='', > 'UNKNOWN',explode_data_table.f1), > explode_f2.s1 > INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0') > SELECT ARRAY(CAST('Total' as String),CAST('Total' as String)) AS > dimensions, > ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, > null AS rownm > WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016) > INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0') > SELECT > ARRAY(CAST('Total' as String)) AS dimensions, > ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, > null AS rownm > WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016) > This query is meant to total a given field of a struct that is potentially a > list of structs. For our test data set, which consists of a single row, the > summation yields "Null", with messages in the hive log of the nature: > Missing fields! Expected 2 fields but only got 1! Ignoring similar problems. > or "Extra fields detected..." > For significantly more data, this query will eventually cause a run time > error while processing a column (caused by array index out of bounds > exception in one of the lazy binary classes such as LazyBinaryString or > LazyBinaryStruct). > Using the query above from the hive command line, the following data was used: > (note there are tabs in the data below) > string one one:1.0:1.00:10.0,eon:1.0:1.00:100.0 > string two two:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0 > string thr three:3.0:3.00:30.0 > string fou four:4.0:4.00:40.0 > There are two fields, a string, (eg. 'string one') and a list of structs. > The following is used to create the table: > create table if not exists t1 ( > f1 string, > f2 > array<struct<s1:string,d1:double,d2:double,d3:double>> > ) > partitioned by (clid string, date_id string) > row format delimited fields > terminated by '09' > collection items terminated by ',' > map keys terminated by ':' > lines terminated by '10' > location '/user/hive/warehouse/t1'; > And the following is used to load the data: > load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE into > table t1 partition(client_id='987654321',date_id='20151016'); > The resulting table should yield the following: > ["string fou","four"] [4.0] null C.P1 1 > ["string one","eon"] [1.0] null C.P1 1 > ["string one","one"] [1.0] null C.P1 1 > ["string thr","three"] [3.0] null C.P1 1 > ["string two","otw"] [2.0] null C.P1 1 > ["string two","two"] [2.0] null C.P1 1 > ["string two","wott"] [2.0] null C.P1 1 > ["Total","Total"] [15.0] null C.P2 0 > ["Total"] [15.0] null C.P5 0 > However what we get is: > Hive Runtime Error while processing row > {"_col2":2.5306499719322744E-258,"_col3":""} (ultimately due to an array > index out of bounds exception) > If we reduce the above data to a SINGLE row, the we don't get an exception > but the total fields come out as NULL. > The ONLY way this query would work is > 1) if I added a group by (date_id) or even group by ('') as the last line in > the query... or removed the last where clause for the final insert. (The > reason why we have a where clause on the initial select is due to push down > predicates not working... and causing a complete scan of the entire table). > or > 2) removing the group by clauses entirely > or using > 3) hive.multigroupby.singlereducer = false > Once 3) is used no modification of the query is needed... it all runs. This > is very similar in nature to the following: > https://issues.apache.org/jira/browse/HIVE-2750 > which is supposedly fixed and closed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)