[
https://issues.apache.org/jira/browse/HIVE-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aditya Allamraju reassigned HIVE-17355:
---------------------------------------
Assignee: (was: Aditya Allamraju)
> Casting to Decimal along with UNION ALL gives incosistent results
> -----------------------------------------------------------------
>
> Key: HIVE-17355
> URL: https://issues.apache.org/jira/browse/HIVE-17355
> Project: Hive
> Issue Type: Bug
> Components: Parser, UDF
> Affects Versions: 2.1.0, 2.1.1
> Environment: CentOS 7.2
> Reporter: Aditya Allamraju
>
> Extra trailing zeros are added when running "union all" on the tables
> containing decimal data types.
> *Version:* Hive 2.1
> *Steps to repro:-*
> {code:java}
> 1) CREATE TABLE `decisample`(
> `a` decimal(8,2),
> `b` int,
> `c` decimal(5,2))
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
> 'maprfs:/user/hive/warehouse/decisample'
> 2) CREATE TABLE `decisample3`(
> `a` decimal(8,2),
> `b` int,
> `c` decimal(5,2))
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
> 'maprfs:/user/hive/warehouse/decisample3'
> 3)hive> select * from decisample3;
> OK
> 1.00 2 3.00
> 7.00 8 9.00
> 4)hive> select * from decisample;
> OK
> 4.00 5 6.00
> 5) query:-
> select a1.a, '' as a1b,'' as a1c from decisample a1 union all select
> a2.a,a2.b,a2.c from decisample3 a2;
> o/p:-
> OK
> 4.00 NULL
> 1.00 2 3.000000000000000000
> 7.00 8 9.000000000000000000
> Time taken: 87.993 seconds, Fetched: 3 row(s)
> 6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as
> a1b,'' as a1c from decisample a1;
> o/p:-
> 4.00
> 1.00 2 3
> 7.00 8 9
> {code}
> Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no
> trailing zero.
> Observation:
> 1. Hive is trying to run the UNION ALL after ensuring the SELECT's are
> semantically same(equal number of columns and same datatypes). To do this, it
> is implicitly type casting the values where required.
> From the explain plan, type casting is not consistent when done 2 different
> ways:
> a) select-1 UNION ALL select-2 (Query-5 in above comment)
> vs
> b) select-2 UNION ALL select-2 (Query-6 in above comment)
> Showing only the "expresssions" part of execution plans
> Query-5:
> ========
> {code:java}
> ..
> ..
> Map Operator Tree:
> TableScan
> alias: a1
> Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE
> Column stats: NONE
> Select Operator
> expressions: a (type: decimal(8,2)), '' (type: string), null
> (type: decimal(38,18))
> outputColumnNames: _col0, _col1, _col2
> ..
> ..
> TableScan
> alias: a2
> Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE
> Column stats: NONE
> Select Operator
> expressions: a (type: decimal(8,2)), UDFToString(b) (type:
> string), CAST( c AS decimal(38,18)) (type: decimal(38,18))
> {code}
> Query-6:
> ========
> {code:java}
> ..
> ..
> Map Operator Tree:
> TableScan
> alias: a2
> Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE
> Column stats: NONE
> Select Operator
> expressions: a (type: decimal(8,2)), UDFToString(b) (type:
> string), UDFToString(c) (type: string)
> ..
> ..
> TableScan
> alias: a1
> Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE
> Column stats: NONE
> Select Operator
> expressions: a (type: decimal(8,2)), '' (type: string), ''
> (type: string)
> ..
> ..
> {code}
> Attaching the execution plans for both queries for reference.
> 2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal.
> And by default, the precision and scale are taken as (38,18) in Hive. This
> could be the reason for 18 zeros.
> 3. This is repeating every time implicit type casting is happening on EMPTY
> strings.
> If excluding few columns in one of the SELECT statement is absolutely
> necessary, then the only Workaround is to explicitly type cast the empty
> strings to same Datatypes as the Other Select statement which included the
> columns.
> For ex:
> Q1:
> select a,b,c from decisample3
> union all
> select a,cast(' ' as int),cast(' ' as decimal) from decisample;
> Q2:
> select a,cast(' ' as int),cast(' ' as decimal) from decisample
> union all
> select a,b,c from decisample3;
> Both the above queries will give consistent result now.
> cast(' ' as int) ---> this was cast to INT, same as datatype of "b"
> cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c"
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)