Janaki Lahorani created HIVE-19889:
--------------------------------------
Summary: Wrong results due to PPD of non deterministic functions
with CBO
Key: HIVE-19889
URL: https://issues.apache.org/jira/browse/HIVE-19889
Project: Hive
Issue Type: Bug
Reporter: Janaki Lahorani
Assignee: Janaki Lahorani
The following query can give wrong results when CBO is on:
select * from (
select part1,randum123
from (SELECT *, cast(rand() as double) AS randum123 FROM testA where part1='CA'
and part2 = 'ABC') a
where randum123 <= 0.5) s where s.randum123 > 0.25 limit 20;
The plan of the query is as follows:
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: testa
Statistics: Num rows: 2 Data size: 4580 Basic stats: COMPLETE
Column stats: NONE
Filter Operator
predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: 'CA' (type: string), rand() (type: double)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE
Column stats: NONE
Limit
Number of rows: 20
Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 2290 Basic stats:
COMPLETE Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: 20
Processor Tree:
ListSink
The relevant part in the plan is the filter:
Filter Operator
predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
The predicates s.randum123 > 0.25 and s.randum123 > 0.25 were pushed down. And
randum123 was resolved to rand(). This is bad because it will result in
invocation of rand() two times and rand() UDF is non-deterministic. Both the
rand calls can generate values that can satisfy the predicates independently,
but not together, whereas the original intention of the query is to give
results when rand falls between 0.25 and 0.5.
A sample result:
CA 0.9191984370369802
CA 0.397933021566812
where the condition was not satisfied.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)