Yida Wu created IMPALA-14146:
--------------------------------

             Summary: Incorrect rand() evaluation behavior in where condition
                 Key: IMPALA-14146
                 URL: https://issues.apache.org/jira/browse/IMPALA-14146
             Project: IMPALA
          Issue Type: Bug
    Affects Versions: Impala 4.5.0
            Reporter: Yida Wu


We've observed cases where rand() is re-evaluated multiple times within 
predicates, which may result in incorrect query results.
For example:
{code:java}
Create table test1 (a int);
Insert into test1 values (1), (1), (1), (1);
{code}
Query:
{code:java}
select * from (select rand(1) as rd from test1) t1 where rd > 0.4 and rd < 0.4;
{code}
Expected: No rows should match, as no value can be both greater than and less 
than 0.4.
Actual Results:
{code:java}
+---------------------+
| rd                  |
+---------------------+
| 0.41702283693685577 |
+---------------------+
Fetched 1 row(s) in 0.12s
{code}
From the log I added and the plan, I can see that rand() is evaluated twice for 
rd. However, even with two different values generated, it remains unclear how 
the overall condition evaluates to TRUE, as both values are larger than 0.4.
{code:java}
4081:I20250613 17:58:57.296629 23889 math-functions-ir.cc:207] e449dc09337fadec:1fb3575500000001] Rand() generated value: 0.417023
4082:I20250613 17:58:57.296660 23889 math-functions-ir.cc:207] e449dc09337fadec:1fb3575500000001] Rand() generated value: 0.458344
{code}
{code:java}
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1
PLAN-ROOT SINK
|  output exprs: rand(CAST(1 AS BIGINT))
|  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0
|
01:EXCHANGE [UNPARTITIONED]
|  mem-estimate=16.00KB mem-reservation=0B thread-reservation=0
|  tuple-ids=0 row-size=0B cardinality=1
|  in pipelines: 00(GETNEXT)
|
F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
Per-Host Resources: mem-estimate=32.02MB mem-reservation=8.00KB thread-reservation=2
00:SCAN HDFS [default.test1, RANDOM]
   HDFS partitions=1/1 files=1 size=8B
   predicates: rand(CAST(1 AS BIGINT)) < CAST(0.4 AS DOUBLE), rand(CAST(1 AS BIGINT)) > CAST(0.4 AS DOUBLE)
   stored statistics:
     table: rows=unavailable size=unavailable
     columns: all
   extrapolated-rows=disabled max-scan-range-rows=unavailable
{code}
In summary, rand() appears to be evaluated multiple times within the 
predicates. The value should be computed only once, as part of the output 
expression, rather than being re-evaluated as a function call in each 
predicate. There may also be additional issues contributing to the incorrect 
results, since both logged values are larger than 0.4; this could be related 
to IMPALA-14145.
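
To illustrate why per-predicate re-evaluation matters, here is a minimal Python sketch that contrasts the two evaluation shapes. It is only a model, not Impala code: it uses Python's random module rather than Impala's generator, and the function names inlined_filter and materialized_filter are hypothetical. It also does not explain the logged case above, where both generated values are larger than 0.4.

```python
import random


def inlined_filter(num_rows, seed, threshold):
    """Model of the observed plan shape: each reference to rd draws a new value."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_rows):
        first = rng.random()   # draw used by `rd > threshold`
        second = rng.random()  # separate draw used by `rd < threshold`
        if first > threshold and second < threshold:
            kept.append((first, second))
    return kept


def materialized_filter(num_rows, seed, threshold):
    """Model of the expected shape: rd is computed once per row and reused."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_rows):
        rd = rng.random()
        # The same value is compared twice, so the condition is contradictory.
        if rd > threshold and rd < threshold:
            kept.append(rd)
    return kept


if __name__ == "__main__":
    print("inlined rows kept:", len(inlined_filter(1000, 1, 0.4)))
    print("materialized rows kept:", len(materialized_filter(1000, 1, 0.4)))
```

In this model the materialized variant can never keep a row, while the inlined variant keeps any row whose first draw is above the threshold and whose second draw is below it.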


