[jira] [Created] (IMPALA-13943) Add option to seed rand() with scan range information

Joe McDonnell (Jira) Tue, 08 Apr 2025 12:24:05 -0700

Joe McDonnell created IMPALA-13943:
--------------------------------------

             Summary: Add option to seed rand() with scan range information
                 Key: IMPALA-13943
                 URL: https://issues.apache.org/jira/browse/IMPALA-13943
             Project: IMPALA
          Issue Type: Task
          Components: Backend
    Affects Versions: Impala 5.0.0
            Reporter: Joe McDonnell



For conditions that use rand() in a scan node, rand()'s PRNG is restarted from 
the same state for each scan range. This means that every scan range can produce 
the same sequence of random numbers. For example:
{noformat}
create table randtest (i int);
-- Create multiple files with the same rows
insert into randtest values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
insert into randtest values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
insert into randtest values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
select i, count(*) from randtest where rand() < 0.5 group by i;

+----------------------------------+-----------------------+
| default.randtest.i (tid=1 sid=1) | count() (tid=1 sid=2) |
+----------------------------------+-----------------------+
| 4                                | 3                     |
| 6                                | 3                     |
| 5                                | 3                     |
| 8                                | 3                     |
| 1                                | 3                     |
| 3                                | 3                     |
+----------------------------------+-----------------------+{noformat}
Since each scan range gets the same sequence of random numbers from the 
PRNG, each scan range returns the same rows. If the draws were independent 
across scan ranges, the counts would differ between values and most of the 
values 1-10 would be expected to appear.
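
The effect can be reproduced outside Impala with a toy model. The sketch below is Python, not Impala backend code: FIXED_SEED, scan_range() and count_by_value() are hypothetical names, and Python's random.Random stands in for whatever PRNG rand() actually uses. It only illustrates that restarting a PRNG with one seed per scan range makes every range keep the same rows.

```python
import random

ROWS = list(range(1, 11))  # the ten values inserted into each file
NUM_FILES = 3              # toy model: one scan range per file
FIXED_SEED = 0             # hypothetical stand-in for rand()'s default seed


def scan_range(rows, seed):
    """Return the rows that pass `rand() < 0.5` in one scan range."""
    prng = random.Random(seed)  # PRNG restarted for every scan range
    return [i for i in rows if prng.random() < 0.5]


def count_by_value(per_range_results):
    """Emulate `select i, count(*) ... group by i` over all scan ranges."""
    counts = {}
    for result in per_range_results:
        for i in result:
            counts[i] = counts.get(i, 0) + 1
    return counts


same_seed_counts = count_by_value(
    [scan_range(ROWS, FIXED_SEED) for _ in range(NUM_FILES)]
)
# Every value that survives the filter does so in all three scan ranges.
print(same_seed_counts)
```

In this model every surviving value has a count equal to the number of files, which matches the shape of the query result above.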

One option is a mode that hashes the scan range information and uses the hash 
to seed the PRNG, so that different scan ranges draw different sequences. The 
result is still deterministic as long as the files do not change.
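
A minimal sketch of the hashing idea, again as a Python toy model rather than Impala code. It assumes a scan range can be identified by file path, offset and length. Those fields, the file paths, the sizes, and the names scan_range_seed() and select_rows() are all hypothetical; the issue does not say which scan range information would be hashed or which hash would be used.

```python
import hashlib
import random


def scan_range_seed(file_path, offset, length, user_seed=0):
    """Derive a 64-bit seed from (hypothetical) scan range fields."""
    key = f"{user_seed}|{file_path}|{offset}|{length}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "little")


def select_rows(rows, seed):
    """Return the rows that pass `rand() < 0.5` for one scan range."""
    prng = random.Random(seed)
    return [i for i in rows if prng.random() < 0.5]


rows = list(range(1, 11))
# Hypothetical scan ranges: (file path, offset, length), one per file.
ranges = [
    ("/warehouse/randtest/file_0.txt", 0, 21),
    ("/warehouse/randtest/file_1.txt", 0, 21),
    ("/warehouse/randtest/file_2.txt", 0, 21),
]

seeds = [scan_range_seed(*r) for r in ranges]
selected = [select_rows(rows, s) for s in seeds]
for r, kept in zip(ranges, selected):
    print(r[0], kept)
```

Because the seed depends only on the scan range description and the optional user seed, rerunning the query over unchanged files gives the same rows, while ranges over different files no longer share one sequence.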

Another option is to have a mode where rand() uses a random seed for true 
non-determinism.
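
For comparison, the non-deterministic mode could look like the following Python toy model, where each scan range takes a fresh seed from the operating system's entropy source. fresh_seed() and select_rows() are hypothetical names, and this is not Impala code.

```python
import random
import secrets


def fresh_seed():
    """Return an unpredictable 64-bit seed from the OS entropy source."""
    return secrets.randbits(64)


def select_rows(rows):
    """Return the rows that pass `rand() < 0.5`, seeded differently per call."""
    prng = random.Random(fresh_seed())
    return [i for i in rows if prng.random() < 0.5]


rows = list(range(1, 11))
per_range = [select_rows(rows) for _ in range(3)]  # three scan ranges
print(per_range)
```

The trade-off is that results are no longer repeatable between runs, even over unchanged files.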



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
