Joe McDonnell created IMPALA-13943:
--------------------------------------
Summary: Add option to seed rand() with scan range information
Key: IMPALA-13943
URL: https://issues.apache.org/jira/browse/IMPALA-13943
Project: IMPALA
Issue Type: Task
Components: Backend
Affects Versions: Impala 5.0.0
Reporter: Joe McDonnell
For conditions that use rand() in a scan node, rand()'s PRNG is re-initialized
for each scan range. This means that every scan range can produce the same
sequence of random numbers. For example:
{noformat}
create table randtest (i int);
-- Create multiple files with the same rows
insert into randtest values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
insert into randtest values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
insert into randtest values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
select i, count(*) from randtest where rand() < 0.5 group by i;
+----------------------------------+-----------------------+
| default.randtest.i (tid=1 sid=1) | count() (tid=1 sid=2) |
+----------------------------------+-----------------------+
| 4                                | 3                     |
| 6                                | 3                     |
| 5                                | 3                     |
| 8                                | 3                     |
| 1                                | 3                     |
| 3                                | 3                     |
+----------------------------------+-----------------------+
{noformat}
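Since each scan range gets the same sequence of random numbers from the PRNG,
each scan range returns the same values, so every returned value has a count of
3. If the draws were independent across the three scan ranges, each value would
have a 7/8 chance of being returned at least once, so most of the values 1-10
would appear and their counts would vary.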
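The effect can be reproduced outside Impala with a small simulation. This is a Python sketch, not Impala code: FIXED_SEED is a hypothetical stand-in for whatever seed rand() starts from, and each of the three inserted files is assumed to map to one scan range.

```python
import random
from collections import Counter

ROWS = list(range(1, 11))  # the ten rows stored in each file
NUM_SCAN_RANGES = 3        # assumption: one scan range per inserted file
FIXED_SEED = 0             # hypothetical stand-in for rand()'s starting seed


def scan(rows, seed):
    """Evaluate `rand() < 0.5` over one scan range with a freshly seeded PRNG."""
    rng = random.Random(seed)
    return [i for i in rows if rng.random() < 0.5]


counts = Counter()
for _ in range(NUM_SCAN_RANGES):
    # Every scan range restarts from the same seed, so it keeps the same rows.
    counts.update(scan(ROWS, FIXED_SEED))

for value, count in sorted(counts.items()):
    print(value, count)
```

Every value that survives the filter is counted NUM_SCAN_RANGES times and the remaining values never appear, which matches the shape of the query output above.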
One option is to have a mode that hashes the scan range information and uses
the hash to seed the PRNG, so that different scan ranges draw different
sequences. This is still deterministic as long as the files do not change.
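A minimal sketch of the hashing idea, again as a Python simulation rather than Impala code. The ScanRange fields (path, offset, length), the file names, and the choice of SHA-256 are all assumptions for illustration; the actual scan range descriptor and hash function would be whatever the backend already provides.

```python
import hashlib
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRange:
    # Hypothetical fields; the real scan range descriptor may differ.
    path: str
    offset: int
    length: int


def seed_for(scan_range):
    """Derive a stable 64-bit seed from the scan range's identity."""
    key = f"{scan_range.path}:{scan_range.offset}:{scan_range.length}".encode()
    # Use a content hash rather than Python's hash(), which is salted per process.
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "little")


def scan(rows, scan_range):
    """Evaluate `rand() < 0.5` with a PRNG seeded from the scan range."""
    rng = random.Random(seed_for(scan_range))
    return [i for i in rows if rng.random() < 0.5]


rows = list(range(1, 11))
# Hypothetical file names and sizes, one scan range per file.
ranges = [ScanRange(f"randtest/file{n}.txt", 0, 21) for n in range(3)]

first_run = [scan(rows, r) for r in ranges]
second_run = [scan(rows, r) for r in ranges]
```

Each scan range gets its own seed, and rerunning the scan over the same ranges reproduces the same selection. If a file is rewritten so that its path, offset, or length changes, the seed changes with it.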
Another option is to have a mode where rand() uses a random seed for true
non-determinism.
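For comparison, a sketch of the non-deterministic mode in the same simulation style. Drawing the seed from the operating system's entropy source via Python's secrets module is an assumption made for illustration, not a statement about how Impala would obtain it.

```python
import random
import secrets


def scan_nondeterministic(rows):
    """Evaluate `rand() < 0.5` with a seed drawn fresh for every scan range."""
    seed = secrets.randbits(64)  # new OS-provided entropy on each call
    rng = random.Random(seed)
    return [i for i in rows if rng.random() < 0.5]


selected = scan_nondeterministic(list(range(1, 11)))
```

The trade-off is that the same query over the same files can return different rows on every run, so results cannot be reproduced.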