[ https://issues.apache.org/jira/browse/IMPALA-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong reassigned IMPALA-660: ------------------------------------ Assignee: (was: Alexander Behm) > Make rand() more non-deterministic > ---------------------------------- > > Key: IMPALA-660 > URL: https://issues.apache.org/jira/browse/IMPALA-660 > Project: IMPALA > Issue Type: Improvement > Components: Frontend > Affects Versions: Impala 1.2.1 > Reporter: John Russell > Priority: Minor > Labels: correctness, downgraded > > We document that rand() returns unpredictable values, unless preceded by a > call to rand(seed). My expectation from other DBMSes like MySQL and Oracle, > or by inferring from POSIX behavior, is that once the random number generator > is seeded, there would be an infinite stream of random values that would > stretch across multiple queries. However, in practice, Impala rand() resets > to the same sequence after each query: > [localhost:21000] > select rand()*34000 from store limit cast (rand()*34000 > as int); > Query: select rand()*34000 from store limit cast (rand()*34000 as int) > +-------------------+ > | rand() * 34000.0 | > +-------------------+ > | 16.03013650329324 | > | 20046.04365399389 | > | 15068.46292087271 | > | 2513.48005631635 | > | 15713.13279760682 | > | 22709.15090977641 | > | 28398.60192704881 | > | 5477.171718830788 | > | 16371.07374722654 | > | 18740.27370882233 | > | 16465.31711354168 | > | 13467.14456540865 | > +-------------------+ > Returned 12 row(s) in 0.24s > [localhost:21000] > select rand()*34000 from store limit cast (rand()*34000 > as int); > Query: select rand()*34000 from store limit cast (rand()*34000 as int) > +-------------------+ > | rand() * 34000.0 | > +-------------------+ > | 16.03013650329324 | > | 20046.04365399389 | > | 15068.46292087271 | > | 2513.48005631635 | > | 15713.13279760682 | > | 22709.15090977641 | > | 28398.60192704881 | > | 5477.171718830788 | > | 16371.07374722654 | > | 18740.27370882233 | > | 16465.31711354168 | > | 13467.14456540865 | > +-------------------+ > Returned 12 row(s) in 0.22s > And if rand() is called multiple times in the same query, it gives the same > value each time: > [localhost:21000] > select rand(), rand(), rand() from store; > Query: select rand(), rand(), rand() from store > +-----------------------+-----------------------+-----------------------+ > | rand() | rand() | rand() | > +-----------------------+-----------------------+-----------------------+ > | 0.0004714746030380365 | 0.0004714746030380365 | 0.0004714746030380365 | > | 0.5895895192351144 | 0.5895895192351144 | 0.5895895192351144 | > | 0.4431900859080209 | 0.4431900859080209 | 0.4431900859080209 | > | 0.0739258840093044 | 0.0739258840093044 | 0.0739258840093044 | > | 0.4621509646354946 | 0.4621509646354946 | 0.4621509646354946 | > | 0.6679162032287178 | 0.6679162032287178 | 0.6679162032287178 | > | 0.8352529978543767 | 0.8352529978543767 | 0.8352529978543767 | > | 0.1610932858479644 | 0.1610932858479644 | 0.1610932858479644 | > | 0.4815021690360746 | 0.4815021690360746 | 0.4815021690360746 | > | 0.5511845208477156 | 0.5511845208477156 | 0.5511845208477156 | > | 0.4842740327512259 | 0.4842740327512259 | 0.4842740327512259 | > | 0.3960924872179015 | 0.3960924872179015 | 0.3960924872179015 | > +-----------------------+-----------------------+-----------------------+ > Returned 12 row(s) in 0.23s > What I was expecting to happen was: > select rand(12345); > select rand() from t1 limit 100; > ... 100 random values ... > select rand() from t1 limit 100; > ... 100 different random values ... > select rand(), rand(), rand(); > ... 3 different random values ... > select rand(12345); > -- Then the sequence of rand() queries as above would give the same results > as before. > Otherwise, calling rand(seed) in a standalone query is kind of a no-op, it > has no effect on subsequent queries: > [localhost:21000] > select rand(12345); > Query: select rand(12345) > +--------------------+ > | rand(12345) | > +--------------------+ > | 0.4827902789613187 | > +--------------------+ > Returned 1 row(s) in 0.11s > [localhost:21000] > select rand(), rand(), rand(); > Query: select rand(), rand(), rand() > +-----------------------+-----------------------+-----------------------+ > | rand() | rand() | rand() | > +-----------------------+-----------------------+-----------------------+ > | 0.0004714746030380365 | 0.0004714746030380365 | 0.0004714746030380365 | > +-----------------------+-----------------------+-----------------------+ > Returned 1 row(s) in 0.11s > [localhost:21000] > select rand(23456); > Query: select rand(23456) > +--------------------+ > | rand(23456) | > +--------------------+ > | 0.2273406601638257 | > +--------------------+ > Returned 1 row(s) in 0.11s > [localhost:21000] > select rand(), rand(), rand(); > Query: select rand(), rand(), rand() > +-----------------------+-----------------------+-----------------------+ > | rand() | rand() | rand() | > +-----------------------+-----------------------+-----------------------+ > | 0.0004714746030380365 | 0.0004714746030380365 | 0.0004714746030380365 | > +-----------------------+-----------------------+-----------------------+ > Returned 1 row(s) in 0.12s > I hope what I'm asking for is not in contradiction to IMPALA-397, which seems > to be asking for a rand()-based value to be the same when referenced multiple > times through a query alias. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org