[ 
https://issues.apache.org/jira/browse/IMPALA-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-660:
------------------------------------

    Assignee:     (was: Alexander Behm)

> Make rand() more non-deterministic
> ----------------------------------
>
>                 Key: IMPALA-660
>                 URL: https://issues.apache.org/jira/browse/IMPALA-660
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 1.2.1
>            Reporter: John Russell
>            Priority: Minor
>              Labels: correctness, downgraded
>
> We document that rand() returns unpredictable values, unless preceded by a 
> call to rand(seed). My expectation from other DBMSes like MySQL and Oracle, 
> or by inferring from POSIX behavior, is that once the random number generator 
> is seeded, there would be an infinite stream of random values that would 
> stretch across multiple queries. However, in practice, Impala rand() resets 
> to the same sequence after each query:
> [localhost:21000] > select rand()*34000 from store limit cast (rand()*34000 
> as int);
> Query: select rand()*34000 from store limit cast (rand()*34000 as int)
> +-------------------+
> | rand() * 34000.0  |
> +-------------------+
> | 16.03013650329324 |
> | 20046.04365399389 |
> | 15068.46292087271 |
> | 2513.48005631635  |
> | 15713.13279760682 |
> | 22709.15090977641 |
> | 28398.60192704881 |
> | 5477.171718830788 |
> | 16371.07374722654 |
> | 18740.27370882233 |
> | 16465.31711354168 |
> | 13467.14456540865 |
> +-------------------+
> Returned 12 row(s) in 0.24s
> [localhost:21000] > select rand()*34000 from store limit cast (rand()*34000 
> as int);
> Query: select rand()*34000 from store limit cast (rand()*34000 as int)
> +-------------------+
> | rand() * 34000.0  |
> +-------------------+
> | 16.03013650329324 |
> | 20046.04365399389 |
> | 15068.46292087271 |
> | 2513.48005631635  |
> | 15713.13279760682 |
> | 22709.15090977641 |
> | 28398.60192704881 |
> | 5477.171718830788 |
> | 16371.07374722654 |
> | 18740.27370882233 |
> | 16465.31711354168 |
> | 13467.14456540865 |
> +-------------------+
> Returned 12 row(s) in 0.22s
> And if rand() is called multiple times in the same query, it gives the same 
> value each time:
> [localhost:21000] > select rand(), rand(), rand() from store;
> Query: select rand(), rand(), rand() from store
> +-----------------------+-----------------------+-----------------------+
> | rand()                | rand()                | rand()                |
> +-----------------------+-----------------------+-----------------------+
> | 0.0004714746030380365 | 0.0004714746030380365 | 0.0004714746030380365 |
> | 0.5895895192351144    | 0.5895895192351144    | 0.5895895192351144    |
> | 0.4431900859080209    | 0.4431900859080209    | 0.4431900859080209    |
> | 0.0739258840093044    | 0.0739258840093044    | 0.0739258840093044    |
> | 0.4621509646354946    | 0.4621509646354946    | 0.4621509646354946    |
> | 0.6679162032287178    | 0.6679162032287178    | 0.6679162032287178    |
> | 0.8352529978543767    | 0.8352529978543767    | 0.8352529978543767    |
> | 0.1610932858479644    | 0.1610932858479644    | 0.1610932858479644    |
> | 0.4815021690360746    | 0.4815021690360746    | 0.4815021690360746    |
> | 0.5511845208477156    | 0.5511845208477156    | 0.5511845208477156    |
> | 0.4842740327512259    | 0.4842740327512259    | 0.4842740327512259    |
> | 0.3960924872179015    | 0.3960924872179015    | 0.3960924872179015    |
> +-----------------------+-----------------------+-----------------------+
> Returned 12 row(s) in 0.23s
> What I was expecting to happen was:
> select rand(12345);
> select rand() from t1 limit 100;
> ... 100 random values ...
> select rand() from t1 limit 100;
> ... 100 different random values ...
> select rand(), rand(), rand();
> ... 3 different random values ...
> select rand(12345);
> -- Then the sequence of rand() queries as above would give the same results 
> as before.
> Otherwise, calling rand(seed) in a standalone query is kind of a no-op, it 
> has no effect on subsequent queries:
> [localhost:21000] > select rand(12345);
> Query: select rand(12345)
> +--------------------+
> | rand(12345)        |
> +--------------------+
> | 0.4827902789613187 |
> +--------------------+
> Returned 1 row(s) in 0.11s
> [localhost:21000] > select rand(), rand(), rand();
> Query: select rand(), rand(), rand()
> +-----------------------+-----------------------+-----------------------+
> | rand()                | rand()                | rand()                |
> +-----------------------+-----------------------+-----------------------+
> | 0.0004714746030380365 | 0.0004714746030380365 | 0.0004714746030380365 |
> +-----------------------+-----------------------+-----------------------+
> Returned 1 row(s) in 0.11s
> [localhost:21000] > select rand(23456);
> Query: select rand(23456)
> +--------------------+
> | rand(23456)        |
> +--------------------+
> | 0.2273406601638257 |
> +--------------------+
> Returned 1 row(s) in 0.11s
> [localhost:21000] > select rand(), rand(), rand();
> Query: select rand(), rand(), rand()
> +-----------------------+-----------------------+-----------------------+
> | rand()                | rand()                | rand()                |
> +-----------------------+-----------------------+-----------------------+
> | 0.0004714746030380365 | 0.0004714746030380365 | 0.0004714746030380365 |
> +-----------------------+-----------------------+-----------------------+
> Returned 1 row(s) in 0.12s
> I hope what I'm asking for is not in contradiction to IMPALA-397, which seems 
> to be asking for a rand()-based value to be the same when referenced multiple 
> times through a query alias.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to