[jira] [Commented] (IMPALA-9877) uuid() generates multiple values per row when joined from a CTE

Tim Armstrong (Jira) Mon, 22 Jun 2020 08:55:05 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142213#comment-17142213
 ]


Tim Armstrong commented on IMPALA-9877:
---------------------------------------

[~koke] yeah it's not necessarily that well-defined when the random functions 
are evaluated when they're used in inline views. This is a side-effect of how 
we (effectively) inline the views into the rest of the query and then evaluate 
uuid() after doing the join. I think that's undesirable in this case and we 
don't have the special handling of random functions we'd need in inline views.

So you're not missing anything and it would be desirable to fix this.

You could work around in various ways, e.g. adding a union to the subquery 
(this exploits a quirk in the impala planner). You could also add a distinct to 
the subquery, 

Something like this should work.
{noformat}
with sessions as (
    select
        uid,
        uuid() session_id,
        min(ts) session_start
    from events
    group by uid
    union all
    select uid, '', ''
    from events
    limit 0
)
select
    session_id,
    events.*
from sessions join events on sessions.uid = events.uid;
{noformat}


This is the explain plan i get running on some tables I have locally:
{noformat}
explain select v.id, v.int_col, t.id, uuid from (select id, int_col, uuid() as 
uuid from functional_parquet.alltypestiny UNION ALL select 0, 0, '' from 
functional_parquet.alltypes limit 0) v join functional_parquet.alltypessmall t 
on v.int_col = t.int_col;
Query: explain select v.id, v.int_col, t.id, uuid from (select id, int_col, 
uuid() as uuid from functional_parquet.alltypestiny UNION ALL select 0, 0, '' 
from functional_parquet.alltypes limit 0) v join 
functional_parquet.alltypessmall t on v.int_col = t.int_col
+------------------------------------------------------------------------------------+
| Explain String                                                                
     |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=3.97MB Threads=6                    
     |
| Per-Host Resource Estimates: Memory=68MB                                      
     |
| WARNING: The following tables are missing relevant table and/or column 
statistics. |
| functional_parquet.alltypessmall, functional_parquet.alltypestiny             
     |
|                                                                               
     |
| PLAN-ROOT SINK                                                                
     |
| |                                                                             
     |
| 06:EXCHANGE [UNPARTITIONED]                                                   
     |
| |                                                                             
     |
| 03:HASH JOIN [INNER JOIN, PARTITIONED]                                        
     |
| |  hash predicates: int_col = t.int_col                                       
     |
| |  runtime filters: RF000 <- t.int_col                                        
     |
| |  row-size=28B cardinality=758                                               
     |
| |                                                                             
     |
| |--05:EXCHANGE [HASH(t.int_col)]                                              
     |
| |  |                                                                          
     |
| |  02:SCAN HDFS [functional_parquet.alltypessmall t]                          
     |
| |     HDFS partitions=4/4 files=4 size=14.78KB                                
     |
| |     row-size=8B cardinality=940                                             
     |
| |                                                                             
     |
| 04:EXCHANGE [HASH(int_col)]                                                   
     |
| |                                                                             
     |
| 00:UNION                                                                      
     |
| |  row-size=20B cardinality=758                                               
     |
| |                                                                             
     |
| 01:SCAN HDFS [functional_parquet.alltypestiny]                                
     |
|    HDFS partitions=4/4 files=4 size=11.92KB                                   
     |
|    runtime filters: RF000 -> functional_parquet.alltypestiny.int_col          
     |
|    row-size=8B cardinality=758                                                
     |
+------------------------------------------------------------------------------------+
Fetched 29 row(s) in 0.04s
{noformat}

> uuid() generates multiple values per row when joined from a CTE
> ---------------------------------------------------------------
>
>                 Key: IMPALA-9877
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9877
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.12.0
>         Environment: impalad version 2.12.0-cdh5.16.1 RELEASE (build 
> 4a3775ef6781301af81b23bca45a9faeca5e761d)
>            Reporter: Jorge Bernal
>            Priority: Major
>
> I'm not sure if I'm misunderstanding the execution model here, but I would 
> expect that when I use uuid() on a CTE, it returns a different value for each 
> row of the CTE, and not one value per row of the final join.  
> It's hard to explain, so here's an example:
>  
> {code:java}
> drop table if exists events;
> create table events (uid int, ts int);
> insert into events values (1,1), (2,2), (1,3);
> +-----+----+
> | uid | ts |
> +-----+----+
> | 1   | 1  |
> | 2   | 2  |
> | 1   | 3  |
> +-----+----+{code}
>  
> If you then use uuid() on a CTE, and join it with something else, the 
> resulting values might have a different uuid for the same input row:
> {code:java}
> with sessions as (
>     select
>         uid,
>         uuid() session_id,
>         min(ts) session_start
>     from events
>     group by uid
> )
> select
>     session_id,
>     events.*
> from sessions join events on sessions.uid = events.uid;
> drop table if exists koke_events;
> +--------------------------------------+-----+----+
> | session_id                           | uid | ts |
> +--------------------------------------+-----+----+
> | 71d3671a-fa97-4c00-8d90-d1584ad7cf18 | 2   | 2  |
> | af564c03-2db3-44cb-bf5b-7748e231af72 | 1   | 3  |
> | d4bc2779-1bd9-4a5e-a5e2-dc28b00d2f34 | 1   | 1  |
> +--------------------------------------+-----+----+
> {code}
> I would expect that rows with the same {{uid}} would have the same 
> {{session_id}} in this result.
> When I run the same query in Hive, that's the behavior:
> {code:java}
> +---------------------------------------+-------------+------------+--+
> |              session_id               | events.uid  | events.ts  |
> +---------------------------------------+-------------+------------+--+
> | e39555b2-c99b-4521-bfb4-9c4d0a5fd3cc  | 1           | 1          |
> | e39555b2-c99b-4521-bfb4-9c4d0a5fd3cc  | 1           | 3          |
> | 72f8a922-738e-4151-ba93-3b5e50eb9140  | 2           | 2          |
> +---------------------------------------+-------------+------------+--+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-9877) uuid() generates multiple values per row when joined from a CTE

Reply via email to