[jira] [Commented] (IMPALA-10252) Query returns less number of rows with run-time filtering on integer column in a subquery against functional_parquet schema

Tim Armstrong (Jira) Wed, 21 Oct 2020 12:35:50 -0700


    [ https://issues.apache.org/jira/browse/IMPALA-10252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218538#comment-17218538 ]


Tim Armstrong commented on IMPALA-10252:
----------------------------------------

Yeah, the runtime filter (RF000) in this plan shouldn't be there.

{noformat}
[localhost.EXAMPLE.COM:21050] default> explain select id, int_col, year, month
                                     >  from functional_parquet.alltypessmall s
                                     >  where s.int_col = (select count(*) from functional_parquet.alltypestiny t where s.id = t.id)
                                     >  order by id;
Query: explain select id, int_col, year, month
 from functional_parquet.alltypessmall s
 where s.int_col = (select count(*) from functional_parquet.alltypestiny t where s.id = t.id)
 order by id
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=71.96MB Threads=6                        |
| Per-Host Resource Estimates: Memory=308MB                                          |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| functional_parquet.alltypessmall, functional_parquet.alltypestiny                  |
|                                                                                    |
| PLAN-ROOT SINK                                                                     |
| |                                                                                  |
| 08:MERGING-EXCHANGE [UNPARTITIONED]                                                |
| |  order by: id ASC                                                                |
| |                                                                                  |
| 04:SORT                                                                            |
| |  order by: id ASC                                                                |
| |  row-size=16B cardinality=940                                                    |
| |                                                                                  |
| 03:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]                                        |
| |  hash predicates: s.id = t.id                                                    |
| |  other predicates: s.int_col = zeroifnull(count(*))                              |
| |  runtime filters: RF000 <- zeroifnull(count(*))                                  |
| |  row-size=28B cardinality=940                                                    |
| |                                                                                  |
| |--06:AGGREGATE [FINALIZE]                                                         |
| |  |  output: count:merge(*)                                                       |
| |  |  group by: t.id                                                               |
| |  |  row-size=12B cardinality=758                                                 |
| |  |                                                                               |
| |  05:EXCHANGE [HASH(t.id)]                                                        |
| |  |                                                                               |
| |  02:AGGREGATE [STREAMING]                                                        |
| |  |  output: count(*)                                                             |
| |  |  group by: t.id                                                               |
| |  |  row-size=12B cardinality=758                                                 |
| |  |                                                                               |
| |  01:SCAN HDFS [functional_parquet.alltypestiny t]                                |
| |     HDFS partitions=4/4 files=4 size=11.92KB                                     |
| |     row-size=4B cardinality=758                                                  |
| |                                                                                  |
| 07:EXCHANGE [HASH(s.id)]                                                           |
| |                                                                                  |
| 00:SCAN HDFS [functional_parquet.alltypessmall s]                                  |
|    HDFS partitions=4/4 files=4 size=14.78KB                                        |
|    runtime filters: RF000 -> s.int_col                                             |
|    row-size=16B cardinality=940                                                    |
+------------------------------------------------------------------------------------+
{noformat}

The subquery gets rewritten to this:
{noformat}
SELECT id, int_col, `year`, `month`
FROM functional_parquet.alltypessmall s 
   LEFT OUTER JOIN (SELECT count(*) `$c$1`, t.id `$c$2` FROM functional_parquet.alltypestiny t GROUP BY t.id) `$a$1` ON s.id = `$a$1`.`$c$2`
WHERE TRUE AND s.int_col = zeroifnull(`$a$1`.`$c$1`)
ORDER BY id ASC;
{noformat}
The rewritten query has the same correctness issue.
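
One way to see why RF000 is unsafe here, as a minimal Python sketch rather than Impala's actual implementation: the filter is built from zeroifnull(count(*)) over the build-side rows, and those rows only exist for ids present in alltypestiny. The value 0 therefore never enters the filter, even though unmatched probe rows are compared against 0 after the outer join. The table contents below are hypothetical stand-ins, not the real test data, and a plain set stands in for the real filter structure.

```python
from collections import Counter

# Hypothetical stand-ins for the two tables (not the real test data).
small = [(1, 1), (2, 2), (10, 0), (20, 0)]  # (id, int_col) rows of s
tiny_ids = [0, 1, 2, 3]                     # t.id values

# Build side: SELECT count(*), t.id FROM t GROUP BY t.id
counts = Counter(tiny_ids)


def left_outer_join(probe_rows):
    """LEFT OUTER JOIN on s.id = t.id, then s.int_col = zeroifnull(count)."""
    out = []
    for sid, int_col in probe_rows:
        cnt = counts.get(sid)  # None models the NULL of an unmatched row
        if int_col == (cnt if cnt is not None else 0):
            out.append((sid, int_col))
    return out


# Reference result: the probe-side scan is not filtered.
expected = left_outer_join(small)

# Plan with the runtime filter: RF000 holds only the counts that exist on
# the build side (assumption: modeled as an exact set), so 0 is never in it.
rf000 = set(counts.values())
filtered_scan = [row for row in small if row[1] in rf000]
actual = left_outer_join(filtered_scan)

print("without filter:", expected)
print("with RF000:    ", actual)
```

In this sketch the rows with int_col = 0 and no match in t are dropped by the scan before the join can pair them with a zeroifnull'd NULL, which mirrors the 1-row versus 12-row difference in the report.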

> Query returns less number of rows with run-time filtering on integer column 
> in a subquery against functional_parquet schema
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-10252
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10252
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Qifan Chen
>            Assignee: Tim Armstrong
>            Priority: Blocker
>              Labels: correctness
>
> During the work to address IMPALA-6628 (Use unqualified table references in 
> .test files run from test_queries.py), it was found that a query against the 
> functional_parquet database returns 1 row with run-time filtering enabled, 
> while the same query returns 12 rows when run-time filtering is turned off 
> or when it is run against the functional database. 
>  
>  
> {code:java}
> Query: --SET RUNTIME_FILTER_MODE=OFF;
> select id, int_col, year, month
>  from functional_parquet.alltypessmall s
>  where s.int_col = (select count(*) from functional_parquet.alltypestiny t where s.id = t.id)
>  order by id
> Query submitted at: 2020-10-18 12:41:15 (Coordinator: http://qifan-10229:25000)
> Query progress can be monitored at: http://qifan-10229:25000/query_plan?query_id=394a61d8f0002336:fd45e07300000000
> +----+---------+------+-------+
> | id | int_col | year | month |
> +----+---------+------+-------+
> | 1 | 1 | 2009 | 1 |
> +----+---------+------+-------+
> {code}
>  
>  
> {code:java}
> RUNTIME_FILTER_MODE set to OFF
> Query: select id, int_col, year, month 
>  from functional_parquet.alltypessmall s 
>  where s.int_col = (select count(*) from functional_parquet.alltypestiny t where s.id = t.id) 
>  order by id
> Query submitted at: 2020-10-18 12:40:58 (Coordinator: http://qifan-10229:25000)
> Query progress can be monitored at: http://qifan-10229:25000/query_plan?query_id=304c095f478607fc:7d2d03ff00000000
> +----+---------+------+-------+
> | id | int_col | year | month |
> +----+---------+------+-------+
> | 1 | 1 | 2009 | 1 |
> | 10 | 0 | 2009 | 1 |
> | 20 | 0 | 2009 | 1 |
> | 25 | 0 | 2009 | 2 |
> | 35 | 0 | 2009 | 2 |
> | 45 | 0 | 2009 | 2 |
> | 50 | 0 | 2009 | 3 |
> | 60 | 0 | 2009 | 3 |
> | 70 | 0 | 2009 | 3 |
> | 75 | 0 | 2009 | 4 |
> | 85 | 0 | 2009 | 4 |
> | 95 | 0 | 2009 | 4 |
> +----+---------+------+-------+{code}
>  
> Query against the functional database:
> {code:java}
> Query: select id, int_col, year, month 
>  from functional.alltypessmall s 
>  where s.int_col = (select count(*) from functional.alltypestiny t where s.id = t.id) 
>  order by id
> Query submitted at: 2020-10-18 12:35:24 (Coordinator: http://qifan-10229:25000)
> Query progress can be monitored at: http://qifan-10229:25000/query_plan?query_id=104bd5d7a6d5fe74:09a6c09000000000
> +----+---------+------+-------+
> | id | int_col | year | month |
> +----+---------+------+-------+
> | 1 | 1 | 2009 | 1 |
> | 10 | 0 | 2009 | 1 |
> | 20 | 0 | 2009 | 1 |
> | 25 | 0 | 2009 | 2 |
> | 35 | 0 | 2009 | 2 |
> | 45 | 0 | 2009 | 2 |
> | 50 | 0 | 2009 | 3 |
> | 60 | 0 | 2009 | 3 |
> | 70 | 0 | 2009 | 3 |
> | 75 | 0 | 2009 | 4 |
> | 85 | 0 | 2009 | 4 |
> | 95 | 0 | 2009 | 4 |
> +----+---------+------+-------+{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
