Re: Dynamic Runtime Filter Question

John Russell Fri, 08 Apr 2016 11:54:07 -0700

Ken, what is the extra detail you were hoping to see in the docs?  More 
examples of SQL idioms that do/don't result in runtime filters?  (Like in your 
example, I could imagine being curious about INNER JOIN as opposed to other 
kinds of joins, ON clause with several ANDed predicates, and BETWEEN in the 
WHERE clause.)


We're planning to add more such examples in upcoming doc refreshes.  Just want 
to make sure we cover the kinds of areas you're curious about.  Once 5.7 is 
installed somewhere to try out, it's straightforward to verify for particular 
combinations of tables and queries using EXPLAIN and looking for the extra 
lines representing filters.

John

> On Apr 8, 2016, at 11:39 AM, Mostafa Mokhtar <[email protected]> wrote:
> 
> Hi Ken,
> 
> Yes the query provided should work, I validated that the query below works as 
> expected which is fairly similar to the one listed above.
> 
> select 
>     count(*)
> from
>     store_sales_mp,
>     date_dim
> where
>     ss_sold_date_sk = d_date_sk
>         and ss_store_sk = d_dow
>         and d_year = 1999
>         and d_moy = 1
> 
> 
> store_sales_mp is partitioned on two columns :
>   | # col_name                   | data_type                                  
>                                              | ss_sold_date_sk              | 
> bigint                                                                        
>            | ss_store_sk                  | bigint                            
>                                                        
> 
> 
> The query execution summary shows only a small number of rows from 
> store_sales_mp qualify the scan
> 
> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
> | Operator        | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                                     |
> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
> | 06:AGGREGATE    | 1      | 160.66ms | 160.66ms | 1      | 1          | 
> 88.00 KB | -1 B          | FINALIZE                                   |
> | 05:EXCHANGE     | 1      | 137.98us | 137.98us | 19     | 1          | 0 B  
>     | -1 B          | UNPARTITIONED                              |
> | 03:AGGREGATE    | 19     | 151.01ms | 183.65ms | 19     | 1          | 8.04 
> MB  | 10.00 MB      |                                            |
> | 02:HASH JOIN    | 19     | 101.27ms | 185.51ms | 4.82M  | 43.20B     | 2.02 
> MB  | 3.89 KB       | INNER JOIN, BROADCAST                      |
> | |--04:EXCHANGE  | 19     | 14.54us  | 25.36us  | 589    | 181        | 0 B  
>     | 0 B           | BROADCAST                                  |
> | |  01:SCAN HDFS | 1      | 20.39ms  | 20.39ms  | 31     | 181        | 1.82 
> MB  | 64.00 MB      | tpcds_15000_decimal_parquet.date_dim       |
> | 00:SCAN HDFS    | 19     | 37.54ms  | 81.18ms  | 45.00M | 43.20B     | 9.84 
> MB  | 0 B           | tpcds_15000_decimal_parquet.store_sales_mp |
> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
> 
> 
> 
> Finally from the query profile  
> 
>       HDFS_SCAN_NODE (id=0):(Total: 37.539ms, non-child: 37.539ms, % 
> non-child: 100.00%)
>          - AverageHdfsReadThreadConcurrency: 0.00 
>          - AverageScannerThreadConcurrency: 0.93 
>          - BytesRead: 489.47 KB (501221)
>          - BytesReadDataNodeCache: 0
>          - BytesReadLocal: 468.42 KB (479663)
>          - BytesReadRemoteUnexpected: 21.05 KB (21557)
>          - BytesReadShortCircuit: 468.42 KB (479663)
>          - DecompressionTime: 0.000ns
>          - MaxCompressedTextFileLength: 0
>          - NumColumns: 0 (0)
>          - NumDisksAccessed: 4 (4)
>          - NumRowGroups: 0 (0)
>          - NumScannerThreadsStarted: 5 (5)
>          - PeakMemoryUsage: 9.35 MB (9799572)
>          - PerReadThreadRawHdfsThroughput: 901.98 MB/sec
>          - RemoteScanRanges: 0 (0)
>          - RowsRead: 2.37M (2368521)
>          - RowsReturned: 2.37M (2368521)
>          - RowsReturnedRate: 62.23 M/sec
>          - ScanRangesComplete: 3.21K (3207)
>          - ScannerThreadsInvoluntaryContextSwitches: 26 (26)
>          - ScannerThreadsTotalWallClockTime: 420.999ms
>            - MaterializeTupleTime(*): 471.000ns
>            - ScannerThreadsSysTime: 82.142ms
>            - ScannerThreadsUserTime: 65.777ms
>          - ScannerThreadsVoluntaryContextSwitches: 5.41K (5413)
>          - TotalRawHdfsReadTime(*): 1.940ms
>          - TotalReadThroughput: 136.84 KB/sec
>         Filter 0:
>            - Files processed: 3.18K (3184)
>            - Files rejected: 3.13K (3130)
>            - Files total: 3.18K (3184)
>            - RowGroups processed: 0 (0)
>            - RowGroups rejected: 0 (0)
>            - RowGroups total: 0 (0)
>            - Rows processed: 0 (0)
>            - Rows rejected: 0 (0)
>            - Rows total: 0 (0)
>            - Splits processed: 4 (4)
>            - Splits rejected: 0 (0)
>            - Splits total: 4 (4)
>         Filter 1:
>            - Files processed: 54 (54)
>            - Files rejected: 49 (49)
>            - Files total: 54 (54)
>            - RowGroups processed: 0 (0)
>            - RowGroups rejected: 0 (0)
>            - RowGroups total: 0 (0)
>            - Rows processed: 0 (0)
>            - Rows rejected: 0 (0)
>            - Rows total: 0 (0)
>            - Splits processed: 4 (4)
>            - Splits rejected: 0 (0)
>            - Splits total: 4 (4)
> 
> 
> Thanks
> Mostafa 
> 
> 
> On Fri, Apr 8, 2016 at 10:04 AM, Ken Farmer <[email protected] 
> <mailto:[email protected]>> wrote:
> First off - congrats on this new release.  It looks like a lot of hard work 
> went into it.
> 
> I'm really excited about dynamic runtime filters, but have a question about 
> whether it would apply to the following scenario, which seems oddly missing 
> from the documentation.
> 
> Assume 2 tables:
> 1.  fact_ids table with 50 billion rows partitioned by year, month, day, 
> sensor_id
> 2.  dim_date dimension table with 500 rows, 1 row/day with date_id, and 
> date_name, year, month, day columns
> 
> Query that works fine today:
>     SELECT f.col1,
>            SUM(somecount)
>     FROM fact_ids f
>     WHERE f.year=2016 AND f.month=4 AND f.day=2
>     GROUP BY 1  ORDER BY 1    
> 
> 
> Query that I hope works with 5.7:
>      SELECT f.col1,
>             SUM(somecount)
>      FROM fact_ids f
>           INNER JOIN dim_date d
>               ON f.year   = d.year
>               AND f.month = d.month
>               AND f.day   = d.day
>      WHERE d.date_name BETWEEN '2016-03-02' AND '2016-04-01'
>      GROUP BY 1  ORDER BY 1
> 
> Can anyone confirm that 5.7 will perform dynamic runtime filtering on this 
> second example above?  
> 
> Anything else I should be aware of for cases like this?  Or limitations of 
> dynamic runtime filtering?
> 
> 
> Thanks!
> 
> 
> 
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Impala User" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected] 
> <mailto:[email protected]>.
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Impala User" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected] 
> <mailto:[email protected]>.

Re: Dynamic Runtime Filter Question

Reply via email to