> On Apr 8, 2016, at 12:06 PM, Ken Farmer <[email protected]> wrote:
> 
> > Ken, what is the extra detail you were hoping to see in the docs?  
> > More examples of SQL idioms that do/don't result in runtime filters? 
> 
> Yep, you got it: because my partitioning depends on multiple columns, 
> typically used together, I really need to use a join rather than a 
> subquery.
> 
> And at this point, assuming that the dynamic runtime filters work well and 
> consistently, I'll probably shift our standard partitioning columns away from 
> year, month, day to date_id - a foreign key into the dim_date dimension 
> table.  Unless anyone can recommend a reason not to.  

A couple of thoughts on that last point, about changing the partition scheme and 
joining on a single string column.  (From the book-learning perspective, since 
I don't run daily reports myself and so tend not to partition so granularly.)

Sounds overall like a reasonable idea, and I'm happy that the notion of runtime 
filtering opens this up as a possibility for you.
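For concreteness, here is what the join might collapse to under that scheme. This is a hypothetical sketch: it assumes a single date_id partition column on the fact table and a matching date_id key in dim_date, neither of which exists in your current schema.

```sql
-- Hypothetical: fact_ids partitioned by a single date_id column
-- (plus sensor_id), joined to dim_date on one key.
SELECT f.col1,
       SUM(somecount)
FROM fact_ids f
     INNER JOIN dim_date d
         ON f.date_id = d.date_id
WHERE d.date_name BETWEEN '2016-03-02' AND '2016-04-01'
GROUP BY 1
ORDER BY 1;
```

With a single equality predicate on the partition key, the runtime filter built from the dim_date scan has only one column to hash, which should keep the filter cheap to build and apply.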

I guess you won't consolidate any more data within each partition, since there 
will still be the same number of 'days' in each case.  I'm always looking for 
the opportunity to take partition-by-day tables and switch to 
partition-by-month, to try to get decent-sized files in each partition.  Sounds 
like you have 100M rows / day, but also divided by sensor.  How many rows in a 
typical partition?

I've heard many times that integers are more efficient than strings for joins, 
GROUP BY, and so on.  I wonder if it would be faster and more memory-efficient 
overall to encode the day ID as an INT, e.g. 20160408, then decompose/reformat 
it only at the end for reporting purposes, or keep nicely formatted string 
representations in the dim_date dimension table.  That is an experiment I 
haven't tried with enough volume to say for sure.
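That experiment can at least be sketched.  Assuming Impala's year()/month()/day() built-ins and the DIV/% integer operators, and using hypothetical names (day_id, some_fact) that aren't from this thread:

```sql
-- Encode year/month/day into one INT, e.g. 2016-04-08 -> 20160408:
--   year * 10000 + month * 100 + day
SELECT year(now()) * 10000 + month(now()) * 100 + day(now()) AS day_id;

-- Decompose only at the end, for reporting:
SELECT day_id DIV 10000       AS yr,   -- 20160408 -> 2016
       (day_id DIV 100) % 100 AS mo,   -- 20160408 -> 4
       day_id % 100           AS dy    -- 20160408 -> 8
FROM some_fact;
```

A nice property of this encoding is that the INT values sort in date order, so BETWEEN-style range predicates still behave as expected on the encoded column.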

John

> On Fri, Apr 8, 2016 at 12:53 PM, John Russell <[email protected]> wrote:
> Ken, what is the extra detail you were hoping to see in the docs?  More 
> examples of SQL idioms that do/don't result in runtime filters?  (Like in 
> your example, I could imagine being curious about INNER JOIN as opposed to 
> other kinds of joins, ON clause with several ANDed predicates, and BETWEEN in 
> the WHERE clause.)  
> 
> We're planning to add more such examples in upcoming doc refreshes.  Just 
> want to make sure we cover the kinds of areas you're curious about.  Once 5.7 
> is installed somewhere to try out, it's straightforward to verify for 
> particular combinations of tables and queries using EXPLAIN and looking for 
> the extra lines representing filters.
> 
> John
> 
> 
>> On Apr 8, 2016, at 11:39 AM, Mostafa Mokhtar <[email protected]> wrote:
>> 
>> Hi Ken,
>> 
>> Yes, the query provided should work.  I validated that the query below, 
>> which is fairly similar to the one listed above, works as expected.
>> 
>> select 
>>     count(*)
>> from
>>     store_sales_mp,
>>     date_dim
>> where
>>     ss_sold_date_sk = d_date_sk
>>         and ss_store_sk = d_dow
>>         and d_year = 1999
>>         and d_moy = 1
>> 
>> 
>> store_sales_mp is partitioned on two columns:
>> 
>>   | # col_name      | data_type |
>>   | ss_sold_date_sk | bigint    |
>>   | ss_store_sk     | bigint    |
>> 
>> 
>> The query execution summary shows that only a small number of rows from 
>> store_sales_mp survive the scan:
>> 
>> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
>> | Operator        | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                                     |
>> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
>> | 06:AGGREGATE    | 1      | 160.66ms | 160.66ms | 1      | 1          | 88.00 KB | -1 B          | FINALIZE                                   |
>> | 05:EXCHANGE     | 1      | 137.98us | 137.98us | 19     | 1          | 0 B      | -1 B          | UNPARTITIONED                              |
>> | 03:AGGREGATE    | 19     | 151.01ms | 183.65ms | 19     | 1          | 8.04 MB  | 10.00 MB      |                                            |
>> | 02:HASH JOIN    | 19     | 101.27ms | 185.51ms | 4.82M  | 43.20B     | 2.02 MB  | 3.89 KB       | INNER JOIN, BROADCAST                      |
>> | |--04:EXCHANGE  | 19     | 14.54us  | 25.36us  | 589    | 181        | 0 B      | 0 B           | BROADCAST                                  |
>> | |  01:SCAN HDFS | 1      | 20.39ms  | 20.39ms  | 31     | 181        | 1.82 MB  | 64.00 MB      | tpcds_15000_decimal_parquet.date_dim       |
>> | 00:SCAN HDFS    | 19     | 37.54ms  | 81.18ms  | 45.00M | 43.20B     | 9.84 MB  | 0 B           | tpcds_15000_decimal_parquet.store_sales_mp |
>> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
>> 
>> 
>> 
>> Finally, from the query profile:
>> 
>>       HDFS_SCAN_NODE (id=0):(Total: 37.539ms, non-child: 37.539ms, % 
>> non-child: 100.00%)
>>          - AverageHdfsReadThreadConcurrency: 0.00 
>>          - AverageScannerThreadConcurrency: 0.93 
>>          - BytesRead: 489.47 KB (501221)
>>          - BytesReadDataNodeCache: 0
>>          - BytesReadLocal: 468.42 KB (479663)
>>          - BytesReadRemoteUnexpected: 21.05 KB (21557)
>>          - BytesReadShortCircuit: 468.42 KB (479663)
>>          - DecompressionTime: 0.000ns
>>          - MaxCompressedTextFileLength: 0
>>          - NumColumns: 0 (0)
>>          - NumDisksAccessed: 4 (4)
>>          - NumRowGroups: 0 (0)
>>          - NumScannerThreadsStarted: 5 (5)
>>          - PeakMemoryUsage: 9.35 MB (9799572)
>>          - PerReadThreadRawHdfsThroughput: 901.98 MB/sec
>>          - RemoteScanRanges: 0 (0)
>>          - RowsRead: 2.37M (2368521)
>>          - RowsReturned: 2.37M (2368521)
>>          - RowsReturnedRate: 62.23 M/sec
>>          - ScanRangesComplete: 3.21K (3207)
>>          - ScannerThreadsInvoluntaryContextSwitches: 26 (26)
>>          - ScannerThreadsTotalWallClockTime: 420.999ms
>>            - MaterializeTupleTime(*): 471.000ns
>>            - ScannerThreadsSysTime: 82.142ms
>>            - ScannerThreadsUserTime: 65.777ms
>>          - ScannerThreadsVoluntaryContextSwitches: 5.41K (5413)
>>          - TotalRawHdfsReadTime(*): 1.940ms
>>          - TotalReadThroughput: 136.84 KB/sec
>>         Filter 0:
>>            - Files processed: 3.18K (3184)
>>            - Files rejected: 3.13K (3130)
>>            - Files total: 3.18K (3184)
>>            - RowGroups processed: 0 (0)
>>            - RowGroups rejected: 0 (0)
>>            - RowGroups total: 0 (0)
>>            - Rows processed: 0 (0)
>>            - Rows rejected: 0 (0)
>>            - Rows total: 0 (0)
>>            - Splits processed: 4 (4)
>>            - Splits rejected: 0 (0)
>>            - Splits total: 4 (4)
>>         Filter 1:
>>            - Files processed: 54 (54)
>>            - Files rejected: 49 (49)
>>            - Files total: 54 (54)
>>            - RowGroups processed: 0 (0)
>>            - RowGroups rejected: 0 (0)
>>            - RowGroups total: 0 (0)
>>            - Rows processed: 0 (0)
>>            - Rows rejected: 0 (0)
>>            - Rows total: 0 (0)
>>            - Splits processed: 4 (4)
>>            - Splits rejected: 0 (0)
>>            - Splits total: 4 (4)
>> 
>> 
>> Thanks
>> Mostafa 
>> 
>> 
>> On Fri, Apr 8, 2016 at 10:04 AM, Ken Farmer <[email protected]> wrote:
>> First off - congrats on this new release.  It looks like a lot of hard work 
>> went into it.
>> 
>> I'm really excited about dynamic runtime filters, but have a question about 
>> whether it would apply to the following scenario, which seems oddly missing 
>> from the documentation.
>> 
>> Assume 2 tables:
>> 1.  fact_ids table with 50 billion rows, partitioned by year, month, day, 
>> sensor_id
>> 2.  dim_date dimension table with 500 rows, 1 row per day, with date_id, 
>> date_name, year, month, and day columns
>> 
>> Query that works fine today:
>>     SELECT f.col1,
>>            SUM(somecount)
>>     FROM fact_ids f
>>     WHERE f.year=2016 AND f.month=4 AND f.day=2
>>     GROUP BY 1  ORDER BY 1    
>> 
>> 
>> Query that I hope works with 5.7:
>>      SELECT f.col1,
>>             SUM(somecount)
>>      FROM fact_ids f
>>           INNER JOIN dim_date d
>>               ON f.year   = d.year
>>               AND f.month = d.month
>>               AND f.day   = d.day
>>      WHERE d.date_name BETWEEN '2016-03-02' AND '2016-04-01'
>>      GROUP BY 1  ORDER BY 1
>> 
>> Can anyone confirm that 5.7 will perform dynamic runtime filtering on this 
>> second example above?  
>> 
>> Anything else I should be aware of for cases like this?  Or limitations of 
>> dynamic runtime filtering?
>> 
>> 
>> Thanks!
>> 
>> 
>> 
>> 
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Impala User" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] 
>> <mailto:[email protected]>.
>> 
