Fantastic - that's just what I needed in order to prioritize a full test and analysis of 5.7 this month.
Thanks again,
Ken

On Fri, Apr 8, 2016 at 12:39 PM, Mostafa Mokhtar <[email protected]> wrote:

> Hi Ken,
>
> Yes, the query you provided should work. I validated that the query below,
> which is fairly similar to the one listed above, works as expected.
>
>     select
>         count(*)
>     from
>         store_sales_mp,
>         date_dim
>     where
>         ss_sold_date_sk = d_date_sk
>         and ss_store_sk = d_dow
>         and d_year = 1999
>         and d_moy = 1
>
> store_sales_mp is partitioned on two columns:
>
>     | # col_name      | data_type |
>     | ss_sold_date_sk | bigint    |
>     | ss_store_sk     | bigint    |
>
> The query execution summary shows that only a small number of rows from
> store_sales_mp pass the scan:
>
> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
> | Operator        | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                                     |
> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
> | 06:AGGREGATE    | 1      | 160.66ms | 160.66ms | 1      | 1          | 88.00 KB | -1 B          | FINALIZE                                   |
> | 05:EXCHANGE     | 1      | 137.98us | 137.98us | 19     | 1          | 0 B      | -1 B          | UNPARTITIONED                              |
> | 03:AGGREGATE    | 19     | 151.01ms | 183.65ms | 19     | 1          | 8.04 MB  | 10.00 MB      |                                            |
> | 02:HASH JOIN    | 19     | 101.27ms | 185.51ms | 4.82M  | 43.20B     | 2.02 MB  | 3.89 KB       | INNER JOIN, BROADCAST                      |
> | |--04:EXCHANGE  | 19     | 14.54us  | 25.36us  | 589    | 181        | 0 B      | 0 B           | BROADCAST                                  |
> | |  01:SCAN HDFS | 1      | 20.39ms  | 20.39ms  | 31     | 181        | 1.82 MB  | 64.00 MB      | tpcds_15000_decimal_parquet.date_dim       |
> | 00:SCAN HDFS    | 19     | 37.54ms  | 81.18ms  | 45.00M | 43.20B     | 9.84 MB  | 0 B           | tpcds_15000_decimal_parquet.store_sales_mp |
> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
>
> Finally, from the query profile:
>
>     HDFS_SCAN_NODE (id=0): (Total: 37.539ms, non-child: 37.539ms, % non-child: 100.00%)
>       - AverageHdfsReadThreadConcurrency: 0.00
>       - AverageScannerThreadConcurrency: 0.93
>       - BytesRead: 489.47 KB (501221)
>       - BytesReadDataNodeCache: 0
>       - BytesReadLocal: 468.42 KB (479663)
>       - BytesReadRemoteUnexpected: 21.05 KB (21557)
>       - BytesReadShortCircuit: 468.42 KB (479663)
>       - DecompressionTime: 0.000ns
>       - MaxCompressedTextFileLength: 0
>       - NumColumns: 0 (0)
>       - NumDisksAccessed: 4 (4)
>       - NumRowGroups: 0 (0)
>       - NumScannerThreadsStarted: 5 (5)
>       - PeakMemoryUsage: 9.35 MB (9799572)
>       - PerReadThreadRawHdfsThroughput: 901.98 MB/sec
>       - RemoteScanRanges: 0 (0)
>       - RowsRead: 2.37M (2368521)
>       - RowsReturned: 2.37M (2368521)
>       - RowsReturnedRate: 62.23 M/sec
>       - ScanRangesComplete: 3.21K (3207)
>       - ScannerThreadsInvoluntaryContextSwitches: 26 (26)
>       - ScannerThreadsTotalWallClockTime: 420.999ms
>         - MaterializeTupleTime(*): 471.000ns
>         - ScannerThreadsSysTime: 82.142ms
>         - ScannerThreadsUserTime: 65.777ms
>       - ScannerThreadsVoluntaryContextSwitches: 5.41K (5413)
>       - TotalRawHdfsReadTime(*): 1.940ms
>       - TotalReadThroughput: 136.84 KB/sec
>       Filter 0:
>         - Files processed: 3.18K (3184)
>         - Files rejected: 3.13K (3130)
>         - Files total: 3.18K (3184)
>         - RowGroups processed: 0 (0)
>         - RowGroups rejected: 0 (0)
>         - RowGroups total: 0 (0)
>         - Rows processed: 0 (0)
>         - Rows rejected: 0 (0)
>         - Rows total: 0 (0)
>         - Splits processed: 4 (4)
>         - Splits rejected: 0 (0)
>         - Splits total: 4 (4)
>       Filter 1:
>         - Files processed: 54 (54)
>         - Files rejected: 49 (49)
>         - Files total: 54 (54)
>         - RowGroups processed: 0 (0)
>         - RowGroups rejected: 0 (0)
>         - RowGroups total: 0 (0)
>         - Rows processed: 0 (0)
>         - Rows rejected: 0 (0)
>         - Rows total: 0 (0)
>         - Splits processed: 4 (4)
>         - Splits rejected: 0 (0)
>         - Splits total: 4 (4)
>
> Thanks
> Mostafa
>
> On Fri, Apr 8, 2016 at 10:04 AM, Ken Farmer <[email protected]> wrote:
>
>> First off - congrats on this new release. It looks like a lot of hard
>> work went into it.
>>
>> I'm really excited about dynamic runtime filters, but have a question
>> about whether they would apply to the following scenario, which seems
>> oddly missing from the documentation.
>>
>> Assume 2 tables:
>> 1. fact_ids table with 50 billion rows, partitioned by year, month, day,
>>    sensor_id
>> 2. dim_date dimension table with 500 rows, 1 row/day, with date_id,
>>    date_name, year, month, day columns
>>
>> Query that works fine today:
>>
>>     SELECT f.col1,
>>            SUM(somecount)
>>     FROM fact_ids f
>>     WHERE f.year=2016 AND f.month=4 AND f.day=2
>>     GROUP BY 1 ORDER BY 1
>>
>> Query that I hope works with 5.7:
>>
>>     SELECT f.col1,
>>            SUM(somecount)
>>     FROM fact_ids f
>>     INNER JOIN dim_date d
>>       ON f.year = d.year
>>       AND f.month = d.month
>>       AND f.day = d.day
>>     WHERE d.date_name BETWEEN '2016-03-02' AND '2016-04-01'
>>     GROUP BY 1 ORDER BY 1
>>
>> Can anyone confirm that 5.7 will perform dynamic runtime filtering on
>> this second example?
>>
>> Anything else I should be aware of for cases like this? Or limitations
>> of dynamic runtime filtering?
>>
>> Thanks!
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Impala User" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
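
A practical way to check whether runtime filters engage for the second query above is to look at the EXPLAIN plan and, after execution, at the per-filter counters in the profile (the "Files rejected" lines in Mostafa's output). A minimal sketch, assuming the Impala 2.5 / CDH 5.7 query options RUNTIME_FILTER_MODE and EXPLAIN_LEVEL, and using Ken's hypothetical fact_ids / dim_date schema:

```sql
-- Sketch only: RUNTIME_FILTER_MODE controls filter generation
-- (GLOBAL lets filters cross fragment boundaries); a higher
-- EXPLAIN_LEVEL makes the plan more verbose.
SET RUNTIME_FILTER_MODE=GLOBAL;
SET EXPLAIN_LEVEL=2;

-- If runtime filtering applies, the fact-table scan node in the plan
-- should list runtime filters built from the dim_date join keys.
EXPLAIN
SELECT f.col1,
       SUM(somecount)
FROM fact_ids f
INNER JOIN dim_date d
  ON f.year = d.year
 AND f.month = d.month
 AND f.day = d.day
WHERE d.date_name BETWEEN '2016-03-02' AND '2016-04-01'
GROUP BY 1 ORDER BY 1;
```

After running the actual query, the Filter sections of the profile (as in the HDFS_SCAN_NODE output earlier in this thread) show how many files or splits each filter rejected, which indicates how much partition pruning the filters achieved.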
