Hi Ken,
Yes, the query you provided should work. I validated that the query below, which is fairly similar to the one listed above, behaves as expected:
select
  count(*)
from
  store_sales_mp,
  date_dim
where
  ss_sold_date_sk = d_date_sk
  and ss_store_sk = d_dow
  and d_year = 1999
  and d_moy = 1
store_sales_mp is partitioned on two columns:

| # col_name      | data_type |
| ss_sold_date_sk | bigint    |
| ss_store_sk     | bigint    |
The query execution summary shows that only a small number of rows from
store_sales_mp qualify for the scan:
+-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                                     |
+-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
| 06:AGGREGATE    | 1      | 160.66ms | 160.66ms | 1      | 1          | 88.00 KB | -1 B          | FINALIZE                                   |
| 05:EXCHANGE     | 1      | 137.98us | 137.98us | 19     | 1          | 0 B      | -1 B          | UNPARTITIONED                              |
| 03:AGGREGATE    | 19     | 151.01ms | 183.65ms | 19     | 1          | 8.04 MB  | 10.00 MB      |                                            |
| 02:HASH JOIN    | 19     | 101.27ms | 185.51ms | 4.82M  | 43.20B     | 2.02 MB  | 3.89 KB       | INNER JOIN, BROADCAST                      |
| |--04:EXCHANGE  | 19     | 14.54us  | 25.36us  | 589    | 181        | 0 B      | 0 B           | BROADCAST                                  |
| |  01:SCAN HDFS | 1      | 20.39ms  | 20.39ms  | 31     | 181        | 1.82 MB  | 64.00 MB      | tpcds_15000_decimal_parquet.date_dim       |
| 00:SCAN HDFS    | 19     | 37.54ms  | 81.18ms  | 45.00M | 43.20B     | 9.84 MB  | 0 B           | tpcds_15000_decimal_parquet.store_sales_mp |
+-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
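Note the gap between estimated and actual rows at the fact-table scan: the planner estimated 43.20B rows, but only 45.00M were read, because partitions whose keys cannot match the filtered dimension side are skipped. A minimal sketch of the idea in Python (not Impala's implementation; the table contents and key values below are made up for illustration):

```python
# Sketch of dynamic partition pruning via a runtime filter:
# the fact-table scan waits for the dimension side of the join
# to produce its set of join keys, then skips fact partitions
# whose partition-key value is not in that set.

# Hypothetical dimension rows: (date_sk, year, moy).
date_dim = [(2451180 + i, 1999 if i < 31 else 2000, 1) for i in range(60)]

# Hypothetical fact partitions keyed by ss_sold_date_sk,
# mapping each partition key to its row count.
fact_partitions = {2451180 + i: 1_000_000 for i in range(365)}

# Build the runtime filter from the *filtered* dimension side
# (the d_year = 1999 and d_moy = 1 predicates).
runtime_filter = {d[0] for d in date_dim if d[1] == 1999 and d[2] == 1}

# Prune: only partitions whose key passes the filter are scanned.
scanned = {k: v for k, v in fact_partitions.items() if k in runtime_filter}

print(len(fact_partitions), len(scanned))  # 365 31 -- most partitions skipped
```

In Impala the filter is built by the hash-join build side and pushed to the scan node, which is why the counters below report files and splits being rejected before any rows are read.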
Finally, from the query profile:
HDFS_SCAN_NODE (id=0):(Total: 37.539ms, non-child: 37.539ms, %
non-child: 100.00%)
- AverageHdfsReadThreadConcurrency: 0.00
- AverageScannerThreadConcurrency: 0.93
- BytesRead: 489.47 KB (501221)
- BytesReadDataNodeCache: 0
- BytesReadLocal: 468.42 KB (479663)
- BytesReadRemoteUnexpected: 21.05 KB (21557)
- BytesReadShortCircuit: 468.42 KB (479663)
- DecompressionTime: 0.000ns
- MaxCompressedTextFileLength: 0
- NumColumns: 0 (0)
- NumDisksAccessed: 4 (4)
- NumRowGroups: 0 (0)
- NumScannerThreadsStarted: 5 (5)
- PeakMemoryUsage: 9.35 MB (9799572)
- PerReadThreadRawHdfsThroughput: 901.98 MB/sec
- RemoteScanRanges: 0 (0)
- RowsRead: 2.37M (2368521)
- RowsReturned: 2.37M (2368521)
- RowsReturnedRate: 62.23 M/sec
- ScanRangesComplete: 3.21K (3207)
- ScannerThreadsInvoluntaryContextSwitches: 26 (26)
- ScannerThreadsTotalWallClockTime: 420.999ms
  - MaterializeTupleTime(*): 471.000ns
  - ScannerThreadsSysTime: 82.142ms
  - ScannerThreadsUserTime: 65.777ms
- ScannerThreadsVoluntaryContextSwitches: 5.41K (5413)
- TotalRawHdfsReadTime(*): 1.940ms
- TotalReadThroughput: 136.84 KB/sec
Filter 0:
- Files processed: 3.18K (3184)
- Files rejected: 3.13K (3130)
- Files total: 3.18K (3184)
- RowGroups processed: 0 (0)
- RowGroups rejected: 0 (0)
- RowGroups total: 0 (0)
- Rows processed: 0 (0)
- Rows rejected: 0 (0)
- Rows total: 0 (0)
- Splits processed: 4 (4)
- Splits rejected: 0 (0)
- Splits total: 4 (4)
Filter 1:
- Files processed: 54 (54)
- Files rejected: 49 (49)
- Files total: 54 (54)
- RowGroups processed: 0 (0)
- RowGroups rejected: 0 (0)
- RowGroups total: 0 (0)
- Rows processed: 0 (0)
- Rows rejected: 0 (0)
- Rows total: 0 (0)
- Splits processed: 4 (4)
- Splits rejected: 0 (0)
- Splits total: 4 (4)
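Reading the two filters' counters together: Filter 0 rejected 3,130 of the 3,184 files at the scan, and Filter 1 rejected 49 of the 54 that survived, so only a handful of files were actually read. As plain arithmetic (values taken from the counters above):

```python
# The two runtime filters compound: each rejects files whose
# partition-key value cannot match the dimension side.
files_total = 3184                         # Filter 0 "Files total"
rejected_by_f0 = 3130                      # Filter 0 "Files rejected"
after_f0 = files_total - rejected_by_f0    # matches Filter 1 "Files total"

rejected_by_f1 = 49                        # Filter 1 "Files rejected"
after_f1 = after_f0 - rejected_by_f1       # files actually scanned

print(after_f0, after_f1)  # 54 5
```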
Thanks
Mostafa
On Fri, Apr 8, 2016 at 10:04 AM, Ken Farmer <[email protected]> wrote:
> First off - congrats on this new release. It looks like a lot of hard
> work went into it.
>
> I'm really excited about dynamic runtime filters, but have a question
> about whether it would apply to the following scenario, which seems oddly
> missing from the documentation.
>
> Assume 2 tables:
> 1. fact_ids table with 50 billion rows partitioned by year, month, day,
> sensor_id
> 2. dim_date dimension table with 500 rows (1 row per day), with date_id,
> date_name, year, month, and day columns
>
> Query that works fine today:
> SELECT f.col1,
> SUM(somecount)
> FROM fact_ids f
> WHERE f.year=2016 AND f.month=4 AND f.day=2
> GROUP BY 1 ORDER BY 1
>
>
> Query that I hope works with 5.7:
> SELECT f.col1,
> SUM(somecount)
> FROM fact_ids f
> INNER JOIN dim_date d
> ON f.year = d.year
> AND f.month = d.month
> AND f.day = d.day
> WHERE d.date_name BETWEEN '2016-03-02' AND '2016-04-01'
> GROUP BY 1 ORDER BY 1
>
> Can anyone confirm that 5.7 will perform dynamic runtime filtering on this
> second example above?
>
> Anything else I should be aware of for cases like this? Or limitations of
> dynamic runtime filtering?
>
>
> Thanks!
>
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Impala User" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
>