Fantastic - that's just what I needed in order to prioritize a full test and analysis of 5.7 this month.
Thanks again,
Ken

On Fri, Apr 8, 2016 at 12:39 PM, Mostafa Mokhtar <[email protected]> wrote:

> Hi Ken,
>
> Yes, the query you provided should work. I validated that the query below,
> which is fairly similar to the one listed above, works as expected.
>
>     select
>         count(*)
>     from
>         store_sales_mp,
>         date_dim
>     where
>         ss_sold_date_sk = d_date_sk
>         and ss_store_sk = d_dow
>         and d_year = 1999
>         and d_moy = 1
>
> store_sales_mp is partitioned on two columns:
>
>     | # col_name      | data_type |
>     | ss_sold_date_sk | bigint    |
>     | ss_store_sk     | bigint    |
>
> The query execution summary shows that only a small number of rows from
> store_sales_mp pass the scan:
>
> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
> | Operator        | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                                     |
> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
> | 06:AGGREGATE    | 1      | 160.66ms | 160.66ms | 1      | 1          | 88.00 KB | -1 B          | FINALIZE                                   |
> | 05:EXCHANGE     | 1      | 137.98us | 137.98us | 19     | 1          | 0 B      | -1 B          | UNPARTITIONED                              |
> | 03:AGGREGATE    | 19     | 151.01ms | 183.65ms | 19     | 1          | 8.04 MB  | 10.00 MB      |                                            |
> | 02:HASH JOIN    | 19     | 101.27ms | 185.51ms | 4.82M  | 43.20B     | 2.02 MB  | 3.89 KB       | INNER JOIN, BROADCAST                      |
> | |--04:EXCHANGE  | 19     | 14.54us  | 25.36us  | 589    | 181        | 0 B      | 0 B           | BROADCAST                                  |
> | |  01:SCAN HDFS | 1      | 20.39ms  | 20.39ms  | 31     | 181        | 1.82 MB  | 64.00 MB      | tpcds_15000_decimal_parquet.date_dim       |
> | 00:SCAN HDFS    | 19     | 37.54ms  | 81.18ms  | 45.00M | 43.20B     | 9.84 MB  | 0 B           | tpcds_15000_decimal_parquet.store_sales_mp |
> +-----------------+--------+----------+----------+--------+------------+----------+---------------+--------------------------------------------+
>
> Finally, from the query profile:
>
>     HDFS_SCAN_NODE (id=0): (Total: 37.539ms, non-child: 37.539ms, % non-child: 100.00%)
>       - AverageHdfsReadThreadConcurrency: 0.00
>       - AverageScannerThreadConcurrency: 0.93
>       - BytesRead: 489.47 KB (501221)
>       - BytesReadDataNodeCache: 0
>       - BytesReadLocal: 468.42 KB (479663)
>       - BytesReadRemoteUnexpected: 21.05 KB (21557)
>       - BytesReadShortCircuit: 468.42 KB (479663)
>       - DecompressionTime: 0.000ns
>       - MaxCompressedTextFileLength: 0
>       - NumColumns: 0 (0)
>       - NumDisksAccessed: 4 (4)
>       - NumRowGroups: 0 (0)
>       - NumScannerThreadsStarted: 5 (5)
>       - PeakMemoryUsage: 9.35 MB (9799572)
>       - PerReadThreadRawHdfsThroughput: 901.98 MB/sec
>       - RemoteScanRanges: 0 (0)
>       - RowsRead: 2.37M (2368521)
>       - RowsReturned: 2.37M (2368521)
>       - RowsReturnedRate: 62.23 M/sec
>       - ScanRangesComplete: 3.21K (3207)
>       - ScannerThreadsInvoluntaryContextSwitches: 26 (26)
>       - ScannerThreadsTotalWallClockTime: 420.999ms
>         - MaterializeTupleTime(*): 471.000ns
>         - ScannerThreadsSysTime: 82.142ms
>         - ScannerThreadsUserTime: 65.777ms
>       - ScannerThreadsVoluntaryContextSwitches: 5.41K (5413)
>       - TotalRawHdfsReadTime(*): 1.940ms
>       - TotalReadThroughput: 136.84 KB/sec
>       Filter 0:
>         - Files processed: 3.18K (3184)
>         - Files rejected: 3.13K (3130)
>         - Files total: 3.18K (3184)
>         - RowGroups processed: 0 (0)
>         - RowGroups rejected: 0 (0)
>         - RowGroups total: 0 (0)
>         - Rows processed: 0 (0)
>         - Rows rejected: 0 (0)
>         - Rows total: 0 (0)
>         - Splits processed: 4 (4)
>         - Splits rejected: 0 (0)
>         - Splits total: 4 (4)
>       Filter 1:
>         - Files processed: 54 (54)
>         - Files rejected: 49 (49)
>         - Files total: 54 (54)
>         - RowGroups processed: 0 (0)
>         - RowGroups rejected: 0 (0)
>         - RowGroups total: 0 (0)
>         - Rows processed: 0 (0)
>         - Rows rejected: 0 (0)
>         - Rows total: 0 (0)
>         - Splits processed: 4 (4)
>         - Splits rejected: 0 (0)
>         - Splits total: 4 (4)
>
> Thanks
> Mostafa
>
> On Fri, Apr 8, 2016 at 10:04 AM, Ken Farmer <[email protected]> wrote:
>
>> First off - congrats on this new release. It looks like a lot of hard
>> work went into it.
>>
>> I'm really excited about dynamic runtime filters, but have a question
>> about whether they would apply to the following scenario, which seems
>> oddly missing from the documentation.
>>
>> Assume 2 tables:
>> 1. fact_ids table with 50 billion rows, partitioned by year, month, day,
>>    sensor_id
>> 2. dim_date dimension table with 500 rows, 1 row/day, with date_id,
>>    date_name, year, month, day columns
>>
>> Query that works fine today:
>>
>>     SELECT f.col1,
>>            SUM(somecount)
>>     FROM fact_ids f
>>     WHERE f.year=2016 AND f.month=4 AND f.day=2
>>     GROUP BY 1 ORDER BY 1
>>
>> Query that I hope works with 5.7:
>>
>>     SELECT f.col1,
>>            SUM(somecount)
>>     FROM fact_ids f
>>     INNER JOIN dim_date d
>>       ON f.year = d.year
>>       AND f.month = d.month
>>       AND f.day = d.day
>>     WHERE d.date_name BETWEEN '2016-03-02' AND '2016-04-01'
>>     GROUP BY 1 ORDER BY 1
>>
>> Can anyone confirm that 5.7 will perform dynamic runtime filtering on
>> this second example?
>>
>> Anything else I should be aware of for cases like this? Or limitations
>> of dynamic runtime filtering?
>>
>> Thanks!
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Impala User" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
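
A practical way to check whether runtime filters engage for the second query above is to look at the EXPLAIN plan and, after execution, at the per-filter counters in the profile (the "Files rejected" lines in Mostafa's output). A minimal sketch, assuming the Impala 2.5 / CDH 5.7 query options RUNTIME_FILTER_MODE and EXPLAIN_LEVEL, and using Ken's hypothetical fact_ids / dim_date schema:

```sql
-- Sketch only: RUNTIME_FILTER_MODE controls filter generation
-- (GLOBAL lets filters cross fragment boundaries); a higher
-- EXPLAIN_LEVEL makes the plan more verbose.
SET RUNTIME_FILTER_MODE=GLOBAL;
SET EXPLAIN_LEVEL=2;

-- If runtime filtering applies, the fact-table scan node in the plan
-- should list runtime filters built from the dim_date join keys.
EXPLAIN
SELECT f.col1,
       SUM(somecount)
FROM fact_ids f
INNER JOIN dim_date d
  ON f.year = d.year
 AND f.month = d.month
 AND f.day = d.day
WHERE d.date_name BETWEEN '2016-03-02' AND '2016-04-01'
GROUP BY 1 ORDER BY 1;
```

After running the actual query, the Filter sections of the profile (as in the HDFS_SCAN_NODE output earlier in this thread) show how many files or splits each filter rejected, which indicates how much partition pruning the filters achieved.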
