[jira] [Resolved] (IMPALA-4252) Add RuntimeFilters for "in list" and/or min/max at KuduScanNode

Thomas Tauber-Marshall (JIRA) Fri, 17 Nov 2017 13:38:51 -0800

     [ 
https://issues.apache.org/jira/browse/IMPALA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Thomas Tauber-Marshall resolved IMPALA-4252.
--------------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.11.0

commit 2510fe0aa0c86f460af9040eb413aad76c13cc84  
Author: Thomas Tauber-Marshall <[email protected]>  
Date:   Mon Oct 23 07:58:34 2017 -0700  
  
    IMPALA-4252: Min-max runtime filters for Kudu  
      
    This patch implements min-max runtime filters. Each  
    runtime filter generates a bloom filter or a min-max filter,  
    depending on whether it has HDFS or Kudu targets, respectively.  
      
    In RuntimeFilterGenerator in the planner, each hash join node  
    generates both a bloom filter and a min-max filter for each equi-join  
    predicate, but only those filters that end up being assigned to a  
    target make it into the final plan.  
      
    Min-max filters are only assigned to Kudu scans if the target expr is  
    a column, as Kudu doesn't support bounds on general exprs, and only if  
    the join op is '=' and not 'is distinct from', as Kudu doesn't support  
    returning NULLs if a bound is set.  
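
The generation and assignment rules above can be sketched as follows. This is a minimal illustration, not the planner's code: `Target`, `Candidate`, `can_assign` and `plan_filters` are hypothetical names, and the real RuntimeFilterGenerator is part of Impala's Java frontend. It assumes only what the commit message states: both filter kinds are generated per equi-join predicate, each candidate consumes an RF id, and only assignable candidates survive.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Target:
    """Hypothetical stand-in for the scan-side target of a join predicate."""
    storage: str     # "hdfs" or "kudu"
    is_column: bool  # True if the target expr is a bare column reference


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one generated runtime filter."""
    rf_id: int
    kind: str        # "bloom" or "min_max"
    join_op: str
    target: Target


def can_assign(c):
    """Apply the assignment rules described in the commit message."""
    if c.kind == "bloom":
        # Bloom filters are used for HDFS targets.
        return c.target.storage == "hdfs"
    # Min-max filters: Kudu targets only, bare columns only, and only for
    # plain '=' because a bounded Kudu scan cannot return NULLs.
    return (c.target.storage == "kudu"
            and c.target.is_column
            and c.join_op == "=")


def plan_filters(predicates):
    """Generate both filter kinds per predicate, then keep the assignable ones."""
    ids = count()
    candidates = [
        Candidate(next(ids), kind, join_op, target)
        for join_op, target in predicates
        for kind in ("bloom", "min_max")
    ]
    return [c for c in candidates if can_assign(c)]
```

Because every candidate takes an id whether or not it is kept, the ids in the final plan have gaps. With an HDFS target followed by a Kudu column target, both joined with '=', the surviving filters are a bloom filter with id 0 and a min-max filter with id 3.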
      
    The PartitionedHashJoinBuilder inserts values into the min-max  
    filters. Codegen is used to eliminate branching on the type of  
    filter. String min-max filters truncate their bounds at 1024 chars,  
    which keeps the maximum memory used by min-max filters negligible.  
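
A minimal sketch of a string min-max filter with truncated bounds is shown below. It is illustrative only and is not Impala's implementation: the class, the `round_up` helper and the `upper_unbounded` flag are hypothetical names. The one design assumption beyond the commit message is how truncation stays safe: a truncated minimum is a prefix and therefore still sorts at or below the original, while a truncated maximum has to be rounded up so it still sorts at or above the original.

```python
MAX_BOUND_LEN = 1024  # truncation limit named in the commit message


def round_up(prefix):
    """Return a byte string greater than every string starting with `prefix`.

    Returns None if `prefix` is empty or all 0xFF, since no such bound exists.
    """
    b = bytearray(prefix)
    i = len(b) - 1
    while i >= 0 and b[i] == 0xFF:
        i -= 1
    if i < 0:
        return None
    b[i] += 1
    return bytes(b[: i + 1])


class StringMinMaxFilter:
    """Illustrative string min-max filter; not Impala's actual class."""

    def __init__(self, max_len=MAX_BOUND_LEN):
        self.max_len = max_len
        self.min = None
        self.max = None
        self.upper_unbounded = False  # set when no finite upper bound fits

    def insert(self, value):
        if self.min is None or value < self.min:
            # A prefix never sorts above the full value, so it stays a lower bound.
            self.min = value[: self.max_len]
        if self.upper_unbounded:
            return
        if self.max is None or value > self.max:
            if len(value) <= self.max_len:
                self.max = value
            else:
                rounded = round_up(value[: self.max_len])
                if rounded is None:
                    self.upper_unbounded = True
                else:
                    self.max = rounded

    def may_contain(self, value):
        if self.min is None:
            return False  # nothing was inserted on the build side
        if value < self.min:
            return False
        return self.upper_unbounded or value <= self.max
```

With `max_len=4`, inserting `b"apple"`, `b"banana"` and `b"cherry"` leaves the bounds `b"appl"` and `b"ches"`. The bounds are wider than the inserted values, so the filter may pass a few extra rows but never drops a matching one.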
      
    For now, min-max filters are only applied at the KuduScanner, which  
    passes them into the Kudu client.  
      
    Future work will address applying min-max filters at HDFS scan nodes  
    and applying bloom filters at Kudu scan nodes.  
      
    Functional Testing:  
    - Added new planner tests and updated the old ones. (In the old  
      tests, many runtime filters are renumbered because min-max filters  
      are always generated, and take up some of the RF ids, even if they  
      don't end up getting assigned.)  
    - Updated existing runtime filter tests to work with Kudu.  
    - Added e2e tests for min-max filter specific functionality.  
      
    Perf Testing:  
    - All tests were run on the Kudu stress cluster (10 nodes) against  
      tpch_100_kudu; timings are averages of 3 runs.  
    - Ran a contrived query with a filter that does not eliminate any rows  
      (full self join of lineitem). The difference in running time was  
      negligible - 24.46s with filters on, 24.15s with filters off for  
      a ~1% slowdown.  
    - Ran a contrived query with a filter that eliminates all rows (self  
      join on lineitem with a join condition that never matches). The  
      filters resulted in a significant speedup - 0.26s with filters on,  
      1.46s with filters off for a ~5.6x speedup. This query is added to  
      targeted-perf.  
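
The two ratios quoted above follow directly from the reported timings. The helper names below are hypothetical and only make the arithmetic explicit.

```python
def slowdown_pct(with_filters, without_filters):
    """Percentage by which the run with filters is slower."""
    return (with_filters / without_filters - 1.0) * 100.0


def speedup(with_filters, without_filters):
    """Factor by which the run with filters is faster."""
    return without_filters / with_filters


# Timings in seconds, taken from the perf results in the commit message.
no_elimination = slowdown_pct(24.46, 24.15)  # about 1.3, reported as ~1%
full_elimination = speedup(0.26, 1.46)       # about 5.6
```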
      
    Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c  
    Reviewed-on: http://gerrit.cloudera.org:8080/7793  
    Reviewed-by: Thomas Tauber-Marshall <[email protected]>  
    Tested-by: Impala Public Jenkins 

> Add RuntimeFilters for "in list" and/or min/max at KuduScanNode
> ---------------------------------------------------------------
>
>                 Key: IMPALA-4252
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4252
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>    Affects Versions: Kudu_Impala
>            Reporter: Matthew Jacobs
>            Assignee: Thomas Tauber-Marshall
>              Labels: kudu, performance, runtime-filters
>             Fix For: Impala 2.11.0
>
>
> Kudu scans will benefit significantly from runtime filters. For now, we could 
> generate an 'in list' and/or min/max values instead of bloom filters to push 
> to the KuduScanNode and, with support from the Kudu client 
> ([KUDU-1683|https://issues.apache.org/jira/browse/KUDU-1683]), we could push 
> them to Kudu during execution. At some point, it would be nice to push bloom 
> filters to Kudu (IMPALA-3741), but that will require more work and is better 
> handled as a follow-up task to this one.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
