[ https://issues.apache.org/jira/browse/IMPALA-15002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088413#comment-18088413 ]

ASF subversion and git services commented on IMPALA-15002:
----------------------------------------------------------

Commit 8f6fdc0f3910503556fc088cc4ef306ac5e96009 in impala's branch 
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8f6fdc0f3 ]

IMPALA-14796: Show effective runtime filter targets in profile

This patch adds an "Eff. Tgt. Node(s)" (Effective Target Node(s)) column
to the "Final filter table" in the query profile. This shows which scan
nodes actually had rows rejected by each runtime filter, distinguishing
filters that were effective from those that were applied but rejected no
data. E.g.

 ID  Src. Node  Tgt. Node(s)  Eff. Tgt. Node(s)     Target type  ...
--------------------------------------------------------------------
 10          6             2                  2           LOCAL  ...
  8          7             1                  1          REMOTE  ...
  5          8             2                  2           LOCAL  ...
  4          8             0                  N          REMOTE  ...
  2          9          0, 3               0, 3  REMOTE, REMOTE  ...
  0         10             4                  4          REMOTE  ...

In the above example, filter 4 has "N" in the "Eff. Tgt. Node(s)"
column, which means it did not reject any data on its target node, i.e.
it has no effective target node ("None"). All the other filters are
effective.

Implementation
 - In ScanNode::Close(), collect the effective runtime filter ids by
   checking the "rejected" counters of all the FilterStats. These
   counters correspond to "Files rejected", "RowGroups rejected", "Rows
   rejected", "Splits rejected" in the query profile. If any of them is
   non-zero, the filter has rejected some data so it's effective.
 - Executor reports this info to coordinator via ReportExecStatus RPCs.
   A list of (filter_id, scan_node_id) pairs is added in
   ReportExecStatusRequestPB to carry this info.
 - Coordinator aggregates the effective filter targets when processing
   the status reports.
 - In FilterDebugString(), add a column to show the node ids where the
   runtime filter is effective.
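
The steps above can be sketched roughly as follows. This is a minimal
Python illustration under stated assumptions, not Impala's C++ code: the
FilterStats dataclass and the functions effective_filter_ids() and
aggregate_reports() are hypothetical names, and the counter fields only
mirror the "rejected" labels shown in the query profile.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FilterStats:
    """Hypothetical stand-in for one filter's 'rejected' counters on a scan node."""
    files_rejected: int = 0
    row_groups_rejected: int = 0
    rows_rejected: int = 0
    splits_rejected: int = 0

    def is_effective(self) -> bool:
        # A filter counts as effective if any 'rejected' counter is non-zero.
        return any((self.files_rejected, self.row_groups_rejected,
                    self.rows_rejected, self.splits_rejected))


def effective_filter_ids(stats_by_filter):
    """Scan-node side: sorted ids of the filters that rejected some data."""
    return sorted(fid for fid, stats in stats_by_filter.items()
                  if stats.is_effective())


def aggregate_reports(reports):
    """Coordinator side: merge (filter_id, scan_node_id) pairs from all reports."""
    targets = defaultdict(set)
    for report in reports:
        for filter_id, scan_node_id in report:
            # A set absorbs duplicate pairs that arrive in repeated reports.
            targets[filter_id].add(scan_node_id)
    return dict(targets)
```

Keeping a set of scan node ids per filter on the coordinator side means
that receiving the same pair in more than one status report does not
change the result.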

Other minor changes
 - In coordinator.cc, moved the code that sets the "Final filter table"
   from ReleaseExecResources() to ComputeQuerySummary() so that the
   final status reports from all backends have arrived first.
 - Removed temp_object_pool and temp_mem_tracker from
   FilterDebugString() as they have been unused since commit a985e11.
 - Replaced boost::lexical_cast<string> with std::to_string for
   int-to-string conversion, which is more efficient.
 - Sort node ids in "Tgt. Node(s)" and "Eff. Tgt. Node(s)" columns to
   make the output consistent across different runs.
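
As a small illustration of the sorted column output, the rendering of the
"Eff. Tgt. Node(s)" cell could look like the sketch below. The helper name
format_effective_targets() is hypothetical; the "N" placeholder and the
", " separator are taken from the example table in this message.

```python
def format_effective_targets(node_ids):
    """Render scan node ids for the 'Eff. Tgt. Node(s)' column (hypothetical helper)."""
    if not node_ids:
        # No scan node reported rejected data for this filter.
        return "N"
    # Sorting makes the output independent of report arrival order.
    return ", ".join(str(node_id) for node_id in sorted(node_ids))
```

For example, format_effective_targets({3, 0}) gives "0, 3" regardless of
the order in which the two ids were collected.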

Limitation
 - The Kudu scanner doesn't expose metrics reflecting the effect of
   individual filters, so we can't detect effective runtime filters on
   KuduScanNode. Currently the "Eff. Tgt. Node(s)" column always shows
   "N" for these filters (IMPALA-15002).

Tests
 - Added an e2e test for TPCH-Q5, where some filters are ineffective,
   in both the original profile and aggregated profile modes.
 - Added checks in runtime_filters.test for queries that have only one
   runtime filter.
 - Updated in_list_filters.test for the new column.
 - Ran tests on both the original planner and the calcite planner.

Assisted-by: Claude Sonnet 4.5
Change-Id: Iccf4b87ac4579a70273f3306ec7b58850f06b17c
Reviewed-on: http://gerrit.cloudera.org:8080/24123
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> No way to determine effective runtime filters on KuduScanNode based on profile
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-15002
>                 URL: https://issues.apache.org/jira/browse/IMPALA-15002
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> Runtime filters on KuduScanNode are pushed down to the Kudu scanner as normal
> predicates:
> [https://github.com/apache/impala/blob/e1ca23d627532bb17228e3d455c55a03b3e28f49/be/src/exec/kudu/kudu-scanner.cc#L277-L278]
> [https://github.com/apache/impala/blob/e1ca23d627532bb17228e3d455c55a03b3e28f49/be/src/exec/kudu/kudu-scanner.cc#L319-L331]
> Impala is not aware of the effect of an individual filter. So currently the 
> profile counters of runtime filters on KuduScanNode are all 0. For instance, 
> for the following query
> {code:sql}
> use functional_kudu;
> select STRAIGHT_JOIN count(*) from alltypes p join [BROADCAST] alltypestiny b
> on p.month = b.int_col and b.month = 1 and b.string_col = "1";{code}
> Profile counters of the runtime filters are all 0:
> {noformat}
>         KUDU_SCAN_NODE (id=0):
>           Filter 0 (1.00 MB):
>              - Files processed: 0 (0)
>              - Files rejected: 0 (0)
>              - Files total: 0 (0)
>              - RowGroups processed: 0 (0)
>              - RowGroups rejected: 0 (0)
>              - RowGroups total: 0 (0)
>              - Rows processed: 0 (0)
>              - Rows rejected: 0 (0)
>              - Rows total: 0 (0)
>              - Splits processed: 0 (0)
>              - Splits rejected: 0 (0)
>              - Splits total: 0 (0)
>           Filter 1 (0):
>              - Files processed: 0 (0)
>              - Files rejected: 0 (0)
>              - Files total: 0 (0)
>              - RowGroups processed: 0 (0)
>              - RowGroups rejected: 0 (0)
>              - RowGroups total: 0 (0)
>              - Rows processed: 0 (0)
>              - Rows rejected: 0 (0)
>              - Rows total: 0 (0)
>              - Splits processed: 0 (0)
>              - Splits rejected: 0 (0)
>              - Splits total: 0 (0){noformat}
> Running the same query on parquet tables gets meaningful counters:
> {noformat}
>         HDFS_SCAN_NODE (id=0):
>           Filter 1 (0):
>              - Files processed: 8 (8)
>              - Files rejected: 6 (6)
>              - Files total: 8 (8)
>              - RowGroups processed: 0 (0)
>              - RowGroups rejected: 0 (0)
>              - RowGroups total: 0 (0)
>              - Rows processed: 0 (0)
>              - Rows rejected: 0 (0)
>              - Rows total: 0 (0)
>              - Splits processed: 2 (2)
>              - Splits rejected: 0 (0)
>              - Splits total: 2 (2)
>           Filter 0 (1.00 MB):
>              - Files processed: 2 (2)
>              - Files rejected: 0 (0)
>              - Files total: 2 (2)
>              - RowGroups processed: 0 (0)
>              - RowGroups rejected: 0 (0)
>              - RowGroups total: 0 (0)
>              - Rows processed: 0 (0)
>              - Rows rejected: 0 (0)
>              - Rows total: 0 (0)
>              - Splits processed: 2 (2)
>              - Splits rejected: 0 (0)
>              - Splits total: 2 (2){noformat}
> KUDU-2162 was filed to add metrics for this. However, the metrics it added 
> are not enough to determine the effect of the filters, i.e. whether some data 
> has been filtered out.
> So far we can only check the number of output rows and see if it's smaller than
> the full-scan cardinality, e.g. as in this test:
> https://github.com/apache/impala/blob/e1ca23d627532bb17228e3d455c55a03b3e28f49/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test#L29-L30



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
