[jira] [Commented] (KUDU-2162) Expose stats about scan filters

Will Berkeley (JIRA) Wed, 04 Oct 2017 12:35:23 -0700

    [ 
https://issues.apache.org/jira/browse/KUDU-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191890#comment-16191890
 ]


Will Berkeley commented on KUDU-2162:
-------------------------------------

[~twmarshall]

Let me clarify something: the scanner metrics I'm working on at the moment 
would be for a scanner on a particular tablet, not metrics for the whole scan. 
Of course, the whole-scan metrics are the sum of the tablet-level metrics, but 
Kudu doesn't aggregate that information; we've left it up to systems like CM 
to do that.
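
As a rough sketch of what that external aggregation amounts to (the tablet ids 
and counter names below are made up for illustration, they are not actual Kudu 
metric names):

```python
from collections import Counter


def aggregate_scan_metrics(per_tablet):
    """Sum per-tablet scanner counters into whole-scan totals.

    per_tablet maps a tablet id to a dict of counter name -> value.
    Both the ids and the counter names are hypothetical.
    """
    total = Counter()
    for counters in per_tablet.values():
        # Counter.update() adds values key by key instead of replacing them.
        total.update(counters)
    return dict(total)


per_tablet = {
    "tablet-a": {"rows_scanned": 1000, "rows_returned": 40},
    "tablet-b": {"rows_scanned": 2500, "rows_returned": 10},
}
whole_scan = aggregate_scan_metrics(per_tablet)
```

The same summation would apply wherever the per-tablet numbers end up being 
collected, whether in CM or in a query coordinator.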

Also, right now these metrics are exposed, AFAIK, only through the /metrics 
HTTP endpoint, as JSON.

How would Impala like to consume the metrics? The HTTP endpoint isn't a great 
option. I like the idea of returning some metrics with each scan batch, so each 
fragment of an Impala query would get metrics about its part of the scan. The 
metrics could be aggregated in the coordinator.

bq. why is it difficult to get the number of blocks skipped?

It's more work, though I guess not difficult per se. We already keep track of the 
number of blocks read and it's trivial to expose it to the metrics subsystem. 
We don't store a total number of blocks in a cfile, so we'd need to add that as 
new metadata (likely easiest path) or do more accounting for blocks skipped in 
the iteration code.
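
To make the "new metadata" option concrete, here is a minimal sketch. It 
assumes a per-cfile total block count is available, which is exactly the piece 
that doesn't exist today, so total_blocks is hypothetical:

```python
def blocks_skipped(total_blocks, blocks_read):
    """Derive skipped blocks from a total and the blocks-read counter.

    total_blocks stands in for the hypothetical new cfile metadata;
    blocks_read is the counter that is already tracked.
    """
    if blocks_read < 0 or total_blocks < 0:
        raise ValueError("block counts must be non-negative")
    if blocks_read > total_blocks:
        raise ValueError("cannot read more blocks than the cfile contains")
    return total_blocks - blocks_read


skipped = blocks_skipped(total_blocks=120, blocks_read=45)
```

The alternative, counting skipped blocks directly in the iteration code, 
would avoid the new metadata at the cost of more bookkeeping.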

bq. Is there a way to get the total number of blocks for a table?

Basically, table-level metrics don't exist right now. A table is a master-level 
concept; tablet servers just know what tablets they hold. Tablet servers have 
metrics on the size of their tablets, but something like CM is needed to take 
all the tablet metrics and aggregate them into table metrics, because no 
metrics are centralized in the master.
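
A sketch of that kind of external roll-up, assuming each tablet server's 
/metrics JSON is a list of entities that carry a type, an attributes map with 
a table_name, and a list of name/value metrics. The sample payloads and the 
metric name on_disk_size are illustrative assumptions, not output captured 
from a real server:

```python
import json
from collections import defaultdict

# Hypothetical /metrics payloads from two tablet servers.
TSERVER_A = """[
  {"type": "tablet", "id": "t1", "attributes": {"table_name": "orders"},
   "metrics": [{"name": "on_disk_size", "value": 100}]},
  {"type": "tablet", "id": "t2", "attributes": {"table_name": "users"},
   "metrics": [{"name": "on_disk_size", "value": 30}]}
]"""
TSERVER_B = """[
  {"type": "server", "id": "kudu.tabletserver", "attributes": {},
   "metrics": []},
  {"type": "tablet", "id": "t3", "attributes": {"table_name": "orders"},
   "metrics": [{"name": "on_disk_size", "value": 250}]}
]"""


def table_totals(payloads, metric):
    """Sum one tablet-level metric per table across several servers."""
    totals = defaultdict(int)
    for text in payloads:
        for entity in json.loads(text):
            if entity.get("type") != "tablet":
                continue  # skip server-level entities
            table = entity["attributes"]["table_name"]
            for m in entity["metrics"]:
                if m["name"] == metric:
                    totals[table] += m["value"]
    return dict(totals)


totals = table_totals([TSERVER_A, TSERVER_B], "on_disk_size")
```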

bq. And what about # of partitions that are skipped, as Dan suggested?

To determine that, I think we already have the information needed in the 
client: if the table has 100 tablets, but a scanner serializes into 50 scan 
tokens, then 50 tablets (partitions) are being skipped. We don't (yet) support 
splitting scan tokens finer than by partition.

Note it's possible that a scan token will be made for a partition with no 
matching rows, either because other predicates eliminate all data or our 
partition pruning logic isn't as good as it could be. I don't know if you'd 
count that as skipped, but there is always at least a round trip to a tablet 
server in that case. This suggests another interesting metric for a scan: 
number of partitions scanned but returning no data.
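
The arithmetic above, plus the "scanned but empty" count, can be sketched as 
follows. This assumes one scan token per partition, as described, and that 
the caller records how many rows each token's scan returned; the function and 
key names are hypothetical:

```python
def partition_stats(total_tablets, rows_per_token):
    """Summarize partition pruning for one scan.

    total_tablets:  number of tablets (partitions) in the table.
    rows_per_token: rows returned by each scan token, one entry per
                    token, with one token per partition.
    """
    scanned = len(rows_per_token)
    if scanned > total_tablets:
        raise ValueError("more scan tokens than tablets")
    return {
        "skipped": total_tablets - scanned,
        "scanned": scanned,
        # Still costs at least a round trip to a tablet server.
        "scanned_empty": sum(1 for rows in rows_per_token if rows == 0),
    }


# 100 tablets, 50 scan tokens, 3 of which return no rows.
stats = partition_stats(100, [0, 0, 0] + [10] * 47)
```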

> Expose stats about scan filters
> -------------------------------
>
>                 Key: KUDU-2162
>                 URL: https://issues.apache.org/jira/browse/KUDU-2162
>             Project: Kudu
>          Issue Type: Improvement
>          Components: client
>            Reporter: Thomas Tauber-Marshall
>
> Impala is working on implementing runtime filters that get pushed down into 
> Kudu using KuduScanner::AddConjunctPredicate().
> It would be useful for perf analysis and debugging to be able to include info 
> in Impala's runtime profile about the effectiveness of the filters, e.g. the 
> number of rows that are filtered.
> This would probably require at least two counters:
> - # of blocks that are entirely skipped
> - # of rows that are filtered from blocks that aren't entirely skipped



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
