[GitHub] [pinot] richardstartin opened a new issue #8453: Evaluate cheap filters before expensive filters

GitBox Thu, 31 Mar 2022 15:06:16 -0700


richardstartin opened a new issue #8453:
URL: https://github.com/apache/pinot/issues/8453



   The plan for the query below is extremely inefficient:
   
   ```sql
   select * from githubEvents where repo_name like 'FeedScale%' and type = 
'PullRequestEvent' limit 100
   ```
   
   | Operator | Operator_Id |   Parent_Id |
   |----------|---------------|------------|
   | BROKER_REDUCE(limit:100) | 0 |     -1 |
   | COMBINE_SELECT |   1 |     0 |
   | SELECT(selectList:actor_id, actor_login, actor_url, created_at, 
event_time, event_time_ts, id, public, pull_request_additions, 
pull_request_changed_files, pull_request_comments, pull_request_commits, 
pull_request_deletions, pull_request_id, pull_request_review_comments, 
pull_request_state, pull_request_title, pull_request_url, repo_id, repo_name, 
repo_url, type)       | 2     | 1 |
   | TRANSFORM_PASSTHROUGH(actor_id, actor_login, actor_url, created_at, 
event_time, event_time_ts, id, public, pull_request_additions, 
pull_request_changed_files, pull_request_comments, pull_request_commits, 
pull_request_deletions, pull_request_id, pull_request_review_comments, 
pull_request_state, pull_request_title, pull_request_url, repo_id, repo_name, 
repo_url, type) | 3 |     2 |
   | PROJECT(repo_url, pull_request_comments, pull_request_changed_files, 
actor_login, pull_request_deletions, created_at, pull_request_url, type, 
pull_request_id, event_time_ts, public, pull_request_review_comments, 
pull_request_title, pull_request_state, repo_id, actor_url, repo_name, 
actor_id, id, pull_request_additions, pull_request_commits, event_time) | 4     
| 3 |
   | FILTER_AND |       5 |     4 |
   | 
FILTER_INVERTED_INDEX(indexLookUp:inverted_index,operator:EQ,predicate:type = 
'PullRequestEvent') |        6 |     5 |
   | 
FILTER_FULL_SCAN(operator:REGEXP_LIKE,predicate:regexp_like(repo_name,'^FeedScale.*$'))
 |  7 |     5 |
   
   Instead of iterating over the result of the `FILTER_INVERTED_INDEX` 
operator, the filter is done in parallel with a full scan and then is 
intersected with other filters. 
   
   This as a few implications:
   
   1. no amount of filtering will ever make this query produce a result without 
timing out unless the user has a text index
   2. not even the limit can prevent the query from scanning the entire table 
because there may be fewer matches than the limit
   3. Pinot users shouldn't put any expression in a where clause which isn't 
guaranteed to match an index
   
   Cheap filters should be pushed down before expensive filters. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [pinot] richardstartin opened a new issue #8453: Evaluate cheap filters before expensive filters

Reply via email to