[jira] [Comment Edited] (CASSANDRA-6048) Add the ability to use multiple indexes in a single query

Alex Liu (JIRA) Fri, 18 Oct 2013 13:14:41 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799360#comment-13799360
 ]


Alex Liu edited comment on CASSANDRA-6048 at 10/18/13 8:13 PM:
---------------------------------------------------------------

A good summary of different index scans in PostgresSQL.  
https://gist.github.com/hgmnz/883144

h4. Index scan vs Bitmap Index Scan vs Merge join

h5. Index Scan 
reads the index in alternation, bouncing between table and index, row at a 
time. 
    Performance gain when reading a few rows from a big table. It goes to the 
index, finds the rows
    Only method of accessing a table the can return rows in sorted order - 
useful in combination with LIMIT.
    Disadvantage: Causes random I/O against the base table. Read a row from the 
index, then a row from the table, and so on.

h5. Bitmap Index Scan 
    scans all index rows before examining base table.
 
    Can efficiently combine data from multiple indeces. the TID bitmap can 
handle boolean ANDs and ORs.
    Handles LIMIT poorly.
    Adjust page cost parameters to use Index scans vs Bitamp scans on queries 
where it makes sense.
    SMALL INDEXES ARE BETTER. Less columns are better. Avoid multicolumn 
indexes.


h5. Merge join
Puts both input relations in sorted order (sort or index scan), and match up on 
equal values by scanning through the two in parallel.
Only way to handle really big datasets, however merge joins are slower.

My approach is combing Index scan and merge join of indexes which handle limit 
clause better and keep the order. The index + loop approach is Index scan on 
the primary index.
We can use Bitmap Index Scan for none-limit clause query and result sets are 
not in order.

We have the following options
* index scan (on primary index) + loop                     -- we are currently 
doing
* bitmap scan (on primary index) + loop                    -- reduce random I/O 
(TODO)
* bitmap index join (join all indexes into bitmap)  + bitmap scan -- reduce 
random I/O and less base table scan (TODO)
* index merge join(merge join all indexes) + index scan   -- quite index random 
I/O only good for larger indexes (my first approach)


was (Author: alexliu68):
A good summary of different index scans in PostgresSQL.  
https://gist.github.com/hgmnz/883144

h4. Index scan vs Bitmap Index Scan vs Merge join

h5. Index Scan 
reads the index in alternation, bouncing between table and index, row at a 
time. 
    Performance gain when reading a few rows from a big table. It goes to the 
index, finds the rows
    Only method of accessing a table the can return rows in sorted order - 
useful in combination with LIMIT.
    Disadvantage: Causes random I/O against the base table. Read a row from the 
index, then a row from the table, and so on.

h5. Bitmap Index Scan 
    scans all index rows before examining base table.
 
    Can efficiently combine data from multiple indeces. the TID bitmap can 
handle boolean ANDs and ORs.
    Handles LIMIT poorly.
    Adjust page cost parameters to use Index scans vs Bitamp scans on queries 
where it makes sense.
    SMALL INDEXES ARE BETTER. Less columns are better. Avoid multicolumn 
indexes.


h5. Merge join
Puts both input relations in sorted order (sort or index scan), and match up on 
equal values by scanning through the two in parallel.
Only way to handle really big datasets, however merge joins are slower.

My approach is combing Index scan and merge join of indexes which handle limit 
clause better and keep the order. The index + loop approach is Index scan on 
the primary index.
We can use Bitmap Index Scan for none-limit clause query and result sets are 
not in order.

> Add the ability to use multiple indexes in a single query
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-6048
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6048
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Alex Liu
>            Assignee: Alex Liu
>             Fix For: 2.1
>
>         Attachments: 6048-1.2-branch.txt, 6048-trunk.txt
>
>
> Existing data filtering uses the following algorithm
> {code}
>    1. find best selective predicate based on the smallest mean columns count
>    2. fetch rows for the best selective predicate predicate, then filter the 
> data based on other predicates left.
> {code}
> So potentially we could improve the performance by
> {code}
>    1.  joining multiple predicates then do the data filtering for other 
> predicates.
>    2.  fine tune the best predicate selection algorithm
> {code}
> For multiple predicate join, it could improve performance if one predicate 
> has many entries and another predicate has a very few of entries. It means a 
> few index CF read, join the row keys, fetch rows then filter other predicates
> Another approach is to have index on multiple columns.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Comment Edited] (CASSANDRA-6048) Add the ability to use multiple indexes in a single query

Reply via email to