[jira] [Updated] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

David Alves (JIRA) Fri, 31 Aug 2012 17:53:09 -0700

     [ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


David Alves updated CASSANDRA-1337:
-----------------------------------

    Attachment: 1337.patch

Clean rehash that addresses Sylvain's (very helpful) comments, including an 
implementation for the CQL3 case. It estimates the concurrency factor in the 
following ways:

- Primary Indexes + Thrift - divides cfs by RF
- 2ndary indexes + Thrift - uses the mean col count of the most selective index 
to estimate the number of keys
- CQL3 + IdentityFilter - uses the estimated keys + mean col count to estimate 
cols per node
- CQL3 + Names filter - assumes cols with names are present and uses estimated 
keys to calculate cols per node
- CQL3 + Other filters - as Sylvain mentioned, because we have no idea of the 
selectivity of the col filter, we cannot estimate how many cols will be returned 
per node, so we revert to concurrency factor = 1.
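
As a rough illustration of the cases above, the sketch below turns a per-range 
row estimate into a concurrency factor and falls back to 1 when no estimate is 
available. It is not taken from 1337.patch: the function name, its parameters, 
and the ceil(needed / estimate) formula are all assumptions made for this 
example.

```python
import math
from typing import Optional


def concurrency_factor(rows_needed: int,
                       est_rows_per_range: Optional[float],
                       total_ranges: int) -> int:
    """Hypothetical helper: how many ranges to query in parallel per round.

    est_rows_per_range is None when the selectivity of the filter is unknown
    (the "other filters" case), in which case we fall back to sequential
    execution, i.e. a concurrency factor of 1.
    """
    if total_ranges < 1:
        raise ValueError("total_ranges must be at least 1")
    if est_rows_per_range is None or est_rows_per_range <= 0:
        return 1  # no usable estimate: query one range at a time
    # Assumed formula: enough ranges to cover rows_needed in a single round.
    wanted = math.ceil(rows_needed / est_rows_per_range)
    # Never query fewer than one range or more ranges than exist.
    return max(1, min(total_ranges, wanted))
```

With an estimate of 30 rows per range and a limit of 100 rows, this sketch 
would query 4 ranges in the first round.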

Reimplemented the parallel execution part to make it a lot cleaner IMO (the 
previous implementation adapted the sequential execution path, which made it 
difficult to read).

The cql_test.py dtest is failing in the same place as on trunk; I need to look 
into it to make sure Sylvain's dtest passes.
                
> parallelize fetching rows for low-cardinality indexes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-1337
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: David Alves
>            Priority: Minor
>             Fix For: 1.2.1
>
>         Attachments: 1137-bugfix.patch, 1337.patch, 
> ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt,
>  CASSANDRA-1337.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> currently, we read the indexed rows from the first node (in partitioner 
> order); if that does not have enough matching rows, we read the rows from the 
> next, and so forth.
> we should use the statistics from CASSANDRA-1155 to query multiple nodes in 
> parallel, such that we have a high chance of getting enough rows w/o having 
> to do another round of queries (but, if our estimate is incorrect, we do need 
> to loop and do more rounds until we have enough data or we have fetched from 
> each node).
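
The round-based loop described in the issue can be sketched as follows. This is 
a minimal illustration rather than the code in the attached patches: 
`fetch_rows`, `fetch_range`, and the representation of ranges as a plain list 
in partitioner order are hypothetical names and simplifications chosen for the 
example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def fetch_rows(ranges: Sequence[str],
               limit: int,
               factor: int,
               fetch_range: Callable[[str], List[object]]) -> List[object]:
    """Query `factor` ranges per round until `limit` rows are collected
    or every range has been queried.

    `ranges` must be in partitioner order; `fetch_range` is a hypothetical
    callback that returns the matching rows for one range.
    """
    factor = max(1, factor)
    rows: List[object] = []
    next_range = 0
    with ThreadPoolExecutor(max_workers=factor) as pool:
        # Loop again if the estimate was too optimistic and rows are missing.
        while len(rows) < limit and next_range < len(ranges):
            batch = ranges[next_range:next_range + factor]
            next_range += len(batch)
            # pool.map yields results in input order, so rows stay in
            # partitioner order even though the fetches run concurrently.
            for result in pool.map(fetch_range, batch):
                rows.extend(result)
    return rows[:limit]
```

With `factor=1` this degenerates to the current sequential behaviour: one range 
per round, stopping as soon as enough rows have been read.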

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
