[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

Sam Tunnicliffe (JIRA) Tue, 28 Apr 2015 07:11:29 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517067#comment-14517067
 ]


Sam Tunnicliffe commented on CASSANDRA-8717:
--------------------------------------------

Generally, I think this is ok for 2.1, modulo a couple of points:

A thing that concerns me is that quite a number of {{AbstractRangeCommand}} 
instances are created during execution of an index query, particularly on the 
coordinator. The primary {{RangeSliceCommand}} is created in 
{{SelectStatement#getPageableCommand}}, then a {{PagedRangeCommand}} is 
instantiated for each page of results in 
{{RangeSliceQueryPager#queryNextPage}}. Then, as each of those is pushed down 
to {{StorageProxy}} the fan out creates another instance per sub range. The 
fact that each of these instances performs the selectivity calculation is a bit 
of a pain and somewhat wasteful, though it shouldn't be overly expensive & the 
alternative would involve more change than I'd be comfortable with putting into 
2.1.

What I do think should be addressed before including this in 2.1 is that quite 
a few more tracing events are now emitted as with tracing enabled, each time 
{{SIS#highestSelectivityPredicate}} is called it we log an event. Perhaps you 
could add a boolean argument to {{highestSelectivityPredicate}} to indicate 
whether or not the tracing event should be emitted. I think that only the call 
from the {{SIS#search}} implementations would actually want to do that, so 
{{SIS#highestSelectivityIndex}} could just pass false. 

This would actually produce fewer trace events than we do currently, only 1 per 
command per replica, but I think that would actually be more correct/useful. 
FTR, I know that the whole process of looking up & selecting a searcher needs 
seriously reworking (& I'm hoping to get chance to do that very soon).

The patch introduces quite a bit of duplication in {{SecondaryIndexManager}} 
between the new {{getHighestSelectivityIndexSearcher}} method and {{search}}. 
It would be straightforward to refactor the latter to call the former.

The wording of the javadoc for {{SIS#postReconciliationProcessing}} could be a 
little clearer. In its current form, it could be taken to imply that a number 
of index queries may be executed by an index searcher instance on a single node 
& their results combined. I think it would be better to make it clear that 
reconcilliation happens on the coordinator and not on the replica(s) where the 
search happens.

> Top-k queries with custom secondary indexes
> -------------------------------------------
>
>                 Key: CASSANDRA-8717
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Andrés de la Peña
>            Assignee: Andrés de la Peña
>            Priority: Minor
>              Labels: 2i, secondary_index, sort, sorting, top-k
>             Fix For: 3.0
>
>         Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch, 
> 0002-Add-support-for-top-k-queries-in-2i.patch, 
> 0003-Add-support-for-top-k-queries-in-2i.patch
>
>
> As presented in [Cassandra Summit Europe 
> 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
> modified to support general top-k queries with minimum changes in Cassandra 
> codebase. This way, custom 2i implementations could provide relevance search, 
> sorting by columns, etc.
> Top-k queries retrieve the k best results for a certain query. That implies 
> querying the k best rows in each token range and then sort them in order to 
> obtain the k globally best rows. 
> For doing that, we propose two additional methods in class 
> SecondaryIndexSearcher:
> {code:java}
> public boolean requiresFullScan(List<IndexExpression> clause)
> {
>     return false;
> }
> public List<Row> sort(List<IndexExpression> clause, List<Row> rows)
> {
>     return rows;
> }
> {code}
> The first one indicates if a query performed in the index requires querying 
> all the nodes in the ring. It is necessary in top-k queries because we do not 
> know which node are the best results. The second method specifies how to sort 
> all the partial node results according to the query. 
> Then we add two similar methods to the class AbstractRangeCommand:
> {code:java}
>     this.searcher = 
> Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
> public boolean requiresFullScan() {
>     return searcher == null ? false : searcher.requiresFullScan(rowFilter);
> }
> public List<Row> combine(List<Row> rows)
> {
>     return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
> rows));
> }
> {code}
> Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
> shown in the attached patch.
> We think that the proposed approach provides very useful functionality with 
> minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

Reply via email to