[jira] [Commented] (CASSANDRA-7016) can't map/reduce over subset of rows with cql

Sylvain Lebresne (JIRA) Mon, 25 Aug 2014 06:40:13 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109112#comment-14109112 ]


Sylvain Lebresne commented on CASSANDRA-7016:
---------------------------------------------

A few remarks:
* We should move the job of {{mergeKeyRestrictions}} to the beginning of the 
loop of {{processPartitionKeyRestrictions}}. No reason to incur this on each 
execution (versus once at preparation time), and it'll avoid the multiple 
inlining of {{stmt.keyRestrictions\[i\]}} inside the loop.
* In TokenFilter, we shouldn't use {{ByteBuffer}} comparison but {{Token}} 
comparison (so we should have a {{RangeSet<Token>}}, etc.).
* In the test, I'd prefer to reduce the number of test functions by merging 
those that apply to the same table (especially since we're only testing 
selects). Re-creating the table for every test adds to the test running time, 
which is pretty long already. Plus, I personally find that 5 lines of clutter 
for every one true line of test is a bit excessive, and reducing the number of 
test methods cuts that down. Note that I understand that having multiple test 
lines in the same test function means that an early failure will "hide" the 
following ones, but since the number of acceptable test failures is 0 anyway, 
I consider readability and execution time to be more important.
* Regarding the test, you also don't need to have 2 tests like:
 {noformat}
 execute("SELECT * FROM %s WHERE token(a) > token(?)", 2)
 execute("SELECT * FROM %s WHERE token(a) > token(2)")
 {noformat}
 The first form is enough: CQLTester is smart enough to generate the 2nd form 
automatically. This also means that ideally all statements would use the first 
"prepared" form (so we get both prepared and non-prepared tests for free).
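
To illustrate the TokenFilter remark above, here is a minimal sketch of why 
comparing serialized bytes can disagree with comparing tokens. It assumes 
tokens are signed 64-bit values serialized as 8 big-endian bytes; 
{{LongTokenSketch}} and {{TokenComparisonDemo}} are hypothetical names for 
illustration only and are not Cassandra's actual {{Token}} classes.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a partitioner token holding a signed 64-bit value.
// This is NOT Cassandra's Token class; it only illustrates the ordering issue.
final class LongTokenSketch implements Comparable<LongTokenSketch> {
    final long value;

    LongTokenSketch(long value) {
        this.value = value;
    }

    // Assumption: the token is serialized as 8 big-endian bytes.
    ByteBuffer toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        buf.putLong(value);
        buf.flip();
        return buf;
    }

    // Token order: plain signed comparison of the 64-bit values.
    @Override
    public int compareTo(LongTokenSketch other) {
        return Long.compare(value, other.value);
    }
}

final class TokenComparisonDemo {
    // Returns true when ByteBuffer order and token order agree for (a, b).
    static boolean byteOrderAgrees(long a, long b) {
        LongTokenSketch ta = new LongTokenSketch(a);
        LongTokenSketch tb = new LongTokenSketch(b);
        int byToken = Integer.signum(ta.compareTo(tb));
        // ByteBuffer.compareTo compares element by element using signed bytes.
        int byBytes = Integer.signum(ta.toBytes().compareTo(tb.toBytes()));
        return byToken == byBytes;
    }
}
```

With these assumptions, {{byteOrderAgrees(1L, 2L)}} is true, but 
{{byteOrderAgrees(128L, 1L)}} is false: the low byte 0x80 is negative as a 
signed byte, so the buffer for 128 sorts before the buffer for 1. A range set 
keyed on the token type would not have this problem.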


> can't map/reduce over subset of rows with cql
> ---------------------------------------------
>
>                 Key: CASSANDRA-7016
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7016
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core, Hadoop
>            Reporter: Jonathan Halliday
>            Assignee: Benjamin Lerer
>            Priority: Minor
>              Labels: cql
>             Fix For: 2.1.1
>
>         Attachments: CASSANDRA-7016-V2.txt, CASSANDRA-7016.txt
>
>
> select ... where token(k) < x and token(k) >= y and k in (a,b) allow 
> filtering;
> This fails on 2.0.6: can't restrict k by more than one relation.
> In the context of map/reduce (hence the token range) I want to map over only 
> a subset of the keys (hence the 'in').  Pushing the 'in' filter down to cql 
> is substantially cheaper than pulling all rows to the client and then 
> discarding most of them.
> Currently this is possible only if the hadoop integration code is altered to 
> apply the AND on the client side and use cql that contains only the resulting 
> filtered 'in' set.  The problem is not hadoop specific though, so IMO it 
> should really be solved in cql not the hadoop integration code.
> Most restrictions on cql syntax seem to exist to prevent unduly expensive 
> queries. This one seems to be doing the opposite.
> Edit: on further thought and with reference to the code in 
> SelectStatement$RawStatement, it seems to me that  token(k) and k should be 
> considered distinct entities for the purposes of processing restrictions. 
> That is, no restriction on the token should conflict with a restriction on 
> the raw key. That way any monolithic query in terms of k can be decomposed 
> into parallel chunks over the token range for the purposes of map/reduce 
> processing simply by appending a 'and where token(k)...' clause to the 
> existing 'where k ...'.
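
The decomposition described in the issue can be sketched roughly as follows. 
This is only an illustration under stated assumptions: tokens are taken to 
span the signed 64-bit range, the server is assumed to accept a token 
restriction alongside a restriction on the key, and {{TokenRangeSplitter}} 
with its {{split}} method is a hypothetical helper, not part of Cassandra or 
its Hadoop integration.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: appends token(k) bounds to an existing query so that
// each resulting query covers one slice of the token range.
final class TokenRangeSplitter {
    // baseQuery is assumed to already contain a WHERE clause on the key.
    static List<String> split(String baseQuery, String keyColumn, int chunks) {
        if (chunks < 1) {
            throw new IllegalArgumentException("chunks must be >= 1");
        }
        // Assumption: tokens cover the full signed 64-bit range.
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
        BigInteger width = max.subtract(min);
        BigInteger n = BigInteger.valueOf(chunks);

        List<String> queries = new ArrayList<>();
        for (int i = 0; i < chunks; i++) {
            // BigInteger avoids overflow when computing slice boundaries.
            BigInteger start = min.add(width.multiply(BigInteger.valueOf(i)).divide(n));
            BigInteger end = min.add(width.multiply(BigInteger.valueOf(i + 1L)).divide(n));
            // The first slice includes its lower bound; the others exclude it
            // so that adjacent slices do not overlap.
            String lowerOp = (i == 0) ? ">=" : ">";
            queries.add(String.format("%s AND token(%s) %s %d AND token(%s) <= %d",
                    baseQuery, keyColumn, lowerOp, start, keyColumn, end));
        }
        return queries;
    }
}
```

For example, {{split("SELECT * FROM t WHERE k IN (1, 2)", "k", 2)}} yields one 
query bounded by {{token(k) >= -9223372036854775808 AND token(k) <= -1}} and 
another bounded by {{token(k) > -1 AND token(k) <= 9223372036854775807}}.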



--
This message was sent by Atlassian JIRA
(v6.2#6252)
