Re: How does Cassandra optimize this query?
On Mon, Nov 5, 2012 at 4:12 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
> Is this query the equivalent of a full table scan? Without a starting point get_range_slice is just starting at token 0?

It is, but that's what you asked for after all. If you want to start at a given token, you can do:

    SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(video) > 'whatevertokenyouwant'

You can even do:

    SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(video) > token(99051fe9-6a9c-46c2-b949-38ef78858dd0)

if that's simpler for you than computing the token manually. That is mostly for random partitioners, though. For ordered ones, you can do without the token() part.

-- Sylvain
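To make the "start at a given token" idea concrete, here is a minimal Python sketch of a token-ordered scan. It is a toy model and not Cassandra code: `toy_token`, `range_scan`, and the `videos` dictionary are hypothetical names, and the MD5-based token only loosely imitates what a random partitioner does.

```python
import hashlib


def toy_token(key: str) -> int:
    """Hypothetical stand-in for a partitioner token: MD5 of the key as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def range_scan(rows, start_token=None, limit=None):
    """Return (key, row) pairs in token order, strictly after start_token."""
    ordered = sorted(rows.items(), key=lambda item: toy_token(item[0]))
    page = []
    for key, row in ordered:
        if start_token is not None and toy_token(key) <= start_token:
            continue  # skip everything at or before the starting token
        if limit is not None and len(page) >= limit:
            break  # the page is full
        page.append((key, row))
    return page


# Ten made-up rows; every third one carries the name we are looking for.
videos = {
    f"video-{i}": {"videoname": "My funny cat" if i % 3 == 0 else "Other"}
    for i in range(10)
}

# Without a starting token the scan begins at the lowest token.
first_page = range_scan(videos, limit=4)

# Resume after the last key seen, as token(video) > token(<last key>) would.
last_key = first_page[-1][0]
second_page = range_scan(videos, start_token=toy_token(last_key), limit=4)
```

Passing the token of the last key from one page as the starting point of the next is what lets a client walk the whole ring in bounded steps instead of one unbounded scan.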
Re: How does Cassandra optimize this query?
I see. It is fairly misleading because it is a query that does not work at scale. This syntax is only helpful if you have fewer than a few thousand rows in Cassandra.
Re: How does Cassandra optimize this query?
On Mon, Nov 5, 2012 at 6:55 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
> I see. It is fairly misleading because it is a query that does not work at scale. This syntax is only helpful if you have fewer than a few thousand rows in Cassandra.

Just for the sake of argument, how is that misleading? If you have billions of rows and run the select statement from your initial mail, what did the syntax lead you to believe it would return?

A remark like "maybe we just shouldn't allow that and leave that to the map-reduce side" would make sense, but I don't see how this is misleading.

But again, this translates directly to a get_range_slice (which doesn't scale if you have billions of rows and don't limit the output either), so there is nothing new here.
Re: How does Cassandra optimize this query?
> A remark like maybe we just shouldn't allow that and leave that to the map-reduce side would make sense, but I don't see how this is misleading.

Yes. Bingo. It is misleading because it is not useful in any context besides someone playing around with a ten-row table in cqlsh. CQL stops me from executing some queries that are not efficient, yet it allows this one. If I am new to Cassandra and developing, this query works and produces a result; then, once my database gets real data, it produces a different result (likely an empty one).

When I first saw this query, two things came to mind:

1) CQL (and Cassandra) must somehow be indexing all the fields of a primary key to make this search optimal.
2) This is impossible; CQL must be gathering the first hundred random rows and finding this thing.

What is happening is case #2. In a nutshell, CQL is just sampling some data and running the query on it. We could support all types of query constructs if we just took the first 100 rows and applied this logic to them, but these things are not helpful for anything but light ad-hoc data exploration.

My suggestions:

1) Force people to supply a LIMIT clause on any query that is going to page over get_range_slice.
2) Have some type of explain support so I can establish if this query will work in the

I say this because, as an end user, I do not understand if a given query is actually going to return the same results with different data.
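Edward's concern, that the same query finds the row on a small table and comes back empty on a large one, can be illustrated with a toy Python model. This is only a sketch of the behaviour as he describes it: `scan_then_filter` and `make_table` are hypothetical helpers, and the `scan_limit` of 100 is taken from his "first hundred rows" wording, not from any documented Cassandra setting.

```python
def scan_then_filter(rows, predicate, scan_limit):
    """Look at only the first scan_limit rows, then apply the filter to those."""
    scanned = rows[:scan_limit]
    return [row for row in scanned if predicate(row)]


def make_table(n_rows, target_position):
    """Build n_rows dummy rows with exactly one 'My funny cat' row."""
    rows = [{"id": i, "videoname": "Other"} for i in range(n_rows)]
    rows[target_position]["videoname"] = "My funny cat"
    return rows


def is_cat(row):
    return row["videoname"] == "My funny cat"


# Ten-row development table: the target row is inside the scanned window.
small = make_table(10, target_position=7)
small_result = scan_then_filter(small, is_cat, scan_limit=100)

# Production-sized table: the target row sits far beyond the window.
large = make_table(100_000, target_position=70_000)
large_result = scan_then_filter(large, is_cat, scan_limit=100)

print(len(small_result), len(large_result))
```

In this model the query text never changes, yet the answer depends on whether the matching row happens to fall within the scanned prefix, which is the inconsistency the two suggestions (a mandatory LIMIT and some form of explain support) are meant to surface.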
Re: How does Cassandra optimize this query?
Ok, I slightly misunderstood your initial complaint, my bad. I largely agree with you, though I'm more conflicted on what the right resolution is. But I'll follow up on the ticket to avoid repetition.

On Mon, Nov 5, 2012 at 10:42 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
> I created https://issues.apache.org/jira/browse/CASSANDRA-4915
Re: How does Cassandra optimize this query?
On Sun, Nov 4, 2012 at 7:49 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
> CQL3 allows me to search the second component of a primary key, which really just seems to be component 1 of a composite column. So what Thrift operation does this correspond to? This looks like a column slice without specifying a key. How does this work internally?

get_range_slice (with the right slice predicate to select the columns where the first component == 'My funny cat')

-- Sylvain
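As a rough picture of what "a range slice with a predicate on the first component" means, the following Python sketch walks every row and keeps only the columns whose composite name starts with the requested value. It is a simplified model under stated assumptions, not the Thrift API: `range_slice_model`, the `storage` layout (row key mapped to columns keyed by a `(videoname, column)` tuple), and the sample data are all hypothetical.

```python
def range_slice_model(storage, first_component):
    """Visit every row; keep columns whose composite name starts with first_component."""
    matches = {}
    for row_key, columns in storage.items():
        selected = {
            name: value
            for name, value in columns.items()
            if name[0] == first_component  # predicate on component 1 of the composite
        }
        if selected:
            matches[row_key] = selected
    return matches


# Each row key maps to columns named by (videoname, column) tuples.
storage = {
    "video-1": {
        ("My funny cat", "description"): "A cat video",
        ("My funny cat", "tags"): "cats",
    },
    "video-2": {("Some other video", "description"): "Not a cat"},
    "video-3": {("My funny cat", "description"): "Another cat"},
}

result = range_slice_model(storage, "My funny cat")
print(sorted(result))
```

Note that the loop touches every row key regardless of how many match, which is why the thread treats this query as equivalent to a full table scan.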