Re: How does Cassandra optimize this query?

2012-11-05 Thread Sylvain Lebresne
On Mon, Nov 5, 2012 at 4:12 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

> Is this query the equivalent of a full table scan?  Without a starting
> point get_range_slice is just starting at token 0?


It is, but that's what you asked for after all. If you want to start at a
given token you can do:
  SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(video) > 'whatevertokenyouwant'
You can even do:
  SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(video) > token(99051fe9-6a9c-46c2-b949-38ef78858dd0)
if that's simpler for you than computing the token manually. Though that is
mostly for random partitioners. For ordered ones, you can do without the
token() part.

--
Sylvain
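
The idea of restarting a range scan from a given token can be sketched in Python. This is a toy model, not Cassandra code: the `token` function (an MD5 hash standing in for a random partitioner), the `range_scan` helper and the `video-N` keys are all hypothetical names introduced for illustration.

```python
import hashlib
from bisect import bisect_right


def token(key: str) -> int:
    """Toy stand-in for a random partitioner: hash the key to an integer."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")


def range_scan(table, start_token=None, limit=2):
    """Return up to `limit` (key, row) pairs whose token is > start_token,
    in token order. With no start token the scan begins at the lowest token."""
    ordered = sorted(table.items(), key=lambda kv: token(kv[0]))
    tokens = [token(k) for k, _ in ordered]
    start = 0 if start_token is None else bisect_right(tokens, start_token)
    return ordered[start:start + limit]


# Hypothetical data: five rows keyed by a made-up video id.
table = {f"video-{i}": {"videoname": f"name {i}"} for i in range(5)}

# Page through the whole table, resuming each page after the last key's token.
seen = []
page = range_scan(table)
while page:
    seen.extend(k for k, _ in page)
    last_key = page[-1][0]
    page = range_scan(table, start_token=token(last_key))
```

Each page resumes strictly after the token of the last key returned, which mirrors the `token(video) > token(...)` form: the caller never computes a token by hand, it only passes back a key it has already seen.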


Re: How does Cassandra optimize this query?

2012-11-05 Thread Edward Capriolo
I see. It is fairly misleading because it is a query that does not
work at scale. This syntax is only helpful if you have fewer than a few
thousand rows in Cassandra.



Re: How does Cassandra optimize this query?

2012-11-05 Thread Sylvain Lebresne
On Mon, Nov 5, 2012 at 6:55 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

> I see. It is fairly misleading because it is a query that does not
> work at scale. This syntax is only helpful if you have fewer than a few
> thousand rows in Cassandra.


Just for the sake of argument, how is that misleading? If you have billions
of rows and run the select statement from your initial mail, what did the
syntax lead you to believe it would return?

A remark like "maybe we just shouldn't allow that and leave that to the
map-reduce side" would make sense, but I don't see how this is misleading.

But again, this translates directly to a get_range_slice (which doesn't scale
either if you have billions of rows and don't limit the output), so there is
nothing new here.
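
To make the scaling concern concrete, here is a minimal Python sketch of a range scan that filters on a non-key column. It is a toy model under stated assumptions, not Cassandra's actual read path: `filtered_range_scan`, `make_rows` and the generated keys are hypothetical, and rows are modelled as a plain list already in token order.

```python
def filtered_range_scan(rows, videoname, limit=None):
    """Walk rows in order, keep keys whose name matches, and count how many
    rows had to be examined. Stops early only once `limit` matches are found."""
    matches, examined = [], 0
    for key, name in rows:
        examined += 1
        if name == videoname:
            matches.append(key)
            if limit is not None and len(matches) >= limit:
                break
    return matches, examined


def make_rows(n, target_position):
    """Build n hypothetical rows with a single matching row at target_position."""
    rows = [(f"k{i}", f"video {i}") for i in range(n)]
    rows[target_position] = (f"k{target_position}", "My funny cat")
    return rows


small = make_rows(10, 9)
large = make_rows(100_000, 99_999)

small_result = filtered_range_scan(small, "My funny cat")
large_result = filtered_range_scan(large, "My funny cat")
```

In this model the work is proportional to the number of rows examined, not to the number of rows returned: the same one-row answer costs 10 reads on the small table and 100,000 on the large one, and a limit only helps once enough matches have been found.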


Re: How does Cassandra optimize this query?

2012-11-05 Thread Edward Capriolo
> A remark like maybe we just shouldn't allow that and leave that to the
> map-reduce side would make sense, but I don't see how this is misleading.

Yes. Bingo.

It is misleading because it is not useful in any context besides
someone playing around with a ten-row table in cqlsh. CQL stops me
from executing some queries that are not efficient, yet it allows this
one. If I am new to Cassandra and developing, this query works and
produces a result; then, once my database gets real data, it produces a
different result (likely an empty one).

When I first saw this query two things came to my mind.

1) CQL (and Cassandra) must be somehow indexing all the fields of a
primary key to make this search optimal.

2) This is impossible; CQL must be gathering the first hundred random
rows and finding this thing.

What is happening is case #2. In a nutshell, CQL is just sampling
some data and running the query on it. We could support all types of
query constructs if we just took the first 100 rows and applied this
logic to them, but these things are not helpful for anything but light
ad-hoc data exploration.

My suggestions:
1) force people to supply a LIMIT clause on any query that is going to
page over get_range_slice
2) having some type of explain support so I can establish if this
query will work in the

I say this because as an end user I do not understand if a given query
is actually going to return the same results with different data.



Re: How does Cassandra optimize this query?

2012-11-05 Thread Sylvain Lebresne
OK, I slightly misunderstood your initial complaint, my bad. I largely agree
with you, though I'm more conflicted on what the right resolution is. But
I'll follow up on the ticket to avoid repetition.


On Mon, Nov 5, 2012 at 10:42 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

> I created https://issues.apache.org/jira/browse/CASSANDRA-4915




Re: How does Cassandra optimize this query?

2012-11-04 Thread Sylvain Lebresne
On Sun, Nov 4, 2012 at 7:49 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

> CQL3 Allows me to search the second component of a primary key. Which
> really just seems to be component 1 of a composite column.
>
> So what thrift operation does this correspond to? This looks like a
> column slice without specifying a key? How does this work internally?


get_range_slice (with the right slice predicate to select the columns where
the first component == 'My funny cat')

--
Sylvain
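
The combination described above, a range scan over rows plus a slice predicate on the first component of each composite column name, can be sketched as follows. This is a simplified Python model and not the Thrift API: `slice_first_component`, the `get_range_slice` function defined here, and the sample rows are all assumptions made for illustration. Columns within a row are assumed to be kept sorted by their composite name.

```python
from bisect import bisect_left, bisect_right


def slice_first_component(columns, first):
    """columns: list of ((component1, component2), value) sorted by name.
    Return the contiguous run of columns whose first component equals `first`."""
    firsts = [name[0] for name, _ in columns]
    lo = bisect_left(firsts, first)
    hi = bisect_right(firsts, first)
    return columns[lo:hi]


def get_range_slice(rows, first):
    """Visit every row (no key given) and apply the same column slice to each,
    keeping only rows where the slice is non-empty."""
    result = {}
    for key, columns in rows.items():
        cols = slice_first_component(columns, first)
        if cols:
            result[key] = cols
    return result


# Hypothetical rows: row key -> sorted composite columns (videoname, field).
rows = {
    "id-1": [(("My funny cat", "description"), "a cat"),
             (("My funny cat", "tags"), "cats")],
    "id-2": [(("Another video", "description"), "something else")],
    "id-3": [(("A dog", "description"), "d"),
             (("My funny cat", "description"), "c"),
             (("Zebra", "description"), "z")],
}

result = get_range_slice(rows, "My funny cat")
```

Within a single row the slice is cheap because matching columns are contiguous in sort order; the expensive part in this model is that every row has to be visited to find out whether its slice is empty.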