[
https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491820#comment-13491820
]
Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------
bq. and relying on the limit being X vs 10X or 0.1X is silly
While I agree on the silliness, I don't fully share your optimism that people
won't start relying on it. I also would prefer being able to clearly specify
that "without limit we return as many results as there are, with the technical
limitation that it's Integer.MAX_VALUE", rather than having to settle for
"without limit we return results with a limit that depends on the weather and
the exact value of which you shouldn't rely on". I also think that having an
arbitrary default limit is a very bad OOM protection (I think it's still fairly
easy to OOM even with the 10,000 limit unless you are mindful of your query).
But I'd rather discuss that in CASSANDRA-4918 for the sake of not mixing
unrelated issues.
Because I do think there is an issue here that has nothing to do whatsoever
with the limit and preventing OOMing. That issue is that we allow some queries
that do not scale with the number of records in the database. And to be clear,
'not scale with the number of records in the database' means that even for a
*constant* query output it doesn't scale. Those queries are:
# the one in the description of this ticket
# as Jonathan said (and I don't disagree with his statement), secondary index
queries with additional restrictions.
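To make "doesn't scale even for a constant query output" concrete, here is a
toy Python model. It is only a sketch and not Cassandra code: the in-memory
dict standing in for the table, and the names build_table and
select_by_videoname, are assumptions made up for illustration. It filters on a
non-key column, as the query in the description does, and counts how many rows
have to be examined to produce a single result.

```python
import uuid


def build_table(n):
    """Toy table keyed by partition key (videoid); exactly one row matches."""
    table = {uuid.uuid4(): {"videoname": f"video {i}"} for i in range(n)}
    table[uuid.uuid4()] = {"videoname": "My funny cat"}
    return table


def select_by_videoname(table, name):
    """Filter on a non-key column: every partition has to be examined."""
    examined = 0
    matches = []
    for key, row in table.items():
        examined += 1
        if row["videoname"] == name:
            matches.append(key)
    return matches, examined


for n in (10, 1000, 10000):
    matches, examined = select_by_videoname(build_table(n), "My funny cat")
    print(f"rows: {n + 1:>6}  results: {len(matches)}  examined: {examined}")
```

In this model the result count stays at one while the number of rows examined
grows with the table, which is the sense of "not scalable" used here.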
Now I agree that we can't completely protect people against those short of
refusing the queries. But I do think we have some discrepancies in what we
support and don't support: we refuse 'SELECT * FROM t WHERE partition_key = ..
AND clustering_key_part2 = ...' based on the argument that because
clustering_key_part1 is not provided, we would have to do a full scan of the
internal row and the inefficiency of that would be too surprising for the user.
But we do allow the query in the description of this ticket even though
honestly it's the same kind of query (i.e., it's a query where we don't have
*any* index to really start with).
And I don't like discrepancies. Or in other words, we've claimed that an
advantage of Cassandra is that query performance is predictable, but
queries that for the same output (even a very small one) have an execution time
that is proportional to the number of records in the database are imho the exact
definition of query performance being non-predictable (or at least
non-scalable). So I think it would be of interest to clarify what it is exactly
that we guarantee in terms of query performance being predictable. And for that
I see a number of options:
# We leave things as they are, but then the rules for when a query will have
predictable performance (which for me means that the performance will be almost
only dependent on the query output) are fairly opaque and not very coherent.
And in particular in that case it feels random to refuse queries that would
require a full internal row scan when we happily do the ones that require an
entire ring scan.
# We get strict about allowing only queries that we can guarantee have
predictable performance (with the definition above that I think is reasonable).
That does mean refusing the query in the description, but also indeed queries
on 2ndary indexes that have more than one restriction, which probably makes that
solution too restrictive to be desirable.
# We try to hit some middle ground, where while we allow some queries for which
we can't guarantee predictability, we at least make it so that the rule for
when predictability is guaranteed is easy to understand/follow. My proposition
for "ALLOW FULL SCAN" above was an attempt at that. If we allow that, and
unless I forget something, which is possible, I think we can say that: a query
will have predictable performance unless it either uses a 2ndary index or it
uses 'allow full scan'. And for 2ndary indexes we can refine that a bit and say
'it will still have guaranteed predictable performance if you only use one
restriction in the query'. But at least we'd have a clear guarantee without
2ndary indexes, and I do think that 1) it's very useful and 2) it's not crazy to
say that a 2ndary index involves more complex processing and thus offers fewer
guarantees in terms of predictability.
In favor of my third point, I want to mention that this is exactly the
guarantee that thrift provides today, because today a non-2ndary query in
thrift always gives you predictable performance in the sense that the query
performance will be proportional to the query output (that you can control with
the limit), because a get_range_slice in thrift (without IndexExpression) with
a count of 1 will only ever scan one row (and if that one row doesn't have
anything for the filter, the result will be an empty row), but that is *not*
how CQL3 works today.
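The difference can be sketched with another toy Python model. This is a
simplification of the behavior described above, not the real Thrift or CQL3
code paths: the function names thrift_like_range_slice and cql3_like_select,
the list-of-tuples row layout, and the predicate signature are all assumptions
for illustration.

```python
def thrift_like_range_slice(rows, predicate, count):
    """Scan at most `count` rows; a row with nothing matching comes back empty."""
    result = []
    scanned = 0
    for key, columns in rows[:count]:
        scanned += 1
        kept = {c: v for c, v in columns.items() if predicate(c, v)}
        result.append((key, kept))
    return result, scanned


def cql3_like_select(rows, predicate, limit):
    """Keep scanning until `limit` matching rows are found or data runs out."""
    result = []
    scanned = 0
    for key, columns in rows:
        scanned += 1
        if any(predicate(c, v) for c, v in columns.items()):
            result.append((key, columns))
            if len(result) == limit:
                break
    return result, scanned


def is_cat(col, val):
    return col == "videoname" and val == "My funny cat"


# 1000 non-matching rows followed by the single matching one.
rows = [(i, {"videoname": f"video {i}"}) for i in range(1000)]
rows.append((1000, {"videoname": "My funny cat"}))

print(thrift_like_range_slice(rows, is_cat, count=1))  # one row scanned, empty row back
print(cql3_like_select(rows, is_cat, limit=1)[1])      # number of rows scanned
```

In the first function the work is bounded by the count regardless of how much
data exists; in the second, a limit of 1 bounds only the output, and the scan
length depends on where (or whether) a match happens to be.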
> CQL should force limit when query samples data.
> -----------------------------------------------
>
> Key: CASSANDRA-4915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4915
> Project: Cassandra
> Issue Type: Improvement
> Affects Versions: 1.2.0 beta 1
> Reporter: Edward Capriolo
> Priority: Minor
>
> When issuing a query like:
> {noformat}
> CREATE TABLE videos (
> videoid uuid,
> videoname varchar,
> username varchar,
> description varchar,
> tags varchar,
> upload_date timestamp,
> PRIMARY KEY (videoid,videoname)
> );
> SELECT * FROM videos WHERE videoname = 'My funny cat';
> {noformat}
> Cassandra samples some data using get_range_slice and then applies the query.
> This is very confusing to me, because as an end user I am not sure if the query
> is fast because Cassandra is performing an optimized query (over an index, or
> using a slicePredicate) or if Cassandra is simply sampling some random rows
> and returning me some results.
> My suggestions:
> 1) force people to supply a LIMIT clause on any query that is going to
> page over get_range_slice
> 2) having some type of explain support so I can establish if this
> query will work in the
> I will champion suggestion 1) because CQL has put itself in a rather unique
> un-SQL-like position by applying an automatic limit clause without the user
> asking for it. I also do not believe the CQL language should let the user
> issue queries that will not work as intended with "larger-than-auto-limit"
> size data sets.