[jira] [Commented] (CASSANDRA-6588) Add a 'NO EMPTY RESULTS' filter to SELECT

Sylvain Lebresne (JIRA) Sun, 19 Jan 2014 16:30:24 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876059#comment-13876059
 ]


Sylvain Lebresne commented on CASSANDRA-6588:
---------------------------------------------

Let me try to clarify my points/position.

The issue this ticket wants to solve (the "we read all columns of a CQL row") 
is a minor one. I've explained why on CASSANDRA-6586, but the short version is 
that it noticeably affects only a small number of use cases, and even in those 
affected cases there is the workaround of making two tables. We could very well 
leave things as they are btw, I'm not throwing away that option, but it feels 
to me that the suggestion of this ticket requires little additional complexity 
(in terms of implementation and of complexity added to the CQL language) and is 
a bit better than the make-two-tables workaround.

Regarding per-row TTL, if we think there are major problems with the current TTL 
design that per-row TTL solves and that justify such a revamp, then let's 
maybe add a specific ticket with a rationale, a vague description of what the 
semantics would be and what the backward compatibility story looks like, and 
let's move on from there. But my initial analysis is that adding per-row TTL 
today in C* is likely far from simple. In particular, I'm not sure how the 
exact semantics could be kept the same as today, even if we ignore the case of 
actually wanting per-column TTL (not that I think we should ignore it :)). So 
I suspect per-row TTL may break a fair number of current TTL users. And what 
about thrift? The thrift API pretty clearly assumes that TTLs are per-cell, and 
we pinky swore we wouldn't break it. Do we really want to support 2 separate 
TTL implementations internally? Lastly, it doesn't seem like adding per-row 
TTL will be a trivial amount of code/work to write. Overall, this is a 
non-negligible number of drawbacks for switching to per-row TTL imo (granted, 
those are not intrinsic drawbacks of per-row TTL, but they are still things 
we'll have to deal with).

And the pros of switching to per-row TTL feel rather small? Yes, it might 
allow us to fix this issue, but as said above, this issue is minor. And yes, it 
might have a slightly smaller storage footprint than the current mechanism. 
But I've heard no one complaining that the TTL footprint is a big problem, 
especially now that we have sstable compression, so I really don't think it's a 
big deal. Even combined, those two still sound like pretty minor advantages 
imo. I could probably be convinced to leave per-column TTL out for the sake of 
those advantages if we were to rewrite C* today, but doing it now is not that 
simple.

So I continue to think that doing this issue, and leaving TTL be, is the 
better practical trade-off: breaking changes shouldn't be considered lightly, 
even if they don't break many people, and we have more important things to do 
imo. But again, if you guys think per-row TTL solves important problems, let's 
open a separate ticket and work out the pros and cons in more detail. I'm 
happy to hold off on this ticket for now if we do that. Just saying it sounds 
to me like a relatively clear waste of time/resources. I'll admit I might be 
wrong :)


> Add a 'NO EMPTY RESULTS' filter to SELECT
> -----------------------------------------
>
>                 Key: CASSANDRA-6588
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6588
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 2.1
>
>
> It is the semantic of CQL that a (CQL) row exists as long as it has one 
> non-null column (including the PK columns, which, given that no PK columns 
> can be null, means that it's enough to have the PK set for a row to exist). 
> This does mean that the result of
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v1 int, v2 int);
> INSERT INTO test(k, v1) VALUES (0, 4);
> SELECT v2 FROM test;
> {noformat}
> must be (and is)
> {noformat}
>  v2
> ------
>  null
> {noformat}
> That fact does mean, however, that when we only select a few columns of a 
> row, we still need to find the rows that exist but have no values for the 
> selected columns. Long story short, given how the storage engine works, this 
> means we need to query full (CQL) rows even when only some of the columns are 
> selected, because that's the only way to distinguish between "the row exists 
> but has no value for the selected columns" and "the row doesn't exist". I'll 
> note in particular that, due to CASSANDRA-5762, we unfortunately can't rely 
> on the row marker to optimize that out.
> Now, when you select only a subset of the columns of a row, there are many 
> cases where you don't care about rows that exist but have no value for the 
> columns you requested and are happy to filter those out. So, for those cases, 
> we could provide a new SELECT filter. Outside the potential convenience (not 
> having to filter empty results client side), one interesting part is that 
> when this filter is provided, we could optimize a bit by only querying the 
> columns selected, since we wouldn't need to return rows that exist but have 
> no values for the selected columns.
> For the exact syntax, there are probably a bunch of options. For instance:
> * {{SELECT NON EMPTY(v2, v3) FROM test}}: the vague rationale for putting it 
> in the SELECT part is that such a filter is kind of in the spirit of 
> DISTINCT. Possibly a bit ugly outside of that.
> * {{SELECT v2, v3 FROM test NO EMPTY RESULTS}} or {{SELECT v2, v3 FROM test 
> NO EMPTY ROWS}} or {{SELECT v2, v3 FROM test NO EMPTY}}: the last one is 
> shorter but maybe a bit less explicit. As for {{RESULTS}} versus {{ROWS}}, 
> the only small objection to {{NO EMPTY ROWS}} could be that it might suggest 
> it is filtering non-existing rows (I mean, the fact we never ever return 
> non-existing rows should hint that it's not what it does but well...) while 
> we're just filtering empty "resultSet rows".
> Of course, if there is a pre-existing SQL syntax for that, it's even better, 
> though a very quick search didn't turn anything up. Other suggestions welcome 
> too.
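
For readers who want the semantics described above in concrete form, here is a
minimal Python sketch. It is only an in-memory model under stated assumptions,
not Cassandra code: the names select, no_empty and test_table are hypothetical,
a table is modelled as a dict from primary key to the non-null columns of that
row, and the second row (k = 1) is added purely for illustration. It shows why
a row with only v1 set still yields a null result for v2, and what a
'NO EMPTY RESULTS'-style filter would drop.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: primary key -> {column name: value}.
# A column that is absent from the inner dict is null for that row.
Table = Dict[int, Dict[str, int]]


def select(table: Table, columns: List[str],
           no_empty: bool = False) -> List[Tuple[Optional[int], ...]]:
    """Project `columns` from every existing row, in primary key order.

    With no_empty=False, a row that exists but has no value for any of the
    selected columns still produces a result made only of nulls.
    With no_empty=True (the filter proposed in the ticket), such all-null
    results are dropped.
    """
    results = []
    for pk in sorted(table):
        row = table[pk]
        projected = tuple(row.get(col) for col in columns)
        if no_empty and all(value is None for value in projected):
            continue  # row exists, but nothing selected has a value
        results.append(projected)
    return results


# INSERT INTO test(k, v1) VALUES (0, 4);
test_table: Table = {0: {"v1": 4}}
# Extra row, assumed here for illustration only.
test_table[1] = {"v1": 5, "v2": 6}

print(select(test_table, ["v2"]))                 # [(None,), (6,)]
print(select(test_table, ["v2"], no_empty=True))  # [(6,)]
```

The sketch models only the result set. The optimization discussed in the
ticket (reading just the selected columns from storage when the filter is
present) is not represented here.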



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
