[ 
https://issues.apache.org/jira/browse/CASSANDRA-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921074#comment-13921074
 ] 

Sylvain Lebresne commented on CASSANDRA-6588:
---------------------------------------------

It occured to me that it's not at all impossible to optimize this all out at 
the storage engine level (and that's probably the right solution).

Let me first sum up quickly the problem we're actually trying to solve here: 
when you query just one CQL row and only select some of it's columns (and 
*only* in that), we can't use a NamesQueryFilter underneath just because if we 
get back no result we're not able to distinguish between "the row exists but 
has not data for those columns that have been selected" and "the row doesn't 
exist".  So instead we currently issue a SliceQueryFilter for the whole CQL 
row, which can be slower than if we were able to use a NamesQueryFilter because:
# NamesQueryFilter uses the CollationController.collectTimeOrderedData() path, 
that can potentially skip some sstables.
# NamesQueryFilter avoids sending the value for the columns of the CQL row that 
are not selected to the coordinator to have them ignored later (it doesn't 
matter so much as far as disk reading is concerned since we don't really read 
cells from disk one by one).

So anyway, we could specialize a new RowQueryFilter (which would be the new 
NamesQueryFilter for CQL3 tables). That filter would use the 
collectTimeOrderedData() path and would only return the columns queried (+ the 
row marker), but at the sstable level, it would read from the beginning of the 
CQL row and as soon as it encounter a live column, it would add the row marker 
to the result, but otherwise it would skip any column that is not part of the 
selected ones. In other words, why we can't rely on the row marker being here 
due to TTL, it's not too hard when deserializing the sstable to generate a 
"fake one" for the purpose of the query, but to do avoid doing any extra work 
otherwise.

As a side note, we could actually reuse that same idea for SliceQueryFilter 
(i.e. have a slice filter but that only care about a subset of the CQL row 
columns), which would improve the case for slices (when you only select a 
subset of the columns that is).


> Add a 'NO EMPTY RESULTS' filter to SELECT
> -----------------------------------------
>
>                 Key: CASSANDRA-6588
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6588
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 2.1 beta2
>
>
> It is the semantic of CQL that a (CQL) row exists as long as it has one 
> non-null column (including the PK columns, which, given that no PK columns 
> can be null, means that it's enough to have the PK set for a row to exist). 
> This does means that the result to
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v1 int, v2 int);
> INSERT INTO test(k, v1) VALUES (0, 4);
> SELECT v2 FROM test;
> {noformat}
> must be (and is)
> {noformat}
>  v2
> ------
>  null
> {noformat}
> That fact does mean however that when we only select a few columns of a row, 
> we still need to find out rows that exist but have no values for the selected 
> columns. Long story short, given how the storage engine works, this means we 
> need to query full (CQL) rows even when only some of the columns are selected 
> because that's the only way to distinguish between "the row exists but have 
> no value for the selected columns" and "the row doesn't exist". I'll note in 
> particular that, due to CASSANDRA-5762, we can't unfortunately rely on the 
> row marker to optimize that out.
> Now, when you selects only a subsets of the columns of a row, there is many 
> cases where you don't care about rows that exists but have no value for the 
> columns you requested and are happy to filter those out. So, for those cases, 
> we could provided a new SELECT filter. Outside the potential convenience (not 
> having to filter empty results client side), one interesting part is that 
> when this filter is provided, we could optimize a bit by only querying the 
> columns selected, since we wouldn't need to return rows that exists but have 
> no values for the selected columns.
> For the exact syntax, there is probably a bunch of options. For instance:
> * {{SELECT NON EMPTY(v2, v3) FROM test}}: the vague rational for putting it 
> in the SELECT part is that such filter is kind of in the spirit to DISTINCT.  
> Possibly a bit ugly outside of that.
> * {{SELECT v2, v3 FROM test NO EMPTY RESULTS}} or {{SELECT v2, v3 FROM test 
> NO EMPTY ROWS}} or {{SELECT v2, v3 FROM test NO EMPTY}}: the last one is 
> shorter but maybe a bit less explicit. As for {{RESULTS}} versus {{ROWS}}, 
> the only small object to {{NO EMPTY ROWS}} could be that it might suggest it 
> is filtering non existing rows (I mean, the fact we never ever return non 
> existing rows should hint that it's not what it does but well...) while we're 
> just filtering empty "resultSet rows".
> Of course, if there is a pre-existing SQL syntax for that, it's even better, 
> though a very quick search didn't turn anything. Other suggestions welcome 
> too.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to