[jira] [Commented] (CASSANDRA-15803) Separate out allow filtering scanning through a partition versus scanning over the table

Jira Thu, 27 Apr 2023 04:54:05 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717146#comment-17717146
 ]


Andres de la Peña commented on CASSANDRA-15803:
-----------------------------------------------

I think currently AF is required for any query that returns fewer rows than 
those that are retrieved from the storage engine. Under that logic, the 
aforementioned example is correct in requiring AF:
{code:java}
create table ks.tb (id int, cl1 int, cl2 int, col1 int, primary key ((id), cl1, cl2));
select * from ks.tb where id = 1 and cl1 = 2 and cl2 = 3 and col1 = 4; // returns fewer rows than it reads, so it filters
{code}
Changing that would mean altering the semantics of AF.
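
To make that rule concrete, here is a toy model of it in Python. It is only a sketch under stated assumptions, not Cassandra's actual validation code: the names {{Table}}, {{split_restrictions}} and {{requires_allow_filtering}} are hypothetical, only equality restrictions are considered, and secondary indexes are ignored. Restrictions on the full partition key plus a prefix of the clustering columns are treated as a direct storage lookup; anything else has to be applied after reading, which is what AF guards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    partition_key: tuple  # partition key column names
    clustering: tuple     # clustering column names, in declared order


def split_restrictions(table, restricted):
    """Split equality-restricted columns into (storage_lookup, filtered).

    Toy assumption: the storage engine can only seek on the full partition
    key followed by a prefix of the clustering columns.
    """
    restricted = set(restricted)
    lookup = []
    if set(table.partition_key) <= restricted:
        lookup.extend(table.partition_key)
        for col in table.clustering:
            if col not in restricted:
                break  # a gap ends the usable clustering prefix
            lookup.append(col)
    filtered = sorted(restricted - set(lookup))
    return lookup, filtered


def requires_allow_filtering(table, restricted):
    """True when some restriction must be applied after rows are read."""
    _, filtered = split_restrictions(table, restricted)
    return bool(filtered)


tb = Table(partition_key=("id",), clustering=("cl1", "cl2"))
# The example above: the primary key is fully specified, but col1 is filtered.
print(requires_allow_filtering(tb, {"id", "cl1", "cl2", "col1"}))  # True
# Only a primary key prefix is restricted, so nothing is filtered.
print(requires_allow_filtering(tb, {"id", "cl1"}))  # False
```

Under this model the example requires AF even though it reads at most one row, which is exactly the case where the current semantics feel least useful.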

One might argue that the current semantics of AF are useless in some cases, 
like the one mentioned above. However, I tend to think that the current 
semantics are clear and easy to understand, albeit of limited usefulness in 
some cases. I think the question is whether we want to change the current 
semantics from "filter anything" to "filter over a potentially large dataset".

It is worth mentioning that AF is not the only way to prevent massive 
filtering. We also have [read 
thresholds|https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L1712-L1729]
 that would abort queries scanning too much data. Those are probably more 
useful in practice, and more accurate. However, they lack AF's ability to 
reject a query as early as when it is prepared.
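
The difference in timing can be sketched with a toy model in Python. This is an illustration under assumptions, not how Cassandra implements read thresholds: {{filtered_read}}, {{ReadAborted}} and the per-row byte sizes are hypothetical. The point is that the check runs while data is being read, so it measures the real amount of data scanned, but it can only fail at execution time.

```python
class ReadAborted(Exception):
    """Raised when a read exceeds the configured fail threshold."""


def filtered_read(rows, predicate, fail_threshold_bytes):
    """Filter (payload, size_in_bytes) pairs, aborting once too much is read."""
    read_bytes = 0
    matches = []
    for payload, size in rows:
        read_bytes += size
        if read_bytes > fail_threshold_bytes:
            raise ReadAborted(
                f"read {read_bytes} bytes, limit is {fail_threshold_bytes}"
            )
        if predicate(payload):
            matches.append(payload)
    return matches


# Hypothetical data: 50 rows of 100 bytes each.
rows = [({"id": 1, "col1": n % 5}, 100) for n in range(50)]

# A small read stays under the threshold and is simply filtered.
small = filtered_read(rows[:10], lambda r: r["col1"] == 4, fail_threshold_bytes=2_000)
print(len(small))  # 2

# A large read is aborted part-way through, regardless of how the query was written.
try:
    filtered_read(rows, lambda r: r["col1"] == 4, fail_threshold_bytes=2_000)
except ReadAborted as e:
    print("aborted:", e)
```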

The static analysis of the query done by AF is however imprecise, and it tends 
to make AF quite a frustrating requirement. So I'd be happy with just removing 
AF and relying on config properties limiting capabilities and read thresholds 
aborting queries.
{quote}I would go with a guardrail only if we do not want to make it granular 
per table.
{quote}
Per-table guardrails sound like an interesting idea. Those guardrails could be 
shipped as table properties limiting capabilities. Efforts on that front would 
probably have more potential for reuse than extending the CQL grammar 
with {{WITHIN PARTITION}}.
{quote}The question is what do you do with index queries that filter across 
multiple rows? Do you consider it as equivalent to a partition scan?
{quote}
That's indeed a tricky question. Normally {{WITHIN PARTITION}} would assume 
filtering on a single partition. An index query however would do filtering on 
a single partition on each node in the cluster. And I don't think we want to 
further complicate the grammar with 
{{ALLOW FILTERING WITHIN NODE WITHIN PARTITION}}, or something like that.
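
For reference, the behaviour proposed in the issue description can be modelled roughly as follows. This is a sketch of the proposal as I read it, not an existing implementation: {{Allow}}, {{FilteringNotAllowed}} and {{validate}} are hypothetical names, and a query is assumed to be single-partition only when every partition key column has an equality restriction. It says nothing about index queries, which is the open question above.

```python
from enum import Enum


class Allow(Enum):
    NONE = "none"                          # no ALLOW FILTERING clause
    WITHIN_PARTITION = "within partition"  # ALLOW FILTERING WITHIN PARTITION
    FULL = "full"                          # plain ALLOW FILTERING


class FilteringNotAllowed(Exception):
    """Raised when a filtering query lacks the clause it needs."""


def validate(partition_key, restricted, needs_filtering, allow):
    """Accept or reject a query under the proposed clause (toy model)."""
    if not needs_filtering or allow is Allow.FULL:
        return
    single_partition = set(partition_key) <= set(restricted)
    if allow is Allow.WITHIN_PARTITION:
        if single_partition:
            return
        raise FilteringNotAllowed(
            "query scans beyond one partition; it requires the full ALLOW FILTERING"
        )
    raise FilteringNotAllowed("query requires ALLOW FILTERING")


# Filtering inside one partition is accepted with the narrower clause.
validate(("id",), {"id", "col1"}, True, Allow.WITHIN_PARTITION)

# Filtering across the table is rejected unless the full clause is used.
try:
    validate(("id",), {"col1"}, True, Allow.WITHIN_PARTITION)
except FilteringNotAllowed as e:
    print(e)
```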

> Separate out allow filtering scanning through a partition versus scanning 
> over the table
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15803
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15803
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL/Syntax
>            Reporter: Jeremy Hanna
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>
> Currently allow filtering can mean two things in the spirit of "avoid 
> operations that don't seek to a specific row or sequential rows of data."  
> First, it can mean scanning across the entire table to meet the criteria of 
> the query.  That's almost always a bad thing and should be discouraged or 
> disabled (see CASSANDRA-8303).  Second, it can mean filtering within a 
> specific partition.  For example, in a query you could specify the full 
> partition key and if you specify a criterion on a non-key field, it requires 
> allow filtering.
> The second case, which also requires allow filtering, is significantly less 
> work because it only scans through a partition.  It is still extra work over 
> seeking to a specific row and getting N sequential rows though.  So while an 
> application developer and/or operator needs to be cautious about this second 
> type, it's not necessarily a bad thing, depending on the table and the use case.
> I propose that we separate the way to specify allow filtering across an 
> entire table from specifying allow filtering across a partition in a 
> backwards compatible way.  One idea that was brought up in Slack in the 
> cassandra-dev room was to have allow filtering mean the superset - scanning 
> across the table.  Then if you want to specify that you *only* want to scan 
> within a partition you would use something like
> {{ALLOW FILTERING [WITHIN PARTITION]}}
> So it will succeed if you specify non-key criteria within a single partition, 
> but fail with a message to say it requires the full allow filtering.  This 
> would allow for a backwards compatible full allow filtering while allowing a 
> user to specify that they want to just scan within a partition, but error out 
> if trying to scan a full table.
> This is potentially also related to the capability limitation framework by 
> which operators could more granularly specify what features are allowed or 
> disallowed per user, discussed in CASSANDRA-8303.  This way an operator could 
> disallow the more general allow filtering while allowing the partition scan 
> (or disallow them both at their discretion).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
