There are two scenarios: it can come from not specifying the full primary key, including clustering columns, or it can span multiple partitions. That’s why I created https://issues.apache.org/jira/browse/CASSANDRA-15803 and why I think it’s relevant for this.

On Apr 14, 2023, at 9:37 AM, Lorina Poland <lor...@datastax.com> wrote:


Wow, you know, J.D., I've never actually heard ALLOW FILTERING described as you did. Generally, the discussion is always in terms of multiple partitions, probably because that is the situation in which the memory is exceeded. Thanks for that definition. 

Regardless of how this discussion goes, I'll make a ticket to change that doc.

Lorina

On Thu, Apr 13, 2023 at 4:17 AM J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
The documentation is wrong. ALLOW FILTERING has always meant that “rows will need to be materialized in memory and accepted or rejected by a column filter” aka the full primary key was not specified and some other column was specified.  It has never been about multiple partitions.
Basically “will the server need to read from disk more data (possibly a lot more) than will be returned to the client”.
Should we change how that works? Maybe. But let's move such discussions to a new thread and keep this one about the CEP proposal.

On Apr 13, 2023, at 6:00 AM, Andrés de la Peña <adelap...@apache.org> wrote:


Indeed requiring AF for "select * from ks.tb where p1 = 1 and c1 = 2 and col2 = 1", where p1 and c1 are all the columns in the primary key, sounds like a bug. 

I think the criterion in the code is that we require AF if there is any column restriction that cannot be processed by the primary key or a secondary index. The error message indeed seems to reject any kind of filtering, independently of primary key filters. We can see this even without defined clustering keys:

CREATE TABLE t (k int PRIMARY KEY, v int);
SELECT * FROM t WHERE k = 1 AND v = 1; -- requires AF

That clashes with the documentation, which says that AF is required for filters that need to scan all partitions. If we were to adapt the code to the behaviour described in the documentation, we shouldn't require AF if there are restrictions specifying a partition key, or possibly a group of partition keys if an IN restriction is used. So neither within-row nor within-partition filtering would require AF.
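
To make the difference between the two rules concrete, here is a minimal Python sketch. It is only a toy model of the criteria as they are described in this thread, not Cassandra's actual implementation. The function names and the representation of a query as a set of restricted column names are hypothetical, and IN restrictions and range restrictions are not modelled.

```python
def requires_af_observed(restricted, primary_key, indexed=frozenset()):
    """Rule as described in this thread for the current code (assumption):
    AF is needed if any restricted column is neither part of the primary
    key nor covered by a secondary index."""
    return any(col not in primary_key and col not in indexed
               for col in restricted)


def requires_af_documented(restricted, partition_key, primary_key,
                           indexed=frozenset()):
    """Rule as the documentation is read in this thread (assumption):
    AF is needed only if the filtering has to scan all partitions,
    i.e. the partition key is not fully restricted."""
    needs_filtering = requires_af_observed(restricted, primary_key, indexed)
    partition_restricted = set(partition_key) <= set(restricted)
    return needs_filtering and not partition_restricted


# CREATE TABLE t (k int PRIMARY KEY, v int);
# SELECT * FROM t WHERE k = 1 AND v = 1;
print(requires_af_observed({"k", "v"}, primary_key={"k"}))
print(requires_af_documented({"k", "v"}, partition_key={"k"},
                             primary_key={"k"}))
```

Under this model the first call reports that AF is required and the second reports that it is not, which is the mismatch discussed above.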

Regarding adding a new ALLOW FILTERING WITHIN PARTITION, I think we could just add a guardrail to directly disallow those queries, without needing to add the WITHIN PARTITION clause to the CQL grammar.

On Thu, 13 Apr 2023 at 11:11, Henrik Ingo <henrik.i...@datastax.com> wrote:


On Thu, Apr 13, 2023 at 10:20 AM Miklosovic, Stefan <stefan.mikloso...@netapp.com> wrote:
Somebody correct me if I am wrong but "partition key" itself is not enough (primary keys = partition keys + clustering columns). It will require ALLOW FILTERING when clustering columns are not specified either.

create table ks.tb (p1 int, c1 int, col1 int, col2 int, primary key (p1, c1));
select * from ks.tb where p1 = 1 and col1 = 2;     // this will require allow filtering

The documentation seems to omit this fact.

It does seem so.

That said, personally I was assuming, and would still argue it's the optimal choice, that the documentation was right and reality is wrong.

If there is a partition key, then the query can avoid scanning the entire table, across all nodes, potentially petabytes.

If a query specifies a partition key but not the full clustering key, of course there will be some scanning needed, but this is marginal compared to the need to scan the entire table. Even in the worst case, a partition with 2 billion cells, we are talking about seconds to filter the result from the single partition.

> Aha I get what you all mean:

No, I actually think both are unnecessary. But yeah, certainly this latter case is a bug?

henrik

--

Henrik Ingo

c. +358 40 569 7354 

w. www.datastax.com

     

