On the Analytics side, as long as the CQLSSTableWriter understands and enforces 
the constraints (which it should be able to, given we provide the table 
schema), we should be good to go. We should try hard to avoid scanning the data 
on import, as the Analytics library does a bunch of things to push that kind of 
logic and CPU + I/O work off to the Spark executors that write the sstables, 
and reading the whole SSTable on import can drastically slow down that process.
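
To make the writer-side idea concrete, here is a rough sketch in Python for
brevity. It is not the CQLSSTableWriter API: the class names, the "max size"
constraint, and the sink callback are all hypothetical. It only illustrates
checking each row against the table's constraints at write time on the
executor, so that nothing has to be re-read on import.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class ConstraintViolation(ValueError):
    """Raised when a row breaks a table-level constraint."""


@dataclass(frozen=True)
class MaxSizeConstraint:
    # Hypothetical constraint: the named collection column may hold
    # at most max_size elements.
    column: str
    max_size: int

    def check(self, row: Dict[str, Any]) -> None:
        value = row.get(self.column)
        # A missing or null column trivially satisfies a size limit.
        if value is not None and len(value) > self.max_size:
            raise ConstraintViolation(
                f"{self.column} has {len(value)} elements, "
                f"limit is {self.max_size}"
            )


@dataclass
class ConstraintCheckingWriter:
    # Stand-in for an sstable writer: 'sink' receives each accepted row.
    constraints: List[MaxSizeConstraint]
    sink: Callable[[Dict[str, Any]], None]

    def add_row(self, row: Dict[str, Any]) -> None:
        # Validate before writing, so a violating row never reaches the sink.
        for constraint in self.constraints:
            constraint.check(row)
        self.sink(row)
```

A check like this sees only the row being written, so it is sufficient only
when the whole value arrives in a single write.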

I agree that it’s important to warn users in the docs that we don’t scan 
existing data for constraint violations when the table wasn’t created with 
those constraints, but I don’t think a scan on DDL change would be feasible.

Could we support collection-level constraints only on frozen lists/sets/maps? 
A frozen collection is always written as a whole value, so the end user would 
have to be aware of the current size of the collection.

Doug

> On Jun 25, 2024, at 2:27 PM, Abe Ratnofsky <a...@aber.io> wrote:
> 
> If we're going to introduce a feature that looks like SQL constraints, we 
> should make sure it's "reasonably" compliant. In particular, we should avoid 
> situations where a user creates a constraint, writes some data, then reads 
> data that violates that constraint, unless they've expressed that violations 
> on read would be acceptable.
> 
> For Postgres, when adding a new constraint you can specify NOT VALID to avoid 
> scanning all existing relevant data[1]. If we want to avoid scan-on-DDL, this 
> tradeoff needs to be made clear to a user.
> 
> As we've already discussed, constraints must deal with operations that appear 
> within limits on the write path, but once reconciled on read or during 
> compaction can lead to a violation. Adding to non-frozen collections is one 
> example. Expecting users to understand the write path for collections feels 
> unrealistic to me; I wonder if we should express in the constraint itself 
> that it only applies during write.
> 
> Anything that uses "nodetool import" (including cassandra-analytics) could 
> theoretically push constraint-violating mutations to a table. We could update 
> import to scan table contents first, or add a flag to trust the data in 
> imported SSTables and make cassandra-analytics executors aware of table-level 
> constraints.
> 
> Some client implementations read the system_schema tables to build their 
> object mappers, I'd like to confirm that nothing will require clients to be 
> aware of these new schema constructs.
> 
> Overall, I'm supportive of the distinctions discussed between constraints and 
> guardrails and like the direction this is heading; I'd just like to make sure 
> the more detailed semantics aren't confusing or misleading for our users, and 
> semantics are much harder to change in the future.
> 
> [1]: https://www.postgresql.org/docs/current/sql-altertable.html
> 
