On the Analytics side, as long as the CQLSSTableWriter understands and enforces the constraints (which it should be able to, given that we provide the table schema), we should be good to go. We should try hard to avoid scanning the data on import: the Analytics library goes out of its way to push that kind of logic, along with the CPU and I/O work, off to the Spark executors that write the SSTables, and reading the whole SSTable on import can drastically slow down that process.
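To make the write-time idea concrete, here is a rough sketch in Python rather than the real writer API. Everything in it (`Constraint`, `ValidatingWriter`, `add_row`, the "at most 3 elements" rule) is a hypothetical stand-in I made up for illustration; it is not how CQLSSTableWriter is actually implemented. The point is only that if the writer rejects a violating row before it reaches the output, the import side has nothing left to scan.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class ConstraintViolation(ValueError):
    """Raised when a row breaks a constraint declared in the table schema."""


@dataclass(frozen=True)
class Constraint:
    # Hypothetical representation of a single column constraint.
    column: str
    description: str
    check: Callable[[Any], bool]


class ValidatingWriter:
    """Hypothetical stand-in for an SSTable writer that knows the constraints."""

    def __init__(self, constraints: List[Constraint]):
        self.constraints = constraints
        # Stands in for the SSTable being written by a Spark executor.
        self.rows: List[Dict[str, Any]] = []

    def add_row(self, row: Dict[str, Any]) -> None:
        # Validate before writing, so a violating row never reaches the output.
        for c in self.constraints:
            if c.column in row and not c.check(row[c.column]):
                raise ConstraintViolation(f"{c.column}: {c.description}")
        self.rows.append(row)


writer = ValidatingWriter([
    Constraint("tags", "at most 3 elements", lambda v: len(v) <= 3),
])

# Within the limit: accepted.
writer.add_row({"id": 1, "tags": frozenset({"a", "b"})})

# Over the limit: rejected at write time, nothing for import to scan later.
try:
    writer.add_row({"id": 2, "tags": frozenset({"a", "b", "c", "d"})})
except ConstraintViolation as e:
    rejected = str(e)
```

The check sees the whole collection value in the row, which is the situation a frozen collection gives you; it says nothing about non-frozen collections that are appended to across separate writes.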
I agree that it’s important to warn users in the docs that we don’t scan existing data for constraint violations when the table wasn’t created with those constraints, but I don’t think doing a scan on DDL change would be feasible. Could we support collection-level constraints only on frozen lists/sets/maps? That way the end user would have to be aware of the current size of the collection.

Doug

> On Jun 25, 2024, at 2:27 PM, Abe Ratnofsky <a...@aber.io> wrote:
>
> If we're going to introduce a feature that looks like SQL constraints, we
> should make sure it's "reasonably" compliant. In particular, we should avoid
> situations where a user creates a constraint, writes some data, then reads
> data that violates that constraint, unless they've expressed that violations
> on read would be acceptable.
>
> For Postgres, when adding a new constraint you can specify NOT VALID to avoid
> scanning all existing relevant data[1]. If we want to avoid scan-on-DDL, this
> tradeoff needs to be made clear to a user.
>
> As we've already discussed, constraints must deal with operations that appear
> within limits on the write path, but once reconciled on read or during
> compaction can lead to a violation. Adding to non-frozen collections is one
> example. Expecting users to understand the write path for collections feels
> unrealistic to me; I wonder if we should express in the constraint itself
> that it only applies during write.
>
> Anything that uses "nodetool import" (including cassandra-analytics) could
> theoretically push constraint-violating mutations to a table. We could update
> import to scan table contents first, or add a flag to trust the data in
> imported SSTables and make cassandra-analytics executors aware of table-level
> constraints.
>
> Some client implementations read the system_schema tables to build their
> object mappers, I'd like to confirm that nothing will require clients to be
> aware of these new schema constructs.
>
> Overall, I'm supportive of the distinctions discussed between constraints and
> guardrails and like the direction this is heading; I'd just like to make sure
> the more detailed semantics aren't confusing or misleading for our users, and
> semantics are much harder to change in the future.
>
> [1]: https://www.postgresql.org/docs/current/sql-altertable.html