I wonder how often users will apply constraints to tables with existing data while knowing that the data is probably not compliant with the constraint configuration. I would humbly guess that people are aware of this in advance, and what usually happens is that some kind of job consolidates the data (or migrates it to a new table) before admins put a "lid" on it, so that going forward nobody writes anything that would violate the constraint.
I probably have not kept myself up to date with the discussion, but I was thinking that constraints effectively apply only on the write path; it is not a constraint's job to refuse to return whatever is read.

On Tue, Jun 25, 2024 at 9:57 PM Dinesh Joshi <djo...@apache.org> wrote:

> Abe, that's a good point. We need to call out distinct use-cases here.
> When a fresh cluster is set up with constraints we don't have any issues
> because the data written and read back is going to be compliant to the
> constraint(s). For existing data in a cluster where new constraints are
> applied or existing constraints changed in such a way that may render
> existing data unreadable, we need a good user experience. This is what I
> propose –
>
> 1. When a constraint is added or changed in such a way that existing data
> could be rendered unreadable, we should warn the user.
>
> 2. Give the user a choice of whether it is ok for the data to be rendered
> unreadable and an error is issued or a warning should be issued when the
> read violates the constraint but data is still readable. New data going in
> will meet the constraint but old data would need to be rewritten for
> the application to make it compliant.
>
> With this approach the application developer can decide what is right for
> their particular use-case. In many cases the application developer may
> decide to rewrite the data when they see a warning.
>
> On Tue, Jun 25, 2024 at 12:46 PM Abe Ratnofsky <a...@aber.io> wrote:
>
>> If we're going to introduce a feature that looks like SQL constraints, we
>> should make sure it's "reasonably" compliant. In particular, we should
>> avoid situations where a user creates a constraint, writes some data, then
>> reads data that violates that constraint, unless they've expressed that
>> violations on read would be acceptable.
>>
>> For Postgres, when adding a new constraint you can specify NOT VALID to
>> avoid scanning all existing relevant data[1]. If we want to avoid
>> scan-on-DDL, this tradeoff needs to be made clear to a user.
>>
>> As we've already discussed, constraints must deal with operations that
>> appear within limits on the write path, but once reconciled on read or
>> during compaction can lead to a violation. Adding to non-frozen collections
>> is one example. Expecting users to understand the write path for
>> collections feels unrealistic to me; I wonder if we should express in the
>> constraint itself that it only applies during write.
>>
>> Anything that uses "nodetool import" (including cassandra-analytics)
>> could theoretically push constraint-violating mutations to a table. We
>> could update import to scan table contents first, or add a flag to trust
>> the data in imported SSTables and make cassandra-analytics executors aware
>> of table-level constraints.
>>
>> Some client implementations read the system_schema tables to build their
>> object mappers; I'd like to confirm that nothing will require clients to be
>> aware of these new schema constructs.
>>
>> Overall, I'm supportive of the distinctions discussed between constraints
>> and guardrails and like the direction this is heading; I'd just like to
>> make sure the more detailed semantics aren't confusing or misleading for
>> our users, and semantics are much harder to change in the future.
>>
>> [1]: https://www.postgresql.org/docs/current/sql-altertable.html
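For readers unfamiliar with the Postgres pattern Abe references in [1], it looks roughly like this (a sketch; the table and constraint names here are made up for illustration):

```sql
-- NOT VALID skips scanning existing rows; only new writes are checked:
ALTER TABLE orders
    ADD CONSTRAINT orders_amount_positive CHECK (amount > 0) NOT VALID;

-- Later, the backlog can be validated explicitly, once the data has been
-- cleaned up, without blocking writes for the duration of a full scan:
ALTER TABLE orders VALIDATE CONSTRAINT orders_amount_positive;
```

This is the "make the tradeoff explicit to the user" design: the DDL succeeds immediately, and the user opts in to (and schedules) the validation of pre-existing data.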
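To make the non-frozen collection concern concrete, here is a hypothetical CQL sketch; the CHECK/size() constraint syntax below is invented for illustration and is not final CEP-42 syntax:

```cql
-- Hypothetical constraint syntax, for illustration only:
CREATE TABLE ks.events (
    id int PRIMARY KEY,
    tags set<text> CHECK (size(tags) < 3)
);

-- Each write appends only two elements, so each could pass a write-time
-- check in isolation; after reconciliation, a read sees four elements
-- and the constraint is violated even though no single write violated it:
UPDATE ks.events SET tags = tags + {'a', 'b'} WHERE id = 1;
UPDATE ks.events SET tags = tags + {'c', 'd'} WHERE id = 1;
```

This is why a write-path-only check on non-frozen collections cannot guarantee what a later read observes, and why the thread discusses marking such a constraint as write-time-only.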