Re: Constraint's "not null" alignment with transactions and their simplification

Jon Haddad Fri, 11 Apr 2025 10:21:02 -0700

Fair question.  I think the only reason to favor JSON over anything else is
it's popularity.


I think if a JSON type is simply a text type with validation, then adding a
simple validator for other formats is probably easy.  We've already got
YAML support in the codebase so adding yaml would be trival.

If you want to take it a step further (like postgres does), then you can
think of the JSON type as validation on top of an arbitrary object with any
number of nested fields.  If you transform the JSON into that object, you
can do queries on it (again, like PG), but you can also support other
structures like yaml, toml, whatever.  Because at the end of the day you're
only storing the binary optimized version of a data structure, and you can
go back and forth between the binary and the format, just like we already
can do with Jackson.

The true power here would being able to put indexes on the fields and have
all the path functionality for searches / updates / etc.

I never loved our existing JSON functionality, it never made sense to me.
This does.  It's all the power of collections + UDTs merged together.

I have no clue how hard it would be to build, but I know it would be
incredibly useful for end users.

Happy to discuss it further but probably best to move to another thread :)

Jon


On Fri, Apr 11, 2025 at 9:54 AM Štefan Miklošovič <smikloso...@apache.org>
wrote:

> Jon,
>
> that dedicated JSON type, I was thinking about this literally yesterday. I
> like the idea. There would be an implicit check on its value with json
> constraint under the hood. But that also got me thinking "why just JSON"?
> As if it was the only format out there ... Probably very widely used and
> the usage of JSON type would be justified but at the same time it starts to
> introduce this small "discrepancy" that we seem to favor one specific type
> / format over the others.
>
> On Fri, Apr 11, 2025 at 6:25 PM Jon Haddad <j...@rustyrazorblade.com>
> wrote:
>
>> I agree NOT NULL is nicer than check not null.
>>
>> Taking the JSON bit a step further, a dedicated JSON type is better than
>> a check, but that's probably out of scope.
>>
>> For inspiration, I've worked with Postgre's JSONB type in the past, and
>> it was awesome.
>>
>> https://www.postgresql.org/docs/current/functions-json.html
>>
>> On Fri, Apr 11, 2025 at 9:15 AM Benedict <bened...@apache.org> wrote:
>>
>>> [image: elephant.png]
>>>
>>> 5.5. Constraints
>>> <https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-CHECK-CONSTRAINTS>
>>> postgresql.org
>>> <https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-CHECK-CONSTRAINTS>
>>>
>>> <https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-CHECK-CONSTRAINTS>
>>>
>>>
>>> See the note directly above section 5.5.2 - Postgres supports functions
>>> in the CHECK predicate and requires they are pure functions (ie always
>>> yield the same result).
>>>
>>> The point is, CHECK is consistently a predicate defined over a field
>>> expression. If we aren’t matching the feature, why are we matching the
>>> keyword?
>>>
>>> I also disagree that CHECK is a useful prefix to all constraints, and
>>> would prefer consistency with Postgres here that also uses plain NOT NULL
>>> for a NOT NULL constraint.
>>>
>>>
>>> On 11 Apr 2025, at 17:05, Štefan Miklošovič <smikloso...@apache.org>
>>> wrote:
>>>
>>> 
>>> I went through Postgres' docs in a quite elaborate manner and I do not
>>> see any usage of "constraint functions" as we have them. Both MySQL and
>>> Postgres seem to only provide simple predicates (using relational
>>> operators) and "functions" as we have them (e.g. json / length) are not
>>> supported. So in this regard we are providing more than they do. Where have
>>> you taken "is_json(field)" form from? Not saying it's wrong, I am just
>>> curious where this is coming from.
>>>
>>> I do not like that we would have "NOT NULL" without "CHECK". I think we
>>> can go a little bit our own way as we have the comfort of modelling this
>>> from scratch. CQL is already different from SQL as is and I do not think
>>> that trying to follow SQL _orthodoxly_ is absolutely necessary but at the
>>> same time I find it easier and more welcoming for users coming to Cassandra
>>> for the first time to have syntax which is as close as possible to what
>>> they are used to.
>>>
>>> I find having constraints starting with "CHECK" _every time_ consistent.
>>> They do not need to think twice if "check" is going to be there or not. It
>>> is there every time. I do not know why SQL did not do it the same way, most
>>> probably because "NOT NULL" was the first being introduced and "CHECK"
>>> followed afterwards and it was just too late to it consistent.
>>>
>>>
>>>
>>> On Fri, Apr 11, 2025 at 5:33 PM Bernardo Botella <
>>> conta...@bernardobotella.com> wrote:
>>>
>>>> Benedict:
>>>>
>>>> An alternative for that, keeping the CHECK word, would be to change the
>>>> constraint name to IS_JSON. CHECK IS_JSON would read as you intend without
>>>> the need to jump to REQUIRE. I think that’s true for the rest of provided
>>>> constraints as well.
>>>>
>>>> Bernardo
>>>>
>>>>
>>>> On Apr 11, 2025, at 6:02 AM, Benedict <bened...@apache.org> wrote:
>>>>
>>>> We have taken a different approach though, as we do not actually take a
>>>> predicate on the RHS and do not supply the column name. In our examples we
>>>> had eg CHECK JSON, which doesn’t parse unambiguously to a human. The
>>>> equivalent to Postgres would seem to be CHECK is_json(field).
>>>>
>>>> I’m all for following an existing example, but once we decide to
>>>> diverge the justification is gone and we should decide holistically what we
>>>> think is best. So if we want to elide the column entirely and have a list
>>>> of built in restrictions, I’d prefer eg REQUIRE JSON since this parses
>>>> unambiguously to a human, whereas if we want to follow Postgres let’s do
>>>> that but do it but that means eg CHECK is_json(field).
>>>>
>>>> On 11 Apr 2025, at 10:57, Štefan Miklošovič <smikloso...@apache.org>
>>>> wrote:
>>>>
>>>> 
>>>> While modelling that, we followed how it is done in SQL world,
>>>> PostgreSQL as well as MySQL both use CHECK.
>>>>
>>>>
>>>> https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-CHECK-CONSTRAINTS
>>>>
>>>> https://dev.mysql.com/doc/refman/8.4/en/create-table-check-constraints.html
>>>>
>>>> On Fri, Apr 11, 2025 at 10:43 AM Benedict <bened...@apache.org> wrote:
>>>>
>>>>> I would prefer require/expect/is over check
>>>>>
>>>>> On 11 Apr 2025, at 08:05, Štefan Miklošovič <smikloso...@apache.org>
>>>>> wrote:
>>>>>
>>>>> 
>>>>> Yes, you will have it like that :) Thank you for this idea. Great
>>>>> example of cooperation over diverse domains.
>>>>>
>>>>> On Fri, Apr 11, 2025 at 12:29 AM David Capwell <dcapw...@apple.com>
>>>>> wrote:
>>>>>
>>>>>> I am biased but I do prefer
>>>>>>
>>>>>> val3 text CHECK NOT NULL AND JSON AND LENGTH() < 1024
>>>>>>
>>>>>> Here is a similar accord CQL
>>>>>>
>>>>>> BEGIN TRANSACTION
>>>>>>   LET a = (…);
>>>>>>   IF a IS NOT NULL
>>>>>>       AND a.b IS NOT NULL
>>>>>>       AND a.c IS NULL; THEN
>>>>>>     — profit
>>>>>>   END IF
>>>>>> COMMIT TRANSACTION
>>>>>>
>>>>>> On Apr 10, 2025, at 8:46 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>>>>
>>>>>> Re: reserved keywords, “check” is currently not, and I don’t think it
>>>>>> needs to be a reserved keyword with the proposal.
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* C. Scott Andreas <sc...@paradoxica.net>
>>>>>> *Sent:* Thursday, April 10, 2025 7:59:35 AM
>>>>>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>>>>> *Cc:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>>>>> *Subject:* Re: Constraint's "not null" alignment with transactions
>>>>>> and their simplification
>>>>>>
>>>>>> If the proposal does not introduce “check” as a reserved keyword that
>>>>>> would require quoting in existing DDL/DML, this concern doesn’t apply and
>>>>>> the email below can be ignored. This might be the case if “CHECK NOT 
>>>>>> NULL”
>>>>>> is the full token introduced rather than “CHECK” separately from
>>>>>> constraints that are checked.
>>>>>>
>>>>>> If “check” is introduced as a standalone reserved keyword: my primary
>>>>>> feedback is on the introduction of reserved words in the CQL grammar that
>>>>>> may affect compatibility of existing schemas.
>>>>>>
>>>>>> In the Cassandra 3.x series, several new CQL reserved words were
>>>>>> added (more than necessary) and subsequently backed out, because it
>>>>>> required users to begin quoting schemas and introduced incompatibility
>>>>>> between 3.x and 4.x for queries and DDL that “just worked” before.
>>>>>>
>>>>>> The word “check” is used in many domains (test/evaluation
>>>>>> engineering, finance, business processes, etc) and is likely to be used 
>>>>>> in
>>>>>> user schemas. If the proposal introduces this as a reserved word that 
>>>>>> would
>>>>>> require it to be quoted if used in table or column names, this will 
>>>>>> create
>>>>>> incompatibility for existing user queries on upgrade.
>>>>>>
>>>>>> Otherwise, ignore me. :)
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> – Scott
>>>>>>
>>>>>> –––
>>>>>> Mobile
>>>>>>
>>>>>> On Apr 10, 2025, at 7:47 AM, Jon Haddad <j...@rustyrazorblade.com>
>>>>>> wrote:
>>>>>>
>>>>>> 
>>>>>> This looks like a really nice improvement to me.
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 10, 2025 at 7:27 AM Štefan Miklošovič <
>>>>>> smikloso...@apache.org> wrote:
>>>>>>
>>>>>> Recently, David Capwell was commenting on constraints in one of Slack
>>>>>> threads (1) in dev channel and he suggested that the current form of "not
>>>>>> null" constraint we have right now in place, e.g like this
>>>>>>
>>>>>> create table ks.tb (id int primary key, val int check not_null(val));
>>>>>>
>>>>>> could be instead of that form used like this:
>>>>>>
>>>>>> create table ks.tb (id int primary key, val int check not null);
>>>>>>
>>>>>> That is - without the name of a column in the constraint's argument.
>>>>>> The reasoning behind that was that it is not only easier to read but 
>>>>>> there
>>>>>> is also this concept in transactions (cep-15) where there is also "not
>>>>>> null" used in some fashion and it would be nice if this was aligned so a
>>>>>> user does not encounter two usages of "not null"-s which are written down
>>>>>> differently, syntax-wise.
>>>>>>
>>>>>> Could the usage of "not null" in transactions be confirmed?
>>>>>>
>>>>>> This rather innocent suggestion brought an idea to us that
>>>>>> constraints could be quite simplified when it comes to their syntax,
>>>>>> consider this:
>>>>>>
>>>>>> val int check not_null(val)
>>>>>> val text check json(val)
>>>>>> val text check lenght(val) < 1000
>>>>>>
>>>>>> to be used like this:
>>>>>>
>>>>>> val int check not null
>>>>>> val text check json
>>>>>> val text check length() < 1000
>>>>>>
>>>>>> more involved checks like this:
>>>>>>
>>>>>> val text check not_null(val) and json(val) and length(val) < 1000
>>>>>>
>>>>>> might be just simplified to:
>>>>>>
>>>>>> val text check not null and json and length() < 1000
>>>>>>
>>>>>> It almost reads like plain English. Isn't this just easier for an eye?
>>>>>>
>>>>>> The reason we kept the column names in constraint definitions is
>>>>>> that, frankly speaking, we just did not know any better at the time it 
>>>>>> was
>>>>>> about to be implemented. It is a little bit more tricky to be able to use
>>>>>> it without column names because in Parser.g / Antlr we just bound the
>>>>>> grammar around constraints to a column name directly there. When column
>>>>>> names are not going to be there anymore, we need to bind it later in the
>>>>>> code behind the parser in server code. It is doable, it was just about
>>>>>> being a little bit more involved there.
>>>>>>
>>>>>> Also, one reason to keep the name of a column was that we might
>>>>>> specify different columns in a constraint from a column that is defined 
>>>>>> on
>>>>>> to have cross-column constraints but we abandoned this idea altogether 
>>>>>> for
>>>>>> other reasons which rendered the occurrence of a column name in a
>>>>>> constraint definition redundant.
>>>>>>
>>>>>> To have some overview of what would be possible to do with this
>>>>>> proposal:
>>>>>>
>>>>>> val3 text CHECK SOMECONSTRAINT('a');
>>>>>> val3 text CHECK JSON;
>>>>>> val3 text CHECK SOMECONSTRAINT('a') > 1;
>>>>>> val3 text CHECK SOMECONSTRAINT('a', 'b', 'c') > 1;
>>>>>> val3 text CHECK JSON AND LENGTH() < 600;
>>>>>> afternoon time CHECK afternoon >= '12:00:00' AND afternoon =<
>>>>>> '23:59:59';
>>>>>> val3 text CHECK NOT NULL AND JSON AND LENGTH() < 1024
>>>>>>
>>>>>> In addition to the specification of constraints without columns, what
>>>>>> would be possible to do is to also specify arguments to constraints. It 
>>>>>> is
>>>>>> currently not possible and there is no constraint which would accept
>>>>>> arguments to its function but I think that to be as flexible as possible
>>>>>> and prepare for the future, we might implement it as well.
>>>>>>
>>>>>> Constraints in their current form are already usable however I just
>>>>>> think that if we do not simplify, align and extend the syntax right now,
>>>>>> before it is baked in in a release, then we will never do it as it will 
>>>>>> be
>>>>>> quite tricky to extend this without breaking it and maintaining two
>>>>>> grammars at the same time would be very complex if not flat out 
>>>>>> impossible.
>>>>>>
>>>>>> Are you open to the simplification of constraint definitions as
>>>>>> suggested and what is your feedback about that? I already have a working
>>>>>> POC which just needs to be polished and tests fixed to accommodate the 
>>>>>> new
>>>>>> approach.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> (1) https://the-asf.slack.com/archives/CK23JSY2K/p1742409054164389
>>>>>>
>>>>>>
>>>>>>
>>>>

Re: Constraint's "not null" alignment with transactions and their simplification

Reply via email to