Re: [DISCUSS] CEP 52: Schema Annotations for ApacheCassandra

Štefan Miklošovič Tue, 12 Aug 2025 00:14:28 -0700

I like the idea of COMMENT ON and the like from PG! Yes, great stuff, as we do
not invent anything custom and we will be as close as possible to industry
standard.


So, if I understand this correctly, on COMMENT ON, we would save each
comment to a dedicated table. Then on DESCRIBE, we would "enrich" the CQL
element we are describing with commentary, if any, from that comment table,
correct?
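
To make sure we mean the same flow, here is a minimal sketch in Python. It is purely illustrative: CommentTable, comment_on and describe_table are names I made up, the comment table is just an in-memory dict, and emitting COMMENT ON lines after CREATE TABLE is only one possible way DESCRIBE could render the enrichment.

```python
from typing import Dict, List, Optional, Tuple

# Key: (object_type, keyspace, table, column); column is "" for the table itself.
CommentKey = Tuple[str, str, str, str]


class CommentTable:
    """In-memory stand-in for a hypothetical comment table."""

    def __init__(self) -> None:
        self.rows: Dict[CommentKey, str] = {}

    def comment_on(self, object_type: str, keyspace: str, table: str,
                   text: Optional[str], column: str = "") -> None:
        """Store a comment; None removes it (as COMMENT ON ... IS NULL does in PG)."""
        key = (object_type, keyspace, table, column)
        if text is None:
            self.rows.pop(key, None)
        else:
            self.rows[key] = text


def quote(text: str) -> str:
    """Quote a string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def describe_table(comments: CommentTable, keyspace: str, table: str,
                   columns: List[Tuple[str, str]]) -> str:
    """Render CREATE TABLE, then one COMMENT ON line per stored comment."""
    name = f"{keyspace}.{table}"
    body = ",\n".join(f"    {col} {cql_type}" for col, cql_type in columns)
    lines = [f"CREATE TABLE {name} (\n{body}\n);"]
    table_comment = comments.rows.get(("TABLE", keyspace, table, ""))
    if table_comment is not None:
        lines.append(f"COMMENT ON TABLE {name} IS {quote(table_comment)};")
    for col, _ in columns:
        text = comments.rows.get(("COLUMN", keyspace, table, col))
        if text is not None:
            lines.append(f"COMMENT ON COLUMN {name}.{col} IS {quote(text)};")
    return "\n".join(lines)


comments = CommentTable()
comments.comment_on("TABLE", "test_ks", "test_table", "demo table")
comments.comment_on("COLUMN", "test_ks", "test_table",
                    "this is column col2", column="col2")
print(describe_table(comments, "test_ks", "test_table",
                     [("id", "int PRIMARY KEY"), ("col2", "int")]))
```

Elements without a comment (id above) are rendered unchanged, which is the "if any" part.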

I, in general, support this idea, but as usual the devil is in the details.
I am just genuinely curious how this would work in practice.


If we go with COMMENT ON, is this going to be stored to TCM or not?


If the answer is yes, then it is much simpler, because then this
commentary would be dispersed by the means of TCM and each node would apply
this transformation locally to system_schema.annotations.

If the answer is no and if there is a cluster and we do COMMENT ON, then
this comment has to be saved to a table. If we rule out TCM as a vehicle
for the dispersion of these comments, that comment table has to be
distributed / replicated, correct? I do not think that we can create that
table under system_schema then, as that is on LocalStrategy and all
modifications to that are, as I understand it, done via TCM?

Hence, I guess the better place for that is under system_distributed? That
means that if somebody changes that keyspace to NTS or nodes are not
available, we will not be able to create any commentary.

Also, if we remove / alter anything, like dropping a keyspace, table, or
index, or removing a column, all these changes would also need to remove the
respective comments from that table.
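
A rough sketch of that cleanup, again in Python and under my own assumptions: drop_comments is a hypothetical helper, comments are keyed by (keyspace, table, column), and indexes and other element types are left out for brevity.

```python
from typing import Dict, Tuple

# Key: (keyspace, table, column); "" marks "not applicable" at that level,
# e.g. ("ks", "", "") is a comment on the keyspace itself.
CommentKey = Tuple[str, str, str]


def drop_comments(rows: Dict[CommentKey, str], keyspace: str,
                  table: str = "", column: str = "") -> int:
    """Delete the comment on the dropped element and on everything nested in it.

    Returns the number of comments removed.
    """
    def is_under(key: CommentKey) -> bool:
        ks, tbl, col = key
        if ks != keyspace:
            return False
        if table and tbl != table:
            return False
        if column and col != column:
            return False
        return True

    # Collect first, then delete, so the dict is not mutated while iterating.
    doomed = [key for key in rows if is_under(key)]
    for key in doomed:
        del rows[key]
    return len(doomed)
```

Dropping a keyspace calls drop_comments(rows, "ks"), dropping a table adds the table name, and removing a column adds the column name too. The point is that every DROP / ALTER path needs such a hook, otherwise orphaned comments stay behind.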

For these reasons, I think the best way to do this is to have a dedicated
system_schema.annotations table that people can query directly, to interact
with it via COMMENT ON to stay "PG-compatible", and to back COMMENT ON by TCM
as another transformation (as COMMENT ON is inherently part of the schema).
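
To illustrate why the log-based approach is attractive, here is a toy model in Python. It is not the TCM API: CommentOn, Node and catch_up are hypothetical names, and the ordered list stands in for the metadata log. Each node applies the same entries in the same order to its own local table, so the tables converge without a replicated keyspace.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Target = Tuple[str, str, str]  # (keyspace, table, column); "" = not applicable


@dataclass(frozen=True)
class CommentOn:
    """Hypothetical schema transformation carrying one COMMENT ON statement."""
    target: Target
    text: Optional[str]  # None removes the comment

    def apply(self, annotations: Dict[Target, str]) -> None:
        if self.text is None:
            annotations.pop(self.target, None)
        else:
            annotations[self.target] = self.text


@dataclass
class Node:
    """A node holding its own local copy of the annotations table."""
    epoch: int = 0
    annotations: Dict[Target, str] = field(default_factory=dict)

    def catch_up(self, log: List[CommentOn]) -> None:
        """Apply, in order, every log entry this node has not seen yet."""
        for entry in log[self.epoch:]:
            entry.apply(self.annotations)
        self.epoch = len(log)


log = [
    CommentOn(("test_ks", "test_table", "col2"), "draft"),
    CommentOn(("test_ks", "test_table", "col2"), "this is column col2"),
    CommentOn(("test_ks", "test_table", ""), "demo table"),
]
fast, slow = Node(), Node()
fast.catch_up(log[:1])  # has seen only the first entry so far
fast.catch_up(log)      # applies the remaining two
slow.catch_up(log)      # applies all three at once
assert fast.annotations == slow.annotations
```

A node that was down simply replays what it missed, which is the availability problem a table under system_distributed would have.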

On Mon, Aug 11, 2025 at 10:55 PM Patrick McFadin <[email protected]> wrote:

> One (of many) reasons I'm advocating we migrate away from CQL. It served a
> purpose at the time, but this project is evolving and this to me seems like
> the logical next iteration. The Cassandra project has built its
> reputation on what it can do, not clever syntax design. ;)
>
> Patrick
>
> On Mon, Aug 11, 2025 at 1:51 PM Yifan Cai <[email protected]> wrote:
>
>> The reasoning on operator and LLM familiarity is spot on.
>>
>> I have experimented with LLM generated queries. It typically does a
>> noticeably better job on SQL than CQL.
>>
>> - Yifan
>>
>> On Mon, Aug 11, 2025 at 1:44 PM Patrick McFadin <[email protected]>
>> wrote:
>>
>>> I really love this CEP.  +1 on the goal.
>>>
>>> As you've already seen, I've been advocating to improve our syntax
>>> ergonomics towards more mainstream SQL and avoiding new/custom syntax.  I
>>> would suggest the following changes towards that goal:
>>>  - Reuse PG-shaped DDL. Keep human text in COMMENT ON[1] (map existing
>>> table comments to that). For structured tags, mirror SECURITY LABEL[2]:
>>> SECURITY LABEL FOR <provider> ON <object> IS '<text>';
>>>
>>>  - Allow multiple providers per object. Store the value as text in v1
>>> (JSON or key/val later if we want), which avoids inventing new inline @
>>> syntax.
>>>
>>>  - Avoid new grammar in CREATE/ALTER. Skipping inline @PII keeps schemas
>>> readable and the grammar simple. Tools can issue COMMENT ON/SECURITY LABEL
>>> right after DDL, like PG users do today.
>>>
>>>  - Names & built-ins. Case-insensitive provider names with canonical
>>> lowercase. No separate @Description type. COMMENT ON already covers that
>>> use case cleanly.
>>>
>>>  - Introspection by query and by DESC. Keep annotations visible in
>>> DESCRIBE, but also expose a single system_schema.annotations view
>>> (provider, object_type, object_name, sub_name, value) so folks can get all
>>> annotations for a table. Example: “find all columns labeled PII,” etc.
>>>
>>> Why PG-like? Besides operator familiarity, there’s far more training
>>> data and tooling around COMMENT ON/SECURITY LABEL than around bespoke
>>> @annotation syntax. Sticking to that shape reduces LLM/tool friction and
>>> avoids teaching the world a new grammar. This has been a huge challenge for
>>> Cassandra work with LLMs as models tend to drift towards PG SQL in CQL
>>> often. (No Claude, JOIN is not a keyword in Cassandra)
>>>
>>> If this direction sounds good, happy to help update the CEP text and
>>> examples.
>>>
>>> Patrick
>>>
>>> 1: COMMENT ON docs
>>> https://www.postgresql.org/docs/current/sql-comment.html
>>> 2: SECURITY LABEL docs
>>> https://www.postgresql.org/docs/current/sql-security-label.html
>>>
>>>
>>> On Mon, Aug 11, 2025 at 10:18 AM Yifan Cai <[email protected]> wrote:
>>>
>>>> IMO, the full schema or table schema output already makes it
>>>> possible to filter the fields (not limited to columns) that are using
>>>> certain annotations, relatively easily. Grepping or parsing, whichever is
>>>> more suitable for the scenarios; consumers make the call.
>>>> Providing such a dedicated query adds little value, yet it adds quite
>>>> a lot of complexity to the design of this CEP. Please
>>>> correct me if I have the wrong understanding of the queries.
>>>>
>>>> Another reason for preferring the existing "DESCRIBE" statements is the
>>>> gen-AI enrichment mentioned in the CEP. We most likely want to feed the LLM
>>>> the full (table) schema.
>>>>
>>>> The primary goal is to enrich the schema with annotations. Through the
>>>> discussion thread, we will find out whether there is enough motivation to
>>>> support such queries to filter by annotation. I appreciate that you brought
>>>> up the idea.
>>>>
>>>> Although we are not at the stage of talking about the implementation,
>>>> just sharing my thoughts a bit, I am thinking of the approach (1) that
>>>> Stefan mentioned.
>>>>
>>>> - Yifan
>>>>
>>>> On Mon, Aug 11, 2025 at 6:31 AM Francisco Guerrero <[email protected]>
>>>> wrote:
>>>>
>>>>> Another interesting query would be to retrieve all the fields
>>>>> annotated with PII
>>>>> for example.
>>>>>
>>>>> On 2025/08/11 01:01:21 Yifan Cai wrote:
>>>>> > >
>>>>> > > Will there be an option to do a SELECT query to read all the
>>>>> annotations
>>>>> > > of a table?
>>>>> >
>>>>> >
>>>>> > It is an interesting question! Would you mind sharing an example of
>>>>> the
>>>>> > output you'd expect from a query like *"SELECT * FROM
>>>>> > system_schema.annotations where keyspace_name=<> and
>>>>> table_name=<>"*? I am
>>>>> > curious how that might differ from what we get when running "DESC
>>>>> TABLE".
>>>>> >
>>>>> > - Yifan
>>>>> >
>>>>> > On Sat, Aug 9, 2025 at 9:43 AM Jaydeep Chovatia <
>>>>> [email protected]>
>>>>> > wrote:
>>>>> >
>>>>> > > >we could explore enriching the syntax with DESCRIBE
>>>>> > >
>>>>> > > Will there be an option to do a SELECT query to read all the
>>>>> annotations
>>>>> > > of a table? Something like *"SELECT * FROM
>>>>> system_schema.annotations
>>>>> > > where keyspace_name=<> and table_name=<>"*
>>>>> > > It would be helpful to have a structured CQL query on top of
>>>>> printing the
>>>>> > > annotations through DESC so that the information can be consumed
>>>>> easily.
>>>>> > >
>>>>> > > Jaydeep
>>>>> > >
>>>>> > > On Fri, Aug 8, 2025 at 11:03 AM Jyothsna Konisa <
>>>>> [email protected]>
>>>>> > > wrote:
>>>>> > >
>>>>> > >> Thanks, Joel, for the positive response.
>>>>> > >>
>>>>> > >> 1. User-defined vs. pre-defined annotation types
>>>>> > >>
>>>>> > >> We'd like to have one predefined annotation, Description, but
>>>>> also give
>>>>> > >> users the flexibility to create new ones. If a user feels that a
>>>>> custom
>>>>> > >> annotation like @Desc suits their use case, they should be
>>>>> allowed to use
>>>>> > >> it, as these elements are purely descriptive and have no actions
>>>>> associated
>>>>> > >> with them.
>>>>> > >>
>>>>> > >> 2. Syntactically, is it worth considering other alternatives?
>>>>> > >>
>>>>> > >> You're concerned that having several annotations on multiple
>>>>> columns
>>>>> > >> could make schemas difficult to read. For now, we can have
>>>>> annotations
>>>>> > >> printed as part of DESCRIBE statements. If there's a strong need
>>>>> to
>>>>> > >> suppress annotations for readability, we could explore enriching
>>>>> the syntax
>>>>> > >> with DESCRIBE [FULL] SCHEMA [WITH ANNOTATIONS], similar to the
>>>>> existing
>>>>> > >> DESCRIBE [FULL] SCHEMA.
>>>>> > >>
>>>>> > >> Thanks,
>>>>> > >> Jyothsna
>>>>> > >>
>>>>> > >> On Fri, Aug 8, 2025 at 10:56 AM Jyothsna Konisa <
>>>>> [email protected]>
>>>>> > >> wrote:
>>>>> > >>
>>>>> > >>> Thanks, Stefan, for your feedback!
>>>>> > >>>
>>>>> > >>> To answer your questions,
>>>>> > >>>
>>>>> > >>> 1. I agree; annotations can optionally take arguments, and if an
>>>>> > >>> annotation doesn't have an argument, we can skip the arguments
>>>>> in the
>>>>> > >>> "DESCRIBE" statement's output.
>>>>> > >>>
>>>>> > >>> 2. Good point. We originally considered using "ANNOTATED WITH"
>>>>> but found
>>>>> > >>> it too verbose. As an alternative, we proposed using "@"
>>>>> preceding the
>>>>> > >>> annotation to signal it to the parser. We are open to using an
>>>>> explicit
>>>>> > >>> phrase like "ANNOTATED WITH" if you think it would make the code
>>>>> more
>>>>> > >>> readable.
>>>>> > >>>
>>>>> > >>> A full example of annotations along with constraints and masking
>>>>> could
>>>>> > >>> be:
>>>>> > >>>
>>>>> > >>>
>>>>> > >>> CREATE TABLE test_ks.test_table (
>>>>> > >>>     id int PRIMARY KEY,
>>>>> > >>>     col2 int CHECK col2 > 0 ANNOTATED WITH @PII AND
>>>>> @DESCRIPTION('this
>>>>> > >>> is column col2') MASKED WITH default()
>>>>> > >>> );
>>>>> > >>>
>>>>> > >>> OR
>>>>> > >>>
>>>>> > >>> CREATE TABLE test_ks.test_table (
>>>>> > >>>     id int PRIMARY KEY,
>>>>> > >>>     col2 int CHECK col2 > 0 @PII AND @DESCRIPTION('this is
>>>>> column col2')
>>>>> > >>> MASKED WITH default()
>>>>> > >>> );
>>>>> > >>>
>>>>> > >>>
>>>>> > >>>
>>>>> > >>> 3. We do not have a prototype yet, but I think we will have to
>>>>> > >>> introduce a new parsing branch for annotations at the table level.
>>>>> > >>>
>>>>> > >>> I hope I answered all your questions!
>>>>> > >>>
>>>>> > >>> - Jyothsna
>>>>> > >>>
>>>>> > >>> On Thu, Aug 7, 2025 at 11:36 AM Joel Shepherd <
>>>>> [email protected]>
>>>>> > >>> wrote:
>>>>> > >>>
>>>>> > >>>> I like the aim of the CEP. Completely onboard with the idea
>>>>> that GenAI
>>>>> > >>>> tooling works better when you can provide it useful context
>>>>> about the data
>>>>> > >>>> it is working with. An organization I worked with in the past
>>>>> had a lot of
>>>>> > >>>> good results with marking up API models (not DB schemas, but
>>>>> similar idea)
>>>>> > >>>> with authorization-related annotations and using those to drive
>>>>> policy
>>>>> > >>>> linters and end-user interfaces. So, sold on the value of the
>>>>> capability.
>>>>> > >>>>
>>>>> > >>>> Two things I'm less sure of:
>>>>> > >>>>
>>>>> > >>>> 1) User-defined vs pre-defined annotation types: I appreciate
>>>>> the
>>>>> > >>>> flexibility that user-defined annotations appear to give, but
>>>>> it adds
>>>>> > >>>> extra room for error. E.g. if annotation names are
>>>>> case-sensitive, do I
>>>>> > >>>> (the user) have to actively prevent creation of @description?
>>>>> Or, police
>>>>> > >>>> the accidental creation of alternative names like @Desc? If the
>>>>> community
>>>>> > >>>> settled on a small, fixed set of supported annotations, so
>>>>> Cassandra itself
>>>>> > >>>> was authoritative for valid annotation names, would that make the
>>>>> feature a lot
>>>>> > >>>> less valuable, or prevent offering user-defined annotations in
>>>>> the future?
>>>>> > >>>>
>>>>> > >>>> 2) Syntactically, is it worth considering other alternatives? I
>>>>> was
>>>>> > >>>> trying to imagine a CREATE TABLE statement marked up with two
>>>>> or three
>>>>> > >>>> types of column-level annotations, and my sense is that it
>>>>> could get hard
>>>>> > >>>> to read quickly. Is it worth considering Javadoc-style
>>>>> annotations in
>>>>> > >>>> schema comments instead? I think in today's world that means
>>>>> that they
>>>>> > >>>> would not be accessible via CQL/Cassandra (CQL comments are not
>>>>> persisted
>>>>> > >>>> as part of the schema, correct?) but they could be accessible
>>>>> to other
>>>>> > >>>> schema-processing tools and IMO be a more readable syntax. It'd
>>>>> be good to
>>>>> > >>>> work through a couple use-cases for actually using the data
>>>>> provided by the
>>>>> > >>>> annotations and get a sense of whether making them first-class
>>>>> entities in
>>>>> > >>>> CQL is necessary for getting most of the value from them.
>>>>> > >>>>
>>>>> > >>>> Thanks -- Joel.
>>>>> > >>>> On 8/6/2025 6:59 PM, Jyothsna Konisa wrote:
>>>>> > >>>>
>>>>> > >>>> Sorry for the incorrect editable link, here is the updated link
>>>>> to the CEP
>>>>> > >>>> 52: Schema Annotations for ApacheCassandra
>>>>> > >>>> <
>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP+52%3A+Schema+Annotations+for+ApacheCassandra
>>>>> >
>>>>> > >>>>
>>>>> > >>>> On Wed, Aug 6, 2025 at 4:26 PM Jyothsna Konisa <
>>>>> [email protected]>
>>>>> > >>>> wrote:
>>>>> > >>>>
>>>>> > >>>>> Hello Everyone!
>>>>> > >>>>>
>>>>> > >>>>> We would like to propose CEP 52: Schema Annotations for
>>>>> > >>>>> ApacheCassandra
>>>>> > >>>>> <
>>>>> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=373887528&draftShareId=339b7f4e-9bc2-45bd-9a80-b0d4215e3f45&;
>>>>> >
>>>>> > >>>>>
>>>>> > >>>>> This CEP outlines a plan to introduce *Schema Annotations* as
>>>>> a way
>>>>> > >>>>> to add better context to schema elements. We're also proposing
>>>>> a set of new
>>>>> > >>>>> DDL statements to manage these annotations.
>>>>> > >>>>>
>>>>> > >>>>> We believe these annotations will be highly beneficial for
>>>>> several key
>>>>> > >>>>> areas:
>>>>> > >>>>>
>>>>> > >>>>>    -
>>>>> > >>>>>
>>>>> > >>>>>    GenAI Applications: Providing more context to LLMs could
>>>>> > >>>>>    significantly improve the accuracy and relevance of
>>>>> generated content.
>>>>> > >>>>>    -
>>>>> > >>>>>
>>>>> > >>>>>    Data Governance: Annotations can help in enforcing
>>>>> > >>>>>    policies.
>>>>> > >>>>>    -
>>>>> > >>>>>
>>>>> > >>>>>    Compliance: They can be used to track and manage compliance
>>>>> > >>>>>    requirements directly within the schema.
>>>>> > >>>>>
>>>>> > >>>>> We're eager to hear your thoughts and feedback on this
>>>>> proposal.
>>>>> > >>>>> Please keep the discussion within this mailing thread.
>>>>> > >>>>>
>>>>> > >>>>> Thanks for your time and feedback in advance.
>>>>> > >>>>>
>>>>> > >>>>> Best regards,
>>>>> > >>>>>
>>>>> > >>>>> Jyothsna & Yifan
>>>>> > >>>>>
>>>>> > >>>>>
>>>>> > >>>>>
>>>>> > >>>>>
>>>>> >
>>>>>
>>>>
