Re: [DISCUSS] CEP 52: Schema Annotations for ApacheCassandra

Jyothsna Konisa Tue, 12 Aug 2025 12:52:48 -0700

Hi Stefan, Patrick, and everyone,

Thank you all for your valuable feedback and suggestions. I've consolidated
the key points and wanted to share our thinking on a path forward.



*Regarding the PostgreSQL-style Syntax (COMMENT ON & SECURITY LABEL)*

We agree with the consensus that adopting PostgreSQL-style syntax is the
most promising approach for the following reasons, which were
well-articulated in the thread:

- Avoids introducing new syntax

- Keeps CQL closer to mainstream SQL

- Benefits from the much larger body of SQL that LLMs are trained on



*Storing Annotations*

We propose to store these comments as part of the schema element's
metadata, which will be persisted to TCM.

Regarding the discussion about a separate table for annotations: as an
alternative, we would like to propose storing annotations/comments in a
virtual table. We can address this during implementation or as a follow-up
to this CEP.
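
To make the "structured query" use case from earlier in the thread concrete,
here is a minimal Python sketch of what consumers could do with rows from such
a table. It assumes the row shape Patrick suggested (provider, object_type,
object_name, sub_name, value); the provider name "governance", the sample rows,
and the helper columns_labeled are hypothetical and only illustrate the idea,
not a decided design.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One row of a hypothetical annotations table (shape assumed from the thread)."""
    provider: str
    object_type: str   # e.g. "table" or "column"
    object_name: str   # e.g. "ks.tb"
    sub_name: str      # column name, or "" for table-level rows
    value: str


def columns_labeled(rows, label):
    """Return the fully qualified names of columns whose value equals `label`."""
    return sorted(
        f"{row.object_name}.{row.sub_name}"
        for row in rows
        if row.object_type == "column" and row.value == label
    )


# Hypothetical sample rows; a real table would be populated by the schema.
rows = [
    Annotation("governance", "column", "ks.tb", "val", "PII"),
    Annotation("governance", "column", "ks.tb", "id", "PUBLIC"),
    Annotation("governance", "table", "ks.tb", "", "PII"),
]

pii_columns = columns_labeled(rows, "PII")
print(pii_columns)
```

The same filtering could of course be done by parsing DESCRIBE output; the
sketch only shows why a row-oriented view is convenient for tooling.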

*Impact on DESCRIBE Statements*

Adopting the COMMENT ON syntax will require some changes to how the schema
is displayed.

To maintain consistency and ensure the schema can be fully reproduced, the
COMMENT ON statements must be included in the output of DESCRIBE TABLE. We
propose that the output for DESCRIBE TABLE would look something like this:


// Comment creation & DESC table output
CREATE TABLE ks.tb
(
    id int PRIMARY KEY,
    val text
);

COMMENT ON COLUMN ks.tb.val IS 'credit card number';
SECURITY LABEL ON COLUMN ks.tb.val IS 'PII';


Including the comment information within the CREATE TABLE statement itself
might be redundant, so displaying it as separate COMMENT ON statements is
probably the better option.
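
As a rough illustration of the "separate statements" layout, the following
Python sketch renders a CREATE TABLE statement followed by one statement per
comment or label, so the output can be replayed as-is. It is not the actual
DESCRIBE implementation; ColumnNote, quote_literal and render_describe are
hypothetical names, and the only CQL rule it relies on is that a single quote
inside a string literal is escaped by doubling it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnNote:
    """A comment or security label attached to one column (hypothetical model)."""
    kind: str    # "COMMENT" or "SECURITY LABEL"
    column: str  # fully qualified, e.g. "ks.tb.val"
    text: str


def quote_literal(text: str) -> str:
    # CQL string literals escape a single quote by doubling it.
    return "'" + text.replace("'", "''") + "'"


def render_describe(create_stmt: str, notes: list[ColumnNote]) -> str:
    """Return the CREATE TABLE statement followed by one statement per note."""
    lines = [create_stmt.rstrip().rstrip(";") + ";"]
    for note in notes:
        lines.append(
            f"{note.kind} ON COLUMN {note.column} IS {quote_literal(note.text)};"
        )
    return "\n".join(lines)


create = "CREATE TABLE ks.tb (id int PRIMARY KEY, val text)"
notes = [
    ColumnNote("COMMENT", "ks.tb.val", "customer's credit card number"),
    ColumnNote("SECURITY LABEL", "ks.tb.val", "PII"),
]
output = render_describe(create, notes)
print(output)
```

Because every statement is terminated and every literal is escaped, the
rendered text stays copy-pasteable, which is the property Stefan asked to
preserve for DESCRIBE.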

Thanks
Jyothsna

On Tue, Aug 12, 2025 at 9:31 AM Štefan Miklošovič <[email protected]>
wrote:

> One more point I would like to add. If we enrich the output with comments,
> I think that showing comments should be the default only if I can take what
> DESCRIBE prints, copy it as-is, and create tables from it. Very
> often, DESCRIBE acts as something like "I will copy this schema here so I
> can reconstruct it later". So I would expect that, by default, what
> DESCRIBE gives is "reconstructable". I think there are already a lot of
> tests which check that what DESCRIBE prints can be reconstructed, and this
> would need to be preserved.
>
> We might still do "DESCRIBE ks.tb" without comments / annotations and then
> "DESCRIBE ks.tb WITH COMMENTS / ANNOTATIONS" to print them.
>
> If we put comments on this it is "reconstructable by copy-pasting" as well:
>
> create table ks.tb
> (
>     -- my primary key column
>     id int primary key,
>     -- this is my value
>     val text
> )
>
> however this is not
>
> create table ks.tb
> (
>     /**
>      my primary key column
>     */
>     id int primary key,
>     val text
> )
>
> you got me ...
>
> Also, if we start to automatically enrich DESCRIBE output, it would be
> very nice if this was digestible by previous versions. Because if I copy
> DESCRIBE output in 5.1 with @PII then I cannot just apply that to 5.0
> where that concept is not known yet. However plain comments do work in
> previous versions as well.
>
> For this reason I would not make annotations visible by default, I would
> opt-in by WITH COMMENTS / WITH ANNOTATIONS only and keep the current output
> as is.
>
>
> On Tue, Aug 12, 2025 at 10:56 AM Mick <[email protected]> wrote:
>
>> a point of order and a reminder: aside from suggestions that the CEP
>> author is free to adopt or not, anything that aims to steer what the
>> CEP should be should be accompanied by a willingness to commit to
>> helping make it happen.  we want to work as a meritocracy: those that
>> lead the work have the say, and blocking their chosen approach against
>> their wishes is justified only on clear technical grounds.  API designs
>> (CQL additions) always need to be chosen and evolved carefully, and every
>> CEP proposed should be open to that being naturally part of its discussion
>> pre-vote.
>>
>> following the PG approach does make a lot of sense.
>> what are your thoughts on it Jyothsna & Yifan ?
>>
>>
>>
>> > On 12 Aug 2025, at 09:14, Štefan Miklošovič <[email protected]>
>> wrote:
>> >
>> > I like the idea of COMMENT ON and alike from PG! Yes, great stuff, as
>> we do not invent anything custom and we will be as close as possible to
>> industry standard.
>> >
>> > So, if I understand this correctly, on COMMENT ON, we would save each
>> comment to a dedicated table. Then on DESCRIBE, we would "enrich" the CQL
>> element we are describing with commentary, if any, from that comment table,
>> correct?
>> >
>> > I, in general, support this idea, but as usual the devil is in the
>> details. I am just genuinely curious how this would work in practice.
>> >
>> >
>> > If we go with COMMENT ON, is this going to be stored to TCM or not?
>> >
>> >
>> > If the answer is yes, then it is much simpler, because then this
>> commentary would be dispersed by the means of TCM and each node would apply
>> this transformation locally to system_schema.annotations.
>> >
>> > If the answer is no and if there is a cluster and we do COMMENT ON,
>> then this comment has to be saved to a table. If we rule out TCM as a
>> vehicle for the dispersion of these comments, that comment table has to be
>> distributed / replicated, correct? I do not think that we can create that
>> table under system_schema then, as that is on LocalStrategy and all
>> modifications to that are, as I understand it, done via TCM?
>> >
>> > Hence, I guess the better place for that is under system_distributed?
>> That means that if somebody changes that keyspace to NTS or nodes are not
>> available, we will not be able to create any commentary.
>> >
>> > Also, if we remove / alter anything, like dropping a keyspace, table,
>> index, removing column etc ... all these changes would need to also remove
>> respective comments from that table etc etc.
>> >
>> > For these reasons, I think that having dedicated
>> system_schema.annotations table while interacting with it via COMMENT ON to
>> be "PG-compatible" so people can query that table directly, and backing
>> COMMENT ON by TCM by having it as another transformation (as COMMENT ON is
>> inherently part of the schema) is the best way to do this.
>> >
>> > On Mon, Aug 11, 2025 at 10:55 PM Patrick McFadin <[email protected]>
>> wrote:
>> > One (of many) reasons I'm advocating we migrate away from CQL. It
>> served a purpose at the time, but this project is evolving and this to me
>> seems like the logical next iteration. The Cassandra project has built its
>> reputation on what it can do, not clever syntax design. ;)
>> >
>> > Patrick
>> >
>> > On Mon, Aug 11, 2025 at 1:51 PM Yifan Cai <[email protected]> wrote:
>> > The reasonings on operator and LLM familiarity are spot on.
>> >
>> > I have experimented with LLM generated queries. It typically does a
>> noticeably better job on SQL than CQL.
>> >
>> > - Yifan
>> >
>> > On Mon, Aug 11, 2025 at 1:44 PM Patrick McFadin <[email protected]>
>> wrote:
>> > I really love this CEP.  +1 on the goal.
>> >
>> > As you've already seen, I've been advocating to improve our syntax
>> ergonomics towards more mainstream SQL and avoiding new/custom syntax.  I
>> would suggest the following changes towards that goal:
>> >  - Reuse PG-shaped DDL. Keep human text in COMMENT ON[1] (map existing
>> table comments to that). For structured tags, mirror SECURITY LABEL[2]:
>> > SECURITY LABEL FOR <provider> ON <object> IS '<text>';
>> >
>> > - Allow multiple providers per object. Store the value as text in v1
>> (JSON or key/val later if we want), which avoids inventing new inline @
>> syntax.
>> >
>> >  - Avoid new grammar in CREATE/ALTER. Skipping inline @PII keeps
>> schemas readable and the grammar simple. Tools can issue COMMENT
>> ON/SECURITY LABEL right after DDL, like PG users do today.
>> >
>> >  - Names & built-ins. Case-insensitive provider names with canonical
>> lowercase. No separate @Description type. COMMENT ON already covers that
>> use case cleanly.
>> >
>> >  - Introspection by query and by DESC. Keep annotations visible in
>> DESCRIBE, but also expose a single system_schema.annotations view
>> (provider, object_type, object_name, sub_name, value) so folks can get all
>> annotations for a table. Example: “find all columns labeled PII,” etc.
>> >
>> > Why PG-like? Besides operator familiarity, there’s far more training
>> data and tooling around COMMENT ON/SECURITY LABEL than around bespoke
>> @annotation syntax. Sticking to that shape reduces LLM/tool friction and
>> avoids teaching the world a new grammar. This has been a huge challenge for
>> Cassandra work with LLMs as models tend to drift towards PG SQL in CQL
>> often. (No Claude, JOIN is not a keyword in Cassandra)
>> >
>> > If this direction sounds good, happy to help update the CEP text and
>> examples.
>> >
>> > Patrick
>> >
>> > 1: COMMENT ON docs
>> https://www.postgresql.org/docs/current/sql-comment.html
>> > 2: SECURITY LABEL docs
>> https://www.postgresql.org/docs/current/sql-security-label.html
>> >
>> >
>> > On Mon, Aug 11, 2025 at 10:18 AM Yifan Cai <[email protected]> wrote:
>> > IMO, the full schema or table schema output already makes it possible
>> to filter the fields (not limited to columns) that are using certain
>> annotations, relatively easily. Grepping or parsing, whichever is more
>> suitable for the scenarios; consumers make the call.
>> > There is not much added value in providing such a dedicated query,
>> yet it would add quite a lot of complexity to the design of this CEP. Please
>> correct me if I have the wrong understanding of the queries.
>> >
>> > Another reason for preferring the existing "DESCRIBE" statements is the
>> gen-AI enrichment mentioned in the CEP. We most likely want to feed the LLM
>> the full (table) schema.
>> >
>> > The primary goal is to enrich the schema with annotations. Through the
>> discussion thread, we will find out whether there is enough motivation to
>> support such queries to filter by annotation. I appreciate that you brought
>> up the idea.
>> >
>> > Although we are not at the stage of talking about the implementation,
>> just sharing my thoughts a bit, I am thinking of the approach (1) that
>> Stefan mentioned.
>> >
>> > - Yifan
>> >
>> > On Mon, Aug 11, 2025 at 6:31 AM Francisco Guerrero <[email protected]>
>> wrote:
>> > Another interesting query would be to retrieve all the fields annotated
>> with PII
>> > for example.
>> >
>> > On 2025/08/11 01:01:21 Yifan Cai wrote:
>> > > >
>> > > > Will there be an option to do a SELECT query to read all the
>> annotations
>> > > > of a table?
>> > >
>> > >
>> > > It is an interesting question! Would you mind sharing an example of
>> the
>> > > output you'd expect from a query like *"SELECT * FROM
>> > > system_schema.annotations where keyspace_name=<> and table_name=<>"*?
>> I am
>> > > curious how that might differ from what we get when running "DESC
>> TABLE".
>> > >
>> > > - Yifan
>> > >
>> > > On Sat, Aug 9, 2025 at 9:43 AM Jaydeep Chovatia <
>> [email protected]>
>> > > wrote:
>> > >
>> > > > >we could explore enriching the syntax with DESCRIBE
>> > > >
>> > > > Will there be an option to do a SELECT query to read all the
>> annotations
>> > > > of a table? Something like *"SELECT * FROM system_schema.annotations
>> > > > where keyspace_name=<> and table_name=<>"*
>> > > > It would be helpful to have a structured CQL query on top of
>> printing the
>> > > > annotations through DESC so that the information can be consumed
>> easily.
>> > > >
>> > > > Jaydeep
>> > > >
>> > > > On Fri, Aug 8, 2025 at 11:03 AM Jyothsna Konisa <
>> [email protected]>
>> > > > wrote:
>> > > >
>> > > >> Thanks, Joel, for the positive response.
>> > > >>
>> > > >> 1. User-defined vs. pre-defined annotation types
>> > > >>
>> > > >> We'd like to have one predefined annotation, Description, but also
>> give
>> > > >> users the flexibility to create new ones. If a user feels that a
>> custom
>> > > >> annotation like @Desc suits their use case, they should be allowed
>> to use
>> > > >> it, as these elements are purely descriptive and have no actions
>> associated
>> > > >> with them.
>> > > >>
>> > > >> 2. Syntactically, is it worth considering other alternatives?
>> > > >>
>> > > >> You're concerned that having several annotations on multiple
>> columns
>> > > >> could make schemas difficult to read. For now, we can have
>> annotations
>> > > >> printed as part of DESCRIBE statements. If there's a strong need to
>> > > >> suppress annotations for readability, we could explore enriching
>> the syntax
>> > > >> with DESCRIBE [FULL] SCHEMA [WITH ANNOTATIONS], similar to the
>> existing
>> > > >> DESCRIBE [FULL] SCHEMA.
>> > > >>
>> > > >> Thanks,
>> > > >> Jyothsna
>> > > >>
>> > > >> On Fri, Aug 8, 2025 at 10:56 AM Jyothsna Konisa <
>> [email protected]>
>> > > >> wrote:
>> > > >>
>> > > >>> Thanks, Stefan, for your feedback!
>> > > >>>
>> > > >>> To answer your questions,
>> > > >>>
>> > > >>> 1. I agree; annotations can optionally take arguments, and if an
>> > > >>> annotation doesn't have an argument, we can skip the arguments in
>> the
>> > > >>> "DESCRIBE" statement's output.
>> > > >>>
>> > > >>> 2. Good point. We originally considered using "ANNOTATED WITH"
>> but found
>> > > >>> it too verbose. As an alternative, we proposed using "@"
>> preceding the
>> > > >>> annotation to signal it to the parser. We are open to using an
>> explicit
>> > > >>> phrase like "ANNOTATED WITH" if you think it would make the code
>> more
>> > > >>> readable.
>> > > >>>
>> > > >>> A full example of annotations along with constraints and masking
>> could
>> > > >>> be:
>> > > >>>
>> > > >>>
>> > > >>> CREATE TABLE test_ks.test_table (
>> > > >>>     id int PRIMARY KEY,
>> > > >>>     col2 int CHECK col2 > 0 ANNOTATED WITH @PII AND
>> @DESCRIPTION('this
>> > > >>> is column col2') MASKED WITH default()
>> > > >>> );
>> > > >>>
>> > > >>> OR
>> > > >>>
>> > > >>> CREATE TABLE test_ks.test_table (
>> > > >>>     id int PRIMARY KEY,
>> > > >>>     col2 int CHECK col2 > 0 @PII AND @DESCRIPTION('this is column
>> col2')
>> > > >>> MASKED WITH default()
>> > > >>> );
>> > > >>>
>> > > >>>
>> > > >>>
>> > > >>> 3. We do not have a prototype yet, but I think we will have to
>> introduce
>> > > >>> a new parsing branch for annotations at the table level.
>> > > >>>
>> > > >>> I hope I answered all your questions!
>> > > >>>
>> > > >>> - Jyothsna
>> > > >>>
>> > > >>> On Thu, Aug 7, 2025 at 11:36 AM Joel Shepherd <
>> [email protected]>
>> > > >>> wrote:
>> > > >>>
>> > > >>>> I like the aim of the CEP. Completely onboard with the idea that
>> GenAI
>> > > >>>> tooling works better when you can provide it useful context
>> about the data
>> > > >>>> it is working with. An organization I worked with in the past
>> had a lot of
>> > > >>>> good results with marking up API models (not DB schemas, but
>> similar idea)
>> > > >>>> with authorization-related annotations and using those to drive
>> policy
>> > > >>>> linters and end-user interfaces. So, sold on the value of the
>> capability.
>> > > >>>>
>> > > >>>> Two things I'm less sure of:
>> > > >>>>
>> > > >>>> 1) User-defined vs pre-defined annotation types: I appreciate the
>> > > >>>> flexibility that user-defined annotations appear to give, but
>> it adds
>> > > >>>> extra room for error. E.g. if annotation names are
>> case-sensitive, do I
>> > > >>>> (the user) have to actively prevent creation of @description?
>> Or, police
>> > > >>>> the accidental creation of alternative names like @Desc? If the
>> community
>> > > >>>> settled on a small, fixed set of supported annotations, so
>> Cassandra itself
>> > > >>>> was authoritative for valid annotation names, would that make the
>> feature a lot
>> > > >>>> less valuable, or prevent offering user-defined annotations in
>> the future?
>> > > >>>>
>> > > >>>> 2) Syntactically, is it worth considering other alternatives? I
>> was
>> > > >>>> trying to imagine a CREATE TABLE statement marked up with two or
>> three
>> > > >>>> types of column-level annotations, and my sense is that it could
>> get hard
>> > > >>>> to read quickly. Is it worth considering Javadoc-style
>> annotations in
>> > > >>>> schema comments instead? I think in today's world that means
>> that they
>> > > >>>> would not be accessible via CQL/Cassandra (CQL comments are not
>> persisted
>> > > >>>> as part of the schema, correct?) but they could be accessible to
>> other
>> > > >>>> schema-processing tools and IMO be a more readable syntax. It'd
>> be good to
>> > > >>>> work through a couple use-cases for actually using the data
>> provided by the
>> > > >>>> annotations and get a sense of whether making them first-class
>> entities in
>> > > >>>> CQL is necessary for getting most of the value from them.
>> > > >>>>
>> > > >>>> Thanks -- Joel.
>> > > >>>> On 8/6/2025 6:59 PM, Jyothsna Konisa wrote:
>> > > >>>>
>> > > >>>> Sorry for the incorrect editable link, here is the updated link
>> to the CEP
>> > > >>>> 52: Schema Annotations for ApacheCassandra
>> > > >>>> <
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP+52%3A+Schema+Annotations+for+ApacheCassandra
>> >
>> > > >>>>
>> > > >>>> On Wed, Aug 6, 2025 at 4:26 PM Jyothsna Konisa <
>> [email protected]>
>> > > >>>> wrote:
>> > > >>>>
>> > > >>>>> Hello Everyone!
>> > > >>>>>
>> > > >>>>> We would like to propose CEP 52: Schema Annotations for
>> > > >>>>> ApacheCassandra
>> > > >>>>> <
>> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=373887528&draftShareId=339b7f4e-9bc2-45bd-9a80-b0d4215e3f45&;
>> >
>> > > >>>>>
>> > > >>>>> This CEP outlines a plan to introduce *Schema Annotations* as a
>> way
>> > > >>>>> to add better context to schema elements. We're also proposing
>> a set of new
>> > > >>>>> DDL statements to manage these annotations.
>> > > >>>>>
>> > > >>>>> We believe these annotations will be highly beneficial for
>> several key
>> > > >>>>> areas:
>> > > >>>>>
>> > > >>>>>    -
>> > > >>>>>
>> > > >>>>>    GenAI Applications: Providing more context to LLMs could
>> > > >>>>>    significantly improve the accuracy and relevance of
>> generated content.
>> > > >>>>>    -
>> > > >>>>>
>> > > >>>>>    Data Governance: Annotations can help in enforcing policies.
>> > > >>>>>    -
>> > > >>>>>
>> > > >>>>>    Compliance: They can be used to track and manage compliance
>> > > >>>>>    requirements directly within the schema.
>> > > >>>>>
>> > > >>>>> We're eager to hear your thoughts and feedback on this proposal.
>> > > >>>>> Please keep the discussion within this mailing thread.
>> > > >>>>>
>> > > >>>>> Thanks for your time and feedback in advance.
>> > > >>>>>
>> > > >>>>> Best regards,
>> > > >>>>>
>> > > >>>>> Jyothsna & Yifan
>> > > >>>>>
>> > > >>>>>
>> > > >>>>>
>> > > >>>>>
>> > >
>>
>>
