Re: [DISCUSS] CEP 52: Schema Annotations for ApacheCassandra

Štefan Miklošovič Tue, 12 Aug 2025 09:31:50 -0700

One more point I would like to add. If we enrich the output with comments,
I think that showing comments should be the default only if I can take what
DESCRIBE prints, copy it as-is, and create tables from it. Very
often, DESCRIBE acts as something like "I will copy this schema here so I
can reconstruct it later". So I would expect that, by default, what
DESCRIBE gives is "reconstructable". I think there are already a lot of
tests which verify that what DESCRIBE prints can be reconstructed, and this
would need to be preserved.


We might still do "DESCRIBE ks.tb" without comments / annotations and then
"DESCRIBE ks.tb WITH COMMENTS / ANNOTATIONS" to print them.

If we print comments like this, the output is "reconstructable by copy-pasting" as well:

create table ks.tb
(
    -- my primary key column
    id int primary key,
    -- this is my value
    val text
)

however, this is not:

create table ks.tb
(
    /**
     my primary key column
    */
    id int primary key,
    val text
)

you got me ...

Also, if we start to automatically enrich DESCRIBE output, it would be very
nice if this was digestible by previous versions. If I copy
DESCRIBE output from 5.1 containing @PII, I cannot just apply it to 5.0,
where that concept is not known yet. Plain comments, however, do work in
previous versions as well.

For this reason I would not make annotations visible by default. I would
make them opt-in via WITH COMMENTS / WITH ANNOTATIONS only and keep the
current output as is.
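
To make the opt-in idea concrete, here is a minimal sketch in Python. It is
not Cassandra code: describe_table, strip_line_comments and the column model
are hypothetical names used only for illustration. It assumes comments are
rendered as whole-line "--" comments, and shows that the default output stays
unchanged while the commented output reduces to the same statement once the
comment lines are ignored.

```python
import re

# Hypothetical column model: (name, CQL type, optional comment).
COLUMNS = [
    ("id", "int primary key", "my primary key column"),
    ("val", "text", "this is my value"),
]


def describe_table(keyspace, table, columns, with_comments=False):
    """Render a CREATE TABLE statement; comments appear only on opt-in."""
    lines = [f"create table {keyspace}.{table}", "("]
    for i, (name, cql_type, comment) in enumerate(columns):
        if with_comments and comment:
            lines.append(f"    -- {comment}")
        separator = "," if i < len(columns) - 1 else ""
        lines.append(f"    {name} {cql_type}{separator}")
    lines.append(")")
    return "\n".join(lines)


def strip_line_comments(cql):
    """Drop whole-line '--' comments (a simplification of what a parser skips)."""
    kept = [ln for ln in cql.splitlines() if not re.match(r"\s*--", ln)]
    return "\n".join(kept)


plain = describe_table("ks", "tb", COLUMNS)
commented = describe_table("ks", "tb", COLUMNS, with_comments=True)
```

Under these assumptions, strip_line_comments(commented) is identical to
plain, which is the "reconstructable by copy-pasting" property described
above.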


On Tue, Aug 12, 2025 at 10:56 AM Mick <[email protected]> wrote:

> a point of order and a reminder: aside from suggestions that the CEP
> author is free to adopt or not, anything that aims to steer what the
> CEP should be should be accompanied by a willingness to commit to
> helping make it happen.  we want to work as a meritocracy: those that
> lead the work have the say, and blocking their chosen approach against
> their wishes is only for clear technical reasons.  API designs (CQL
> additions) always need to be chosen and evolved carefully, and every CEP
> proposed should be open to that being naturally part of its discussion
> pre-vote.
>
> following the PG approach does make a lot of sense.
> what are your thoughts on it Jyothsna & Yifan ?
>
>
>
> > On 12 Aug 2025, at 09:14, Štefan Miklošovič <[email protected]>
> wrote:
> >
> > I like the idea of COMMENT ON and alike from PG! Yes, great stuff, as we
> do not invent anything custom and we will be as close as possible to
> industry standard.
> >
> > So, if I understand this correctly, on COMMENT ON, we would save each
> comment to a dedicated table. Then on DESCRIBE, we would "enrich" the CQL
> element we are describing with commentary, if any, from that comment table,
> correct?
> >
> > I, in general, support this idea, but as usual the devil is in the
> details. I am just genuinely curious how this would work in practice.
> >
> >
> > If we go with COMMENT ON, is this going to be stored to TCM or not?
> >
> >
> > If the answer is yes, then it is much simpler, because then this
> > commentary would be dispersed by means of TCM and each node would apply
> > this transformation locally to system_schema.annotations.
> >
> > If the answer is no and if there is a cluster and we do COMMENT ON, then
> this comment has to be saved to a table. If we rule out TCM as a vehicle
> for the dispersion of these comments, that comment table has to be
> distributed / replicated, correct? I do not think that we can create that
> table under system_schema then, as that is on LocalStrategy and all
> modifications to that are, as I understand it, done via TCM?
> >
> > Hence, I guess the better place for that is under system_distributed?
> That means that if somebody changes that keyspace to NTS or nodes are not
> available, we will not be able to create any commentary.
> >
> > Also, if we remove / alter anything, like dropping a keyspace, table,
> index, removing column etc ... all these changes would need to also remove
> respective comments from that table etc etc.
> >
> > For these reasons, I think the best way to do this is to have a dedicated
> > system_schema.annotations table, interact with it via COMMENT ON to be
> > "PG-compatible" while people can still query that table directly, and back
> > COMMENT ON by TCM as another transformation (as COMMENT ON is inherently
> > part of the schema).
> >
> > On Mon, Aug 11, 2025 at 10:55 PM Patrick McFadin <[email protected]>
> wrote:
> > One (of many) reasons I'm advocating we migrate away from CQL. It served
> > a purpose at the time, but this project is evolving and this to me seems
> > like the logical next iteration. The Cassandra project has built its
> > reputation on what it can do, not clever syntax design. ;)
> >
> > Patrick
> >
> > On Mon, Aug 11, 2025 at 1:51 PM Yifan Cai <[email protected]> wrote:
> > The reasoning on operator and LLM familiarity is spot on.
> >
> > I have experimented with LLM generated queries. It typically does a
> noticeably better job on SQL than CQL.
> >
> > - Yifan
> >
> > On Mon, Aug 11, 2025 at 1:44 PM Patrick McFadin <[email protected]>
> wrote:
> > I really love this CEP.  +1 on the goal.
> >
> > As you've already seen, I've been advocating to improve our syntax
> ergonomics towards more mainstream SQL and avoiding new/custom syntax.  I
> would suggest the following changes towards that goal:
> >  - Reuse PG-shaped DDL. Keep human text in COMMENT ON[1] (map existing
> table comments to that). For structured tags, mirror SECURITY LABEL[2]:
> > SECURITY LABEL FOR <provider> ON <object> IS '<text>';
> >
> > - Allow multiple providers per object. Store the value as text in v1
> (JSON or key/val later if we want), which avoids inventing new inline @
> syntax.
> >
> >  - Avoid new grammar in CREATE/ALTER. Skipping inline @PII keeps schemas
> readable and the grammar simple. Tools can issue COMMENT ON/SECURITY LABEL
> right after DDL, like PG users do today.
> >
> >  - Names & built-ins. Case-insensitive provider names with canonical
> lowercase. No separate @Description type. COMMENT ON already covers that
> use case cleanly.
> >
> >  - Introspection by query and by DESC. Keep annotations visible in
> DESCRIBE, but also expose a single system_schema.annotations view
> (provider, object_type, object_name, sub_name, value) so folks can get all
> annotations for a table. Example: “find all columns labeled PII,” etc.
> >
> > Why PG-like? Besides operator familiarity, there’s far more training
> data and tooling around COMMENT ON/SECURITY LABEL than around bespoke
> @annotation syntax. Sticking to that shape reduces LLM/tool friction and
> avoids teaching the world a new grammar. This has been a huge challenge for
> Cassandra work with LLMs as models tend to drift towards PG SQL in CQL
> often. (No Claude, JOIN is not a keyword in Cassandra)
> >
> > If this direction sounds good, happy to help update the CEP text and
> examples.
> >
> > Patrick
> >
> > 1: COMMENT ON docs
> https://www.postgresql.org/docs/current/sql-comment.html
> > 2: SECURITY LABEL docs
> https://www.postgresql.org/docs/current/sql-security-label.html
> >
> >
> > On Mon, Aug 11, 2025 at 10:18 AM Yifan Cai <[email protected]> wrote:
> > IMO, the full schema or table schema output already makes it possible to
> filter the fields (not limited to columns) that are using certain
> annotations, relatively easily. Grepping or parsing, whichever is more
> suitable for the scenarios; consumers make the call.
> > Providing such a dedicated query adds little value while adding quite a
> > lot of complexity to the design of this CEP. Please correct me if I have
> > misunderstood the queries.
> >
> > Another reason for preferring the existing "DESCRIBE" statements is the
> gen-AI enrichment mentioned in the CEP. We most likely want to feed the LLM
> the full (table) schema.
> >
> > The primary goal is to enrich the schema with annotations. Through the
> discussion thread, we will find out whether there is enough motivation to
> support such queries to filter by annotation. I appreciate that you brought
> up the idea.
> >
> > Although we are not at the stage of talking about the implementation,
> just sharing my thoughts a bit, I am thinking of the approach (1) that
> Stefan mentioned.
> >
> > - Yifan
> >
> > On Mon, Aug 11, 2025 at 6:31 AM Francisco Guerrero <[email protected]>
> wrote:
> > Another interesting query would be to retrieve all the fields annotated
> with PII
> > for example.
> >
> > On 2025/08/11 01:01:21 Yifan Cai wrote:
> > > >
> > > > Will there be an option to do a SELECT query to read all the
> annotations
> > > > of a table?
> > >
> > >
> > > It is an interesting question! Would you mind sharing an example of the
> > > output you'd expect from a query like *"SELECT * FROM
> > > system_schema.annotations where keyspace_name=<> and table_name=<>"*?
> I am
> > > curious how that might differ from what we get when running "DESC
> TABLE".
> > >
> > > - Yifan
> > >
> > > On Sat, Aug 9, 2025 at 9:43 AM Jaydeep Chovatia <
> [email protected]>
> > > wrote:
> > >
> > > > >we could explore enriching the syntax with DESCRIBE
> > > >
> > > > Will there be an option to do a SELECT query to read all the
> annotations
> > > > of a table? Something like *"SELECT * FROM system_schema.annotations
> > > > where keyspace_name=<> and table_name=<>"*
> > > > It would be helpful to have a structured CQL query on top of
> printing the
> > > > annotations through DESC so that the information can be consumed
> easily.
> > > >
> > > > Jaydeep
> > > >
> > > > On Fri, Aug 8, 2025 at 11:03 AM Jyothsna Konisa <
> [email protected]>
> > > > wrote:
> > > >
> > > >> Thanks, Joel, for the positive response.
> > > >>
> > > >> 1. User-defined vs. pre-defined annotation types
> > > >>
> > > >> We'd like to have one predefined annotation, Description, but also
> give
> > > >> users the flexibility to create new ones. If a user feels that a
> custom
> > > >> annotation like @Desc suits their use case, they should be allowed
> to use
> > > >> it, as these elements are purely descriptive and have no actions
> associated
> > > >> with them.
> > > >>
> > > >> 2. Syntactically, is it worth considering other alternatives?
> > > >>
> > > >> You're concerned that having several annotations on multiple columns
> > > >> could make schemas difficult to read. For now, we can have
> annotations
> > > >> printed as part of DESCRIBE statements. If there's a strong need to
> > > >> suppress annotations for readability, we could explore enriching
> the syntax
> > > >> with DESCRIBE [FULL] SCHEMA [WITH ANNOTATIONS], similar to the
> existing
> > > >> DESCRIBE [FULL] SCHEMA.
> > > >>
> > > >> Thanks,
> > > >> Jyothsna
> > > >>
> > > >> On Fri, Aug 8, 2025 at 10:56 AM Jyothsna Konisa <
> [email protected]>
> > > >> wrote:
> > > >>
> > > >>> Thanks, Stefan, for your feedback!
> > > >>>
> > > >>> To answer your questions,
> > > >>>
> > > >>> 1. I agree; annotations can optionally take arguments, and if an
> > > >>> annotation doesn't have an argument, we can skip the arguments in
> the
> > > >>> "DESCRIBE" statement's output.
> > > >>>
> > > >>> 2. Good point. We originally considered using "ANNOTATED WITH" but
> found
> > > >>> it too verbose. As an alternative, we proposed using "@" preceding
> the
> > > >>> annotation to signal it to the parser. We are open to using an
> explicit
> > > >>> phrase like "ANNOTATED WITH" if you think it would make the code
> more
> > > >>> readable.
> > > >>>
> > > >>> A full example of annotations along with constraints and masking
> could
> > > >>> be:
> > > >>>
> > > >>>
> > > >>> CREATE TABLE test_ks.test_table (
> > > >>>     id int PRIMARY KEY,
> > > >>>     col2 int CHECK col2 > 0
> > > >>>         ANNOTATED WITH @PII AND @DESCRIPTION('this is column col2')
> > > >>>         MASKED WITH default()
> > > >>> );
> > > >>>
> > > >>> OR
> > > >>>
> > > >>> CREATE TABLE test_ks.test_table (
> > > >>>     id int PRIMARY KEY,
> > > >>>     col2 int CHECK col2 > 0
> > > >>>         @PII AND @DESCRIPTION('this is column col2')
> > > >>>         MASKED WITH default()
> > > >>> );
> > > >>>
> > > >>>
> > > >>>
> > > >>> 3. We do not have a prototype yet, but I think we will have to
> > > >>> introduce a new parsing branch for annotations at the table level.
> > > >>>
> > > >>> I hope I answered all your questions!
> > > >>>
> > > >>> - Jyothsna
> > > >>>
> > > >>> On Thu, Aug 7, 2025 at 11:36 AM Joel Shepherd <[email protected]
> >
> > > >>> wrote:
> > > >>>
> > > >>>> I like the aim of the CEP. Completely onboard with the idea that
> GenAI
> > > >>>> tooling works better when you can provide it useful context about
> the data
> > > >>>> it is working with. An organization I worked with in the past had
> a lot of
> > > >>>> good results with marking up API models (not DB schemas, but
> similar idea)
> > > >>>> with authorization-related annotations and using those to drive
> policy
> > > >>>> linters and end-user interfaces. So, sold on the value of the
> capability.
> > > >>>>
> > > >>>> Two things I'm less sure of:
> > > >>>>
> > > >>>> 1) User-defined vs pre-defined annotation types: I appreciate the
> > > >>>> flexibility that user-defined annotations appear to give, but it adds
> > > >>>> extra room for error. E.g. if annotation names are case-sensitive, do I
> > > >>>> (the user) have to actively prevent creation of @description? Or police
> > > >>>> the accidental creation of alternative names like @Desc? If the community
> > > >>>> settled on a small, fixed set of supported annotations, so that Cassandra
> > > >>>> itself was authoritative for valid annotation names, would that make the
> > > >>>> feature a lot less valuable, or prevent offering user-defined annotations
> > > >>>> in the future?
> > > >>>>
> > > >>>> 2) Syntactically, is it worth considering other alternatives? I
> was
> > > >>>> trying to imagine a CREATE TABLE statement marked up with two or
> three
> > > >>>> types of column-level annotations, and my sense is that it could
> get hard
> > > >>>> to read quickly. Is it worth considering Javadoc-style
> annotations in
> > > >>>> schema comments instead? I think in today's world that means that
> they
> > > >>>> would not be accessible via CQL/Cassandra (CQL comments are not
> persisted
> > > >>>> as part of the schema, correct?) but they could be accessible to
> other
> > > >>>> schema-processing tools and IMO be a more readable syntax. It'd
> be good to
> > > >>>> work through a couple use-cases for actually using the data
> provided by the
> > > >>>> annotations and get a sense of whether making them first-class
> entities in
> > > >>>> CQL is necessary for getting most of the value from them.
> > > >>>>
> > > >>>> Thanks -- Joel.
> > > >>>> On 8/6/2025 6:59 PM, Jyothsna Konisa wrote:
> > > >>>>
> > > >>>> Sorry for the incorrect editable link, here is the updated link to
> > > >>>> CEP 52: Schema Annotations for ApacheCassandra
> > > >>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP+52%3A+Schema+Annotations+for+ApacheCassandra>
> > > >>>>
> > > >>>> On Wed, Aug 6, 2025 at 4:26 PM Jyothsna Konisa <
> [email protected]>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Hello Everyone!
> > > >>>>>
> > > >>>>> We would like to propose CEP 52: Schema Annotations for
> > > >>>>> ApacheCassandra
> > > >>>>> <
> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=373887528&draftShareId=339b7f4e-9bc2-45bd-9a80-b0d4215e3f45&;
> >
> > > >>>>>
> > > >>>>> This CEP outlines a plan to introduce *Schema Annotations* as a
> way
> > > >>>>> to add better context to schema elements. We're also proposing a
> set of new
> > > >>>>> DDL statements to manage these annotations.
> > > >>>>>
> > > >>>>> We believe these annotations will be highly beneficial for
> several key
> > > >>>>> areas:
> > > >>>>>
> > > >>>>>    - GenAI Applications: Providing more context to LLMs could
> > > >>>>>      significantly improve the accuracy and relevance of generated
> > > >>>>>      content.
> > > >>>>>    - Data Governance: Annotations can help in enforcing policies.
> > > >>>>>    - Compliance: They can be used to track and manage compliance
> > > >>>>>      requirements directly within the schema.
> > > >>>>>
> > > >>>>> We're eager to hear your thoughts and feedback on this proposal.
> > > >>>>> Please keep the discussion within this mailing thread.
> > > >>>>>
> > > >>>>> Thanks for your time and feedback in advance.
> > > >>>>>
> > > >>>>> Best regards,
> > > >>>>>
> > > >>>>> Jyothsna & Yifan
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > >
>
>
