Re: [DISCUSS] CEP 52: Schema Annotations for ApacheCassandra

Mick Tue, 12 Aug 2025 01:56:55 -0700

a point of order and a reminder: aside from suggestions that the CEP author is 
free to adopt or not, anything that presumes to steer what the CEP should be 
should be accompanied by a willingness to commit to helping make it 
happen.  we want to work as a meritocracy: those that lead the work have the 
say, and blocking their chosen approach against their wishes is justified only 
on clear technical grounds.  API designs (CQL additions) always need to be 
chosen and evolved carefully, and every CEP proposed should be open to that 
being naturally part of its discussion pre-vote.


following the PG approach does make a lot of sense.
what are your thoughts on it, Jyothsna & Yifan?
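
To make the PG-shaped idea quoted below a bit more concrete, here is a minimal
Python sketch of the single annotations view suggested in the thread, with the
columns (provider, object_type, object_name, sub_name, value), and of the
"find all columns labeled PII" lookup. It is only an in-memory model, not
Cassandra code. The Annotation class, the find_labeled helper, the
"governance" provider name and the sample rows are all assumptions invented
for illustration; only the column names and the test_ks.test_table example
come from the thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    """One row of the proposed annotations view (column names from the thread)."""
    provider: str             # e.g. "comment" or a label provider (hypothetical names)
    object_type: str          # "table", "column", ...
    object_name: str          # keyspace-qualified object, e.g. "test_ks.test_table"
    sub_name: Optional[str]   # column name for column-level rows, else None
    value: str                # plain text, as suggested for v1


def find_labeled(rows: List[Annotation], provider: str, value: str) -> List[str]:
    """Return 'object.column' for every column carrying the given label.

    Provider names are compared after lowercasing, mirroring the
    "case-insensitive, canonical lowercase" suggestion in the thread.
    """
    return [
        f"{r.object_name}.{r.sub_name}"
        for r in rows
        if r.object_type == "column"
        and r.provider.lower() == provider.lower()
        and r.value == value
    ]


# Hypothetical sample rows; not taken from any real cluster.
ROWS = [
    Annotation("comment", "column", "test_ks.test_table", "col2", "this is column col2"),
    Annotation("Governance", "column", "test_ks.test_table", "col2", "PII"),
    Annotation("governance", "table", "test_ks.test_table", None, "internal"),
]

print(find_labeled(ROWS, "governance", "PII"))  # prints ['test_ks.test_table.col2']
```

Whether such rows would live in TCM-backed schema or in a separate table is
exactly the open question raised below; the sketch takes no position on that.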



> On 12 Aug 2025, at 09:14, Štefan Miklošovič wrote:
> 
> I like the idea of COMMENT ON and the like from PG! Yes, great stuff, as we do 
> not invent anything custom and we will be as close as possible to the 
> industry standard.
> 
> So, if I understand this correctly, on COMMENT ON, we would save each comment 
> to a dedicated table. Then on DESCRIBE, we would "enrich" the CQL element we 
> are describing with commentary, if any, from that comment table, correct?
> 
> I, in general, support this idea, but as usual the devil is in the details. I 
> am just genuinely curious how this would work in practice.
> 
> 
> If we go with COMMENT ON, is this going to be stored to TCM or not?
> 
> 
> If the answer is yes, then it is much simpler, because then this 
> commentary would be dispersed by means of TCM and each node would apply 
> this transformation locally to system_schema.annotations.
> 
> If the answer is no and if there is a cluster and we do COMMENT ON, then this 
> comment has to be saved to a table. If we rule out TCM as a vehicle for the 
> dispersion of these comments, that comment table has to be distributed / 
> replicated, correct? I do not think that we can create that table under 
> system_schema then, as that is on LocalStrategy and all modifications to that 
> are, as I understand it, done via TCM?
> 
> Hence, I guess the better place for that is under system_distributed? That 
> means that if somebody changes that keyspace to NTS or nodes are not 
> available, we will not be able to create any commentary.
> 
> Also, if we remove / alter anything, like dropping a keyspace, table, or 
> index, or removing a column, all these changes would also need to remove the 
> respective comments from that table.
> 
> For these reasons, I think the best way to do this is to have a dedicated 
> system_schema.annotations table that people can query directly, to interact 
> with it via COMMENT ON to be "PG-compatible", and to back COMMENT ON by TCM 
> as another transformation (as COMMENT ON is inherently part of the schema).
> 
> On Mon, Aug 11, 2025 at 10:55 PM Patrick McFadin wrote:
> One (of many) reasons I'm advocating we migrate away from CQL. It served a 
> purpose at the time, but this project is evolving and this to me seems like 
> the logical next iteration. The Cassandra project has built its reputation 
> on what it can do, not clever syntax design. ;) 
> 
> Patrick
> 
> On Mon, Aug 11, 2025 at 1:51 PM Yifan Cai wrote:
> The reasoning on operator and LLM familiarity is spot on.
> 
> I have experimented with LLM-generated queries. The models typically do a 
> noticeably better job on SQL than CQL.
> 
> - Yifan
> 
> On Mon, Aug 11, 2025 at 1:44 PM Patrick McFadin wrote:
> I really love this CEP.  +1 on the goal. 
> 
> As you've already seen, I've been advocating to improve our syntax ergonomics 
> towards more mainstream SQL and avoiding new/custom syntax.  I would suggest 
> the following changes towards that goal:
>  - Reuse PG-shaped DDL. Keep human text in COMMENT ON[1] (map existing table 
> comments to that). For structured tags, mirror SECURITY LABEL[2]:
>      SECURITY LABEL FOR <provider> ON <object> IS '<text>';
> 
>  - Allow multiple providers per object. Store the value as text in v1 (JSON or 
> key/val later if we want), which avoids inventing new inline @ syntax.
> 
>  - Avoid new grammar in CREATE/ALTER. Skipping inline @PII keeps schemas 
> readable and the grammar simple. Tools can issue COMMENT ON/SECURITY LABEL 
> right after DDL, like PG users do today.
> 
>  - Names & built-ins. Case-insensitive provider names with canonical 
> lowercase. No separate @Description type. COMMENT ON already covers that use 
> case cleanly.
> 
>  - Introspection by query and by DESC. Keep annotations visible in DESCRIBE, 
> but also expose a single system_schema.annotations view (provider, 
> object_type, object_name, sub_name, value) so folks can get all annotations 
> for a table. Example: “find all columns labeled PII,” etc.
> 
> Why PG-like? Besides operator familiarity, there’s far more training data and 
> tooling around COMMENT ON/SECURITY LABEL than around bespoke @annotation 
> syntax. Sticking to that shape reduces LLM/tool friction and avoids teaching 
> the world a new grammar. This has been a huge challenge when working on 
> Cassandra with LLMs, as models often drift towards PG SQL when writing CQL. 
> (No, Claude, JOIN is not a keyword in Cassandra.)
> 
> If this direction sounds good, happy to help update the CEP text and examples.
> 
> Patrick
> 
> 1: COMMENT ON docs https://www.postgresql.org/docs/current/sql-comment.html
> 2: SECURITY LABEL docs 
> https://www.postgresql.org/docs/current/sql-security-label.html
> 
> 
> On Mon, Aug 11, 2025 at 10:18 AM Yifan Cai wrote:
> IMO, the full schema or table schema output already makes it relatively easy 
> to filter the fields (not limited to columns) that use certain annotations. 
> Grepping or parsing, whichever is more suitable for the scenario; consumers 
> make the call. 
> Providing such a dedicated query adds little value while adding quite a lot 
> of complexity to the design of this CEP. Please correct me if I have 
> misunderstood the queries.
> 
> Another reason for preferring the existing "DESCRIBE" statements is the 
> gen-AI enrichment mentioned in the CEP. We most likely want to feed the LLM 
> the full (table) schema. 
> 
> The primary goal is to enrich the schema with annotations. Through the 
> discussion thread, we will find out whether there is enough motivation to 
> support such queries to filter by annotation. I appreciate that you brought 
> up the idea. 
> 
> Although we are not yet at the stage of discussing the implementation, to 
> share my thoughts a bit: I am thinking of the approach (1) that Stefan 
> mentioned.
> 
> - Yifan
> 
> On Mon, Aug 11, 2025 at 6:31 AM Francisco Guerrero wrote:
> Another interesting query would be to retrieve all the fields annotated with 
> PII, for example.
> 
> On 2025/08/11 01:01:21 Yifan Cai wrote:
> > >
> > > Will there be an option to do a SELECT query to read all the annotations
> > > of a table?
> > 
> > 
> > It is an interesting question! Would you mind sharing an example of the
> > output you'd expect from a query like *"SELECT * FROM
> > system_schema.annotations where keyspace_name=<> and table_name=<>"*? I am
> > curious how that might differ from what we get when running "DESC TABLE".
> > 
> > - Yifan
> > 
> > On Sat, Aug 9, 2025 at 9:43 AM Jaydeep Chovatia wrote:
> > 
> > > >we could explore enriching the syntax with DESCRIBE
> > >
> > > Will there be an option to do a SELECT query to read all the annotations
> > > of a table? Something like *"SELECT * FROM system_schema.annotations
> > > where keyspace_name=<> and table_name=<>"*
> > > It would be helpful to have a structured CQL query on top of printing the
> > > annotations through DESC so that the information can be consumed easily.
> > >
> > > Jaydeep
> > >
> > > On Fri, Aug 8, 2025 at 11:03 AM Jyothsna Konisa wrote:
> > >
> > >> Thanks, Joel, for the positive response.
> > >>
> > >> 1. User-defined vs. pre-defined annotation types
> > >>
> > >> We'd like to have one predefined annotation, Description, but also give
> > >> users the flexibility to create new ones. If a user feels that a custom
> > >> annotation like @Desc suits their use case, they should be allowed to use
> > >> it, as these elements are purely descriptive and have no actions
> > >> associated with them.
> > >>
> > >> 2. Syntactically, is it worth considering other alternatives?
> > >>
> > >> You're concerned that having several annotations on multiple columns
> > >> could make schemas difficult to read. For now, we can have annotations
> > >> printed as part of DESCRIBE statements. If there's a strong need to
> > >> suppress annotations for readability, we could explore enriching the
> > >> syntax with DESCRIBE [FULL] SCHEMA [WITH ANNOTATIONS], similar to the
> > >> existing DESCRIBE [FULL] SCHEMA.
> > >>
> > >> Thanks,
> > >> Jyothsna
> > >>
> > >> On Fri, Aug 8, 2025 at 10:56 AM Jyothsna Konisa wrote:
> > >>
> > >>> Thanks, Stefan, for your feedback!
> > >>>
> > >>> To answer your questions,
> > >>>
> > >>> 1. I agree; annotations can optionally take arguments, and if an
> > >>> annotation doesn't have an argument, we can skip the arguments in the
> > >>> "DESCRIBE" statement's output.
> > >>>
> > >>> 2. Good point. We originally considered using "ANNOTATED WITH" but found
> > >>> it too verbose. As an alternative, we proposed using "@" preceding the
> > >>> annotation to signal it to the parser. We are open to using an explicit
> > >>> phrase like "ANNOTATED WITH" if you think it would make the code more
> > >>> readable.
> > >>>
> > >>> A full example of annotations along with constraints and masking could
> > >>> be:
> > >>>
> > >>>
> > >>> CREATE TABLE test_ks.test_table (
> > >>>     id int PRIMARY KEY,
> > >>>     col2 int CHECK col2 > 0
> > >>>         ANNOTATED WITH @PII AND @DESCRIPTION('this is column col2')
> > >>>         MASKED WITH default()
> > >>> );
> > >>>
> > >>> OR
> > >>>
> > >>> CREATE TABLE test_ks.test_table (
> > >>>     id int PRIMARY KEY,
> > >>>     col2 int CHECK col2 > 0
> > >>>         @PII AND @DESCRIPTION('this is column col2')
> > >>>         MASKED WITH default()
> > >>> );
> > >>>
> > >>>
> > >>>
> > >>> 3. We do not have a prototype yet, but I think we will have to introduce
> > >>> a new parsing branch for annotations at the table level.
> > >>>
> > >>> I hope I answered all your questions!
> > >>>
> > >>> - Jyothsna
> > >>>
> > >>> On Thu, Aug 7, 2025 at 11:36 AM Joel Shepherd wrote:
> > >>>
> > >>>> I like the aim of the CEP. Completely on board with the idea that GenAI
> > >>>> tooling works better when you can provide it useful context about the
> > >>>> data it is working with. An organization I worked with in the past had a
> > >>>> lot of good results with marking up API models (not DB schemas, but
> > >>>> similar idea) with authorization-related annotations and using those to
> > >>>> drive policy linters and end-user interfaces. So, sold on the value of
> > >>>> the capability.
> > >>>>
> > >>>> Two things I'm less sure of:
> > >>>>
> > >>>> 1) User-defined vs pre-defined annotation types: I appreciate the
> > >>>> flexibility that user-defined annotations appear to give, but it adds
> > >>>> extra room for error. E.g. if annotation names are case-sensitive, do I
> > >>>> (the user) have to actively prevent creation of @description? Or police
> > >>>> the accidental creation of alternative names like @Desc? If the
> > >>>> community settled on a small, fixed set of supported annotations, so
> > >>>> that Cassandra itself was authoritative for valid annotation names,
> > >>>> would that make the feature a lot less valuable, or prevent offering
> > >>>> user-defined annotations in the future?
> > >>>>
> > >>>> 2) Syntactically, is it worth considering other alternatives? I was
> > >>>> trying to imagine a CREATE TABLE statement marked up with two or three
> > >>>> types of column-level annotations, and my sense is that it could get
> > >>>> hard to read quickly. Is it worth considering Javadoc-style annotations
> > >>>> in schema comments instead? I think in today's world that means that
> > >>>> they would not be accessible via CQL/Cassandra (CQL comments are not
> > >>>> persisted as part of the schema, correct?) but they could be accessible
> > >>>> to other schema-processing tools and IMO be a more readable syntax. It'd
> > >>>> be good to work through a couple use-cases for actually using the data
> > >>>> provided by the annotations and get a sense of whether making them
> > >>>> first-class entities in CQL is necessary for getting most of the value
> > >>>> from them.
> > >>>>
> > >>>> Thanks -- Joel.
> > >>>> On 8/6/2025 6:59 PM, Jyothsna Konisa wrote:
> > >>>>
> > >>>> Sorry for the incorrect editable link; here is the updated link to
> > >>>> CEP 52: Schema Annotations for ApacheCassandra
> > >>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP+52%3A+Schema+Annotations+for+ApacheCassandra>
> > >>>>
> > >>>> On Wed, Aug 6, 2025 at 4:26 PM Jyothsna Konisa wrote:
> > >>>>
> > >>>>> Hello Everyone!
> > >>>>>
> > >>>>> We would like to propose CEP 52: Schema Annotations for
> > >>>>> ApacheCassandra
> > >>>>> <https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=373887528&draftShareId=339b7f4e-9bc2-45bd-9a80-b0d4215e3f45&;>
> > >>>>>
> > >>>>> This CEP outlines a plan to introduce *Schema Annotations* as a way
> > >>>>> to add better context to schema elements. We're also proposing a set
> > >>>>> of new DDL statements to manage these annotations.
> > >>>>>
> > >>>>> We believe these annotations will be highly beneficial for several key
> > >>>>> areas:
> > >>>>>
> > >>>>>    - GenAI Applications: Providing more context to LLMs could
> > >>>>>      significantly improve the accuracy and relevance of generated
> > >>>>>      content.
> > >>>>>    - Data Governance: Annotations can help in enforcing policies.
> > >>>>>    - Compliance: They can be used to track and manage compliance
> > >>>>>      requirements directly within the schema.
> > >>>>>
> > >>>>> We're eager to hear your thoughts and feedback on this proposal.
> > >>>>> Please keep the discussion within this mailing thread.
> > >>>>>
> > >>>>> Thanks for your time and feedback in advance.
> > >>>>>
> > >>>>> Best regards,
> > >>>>>
> > >>>>> Jyothsna & Yifan
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> >
