Re: [DISCUSS] CEP 52: Schema Annotations for ApacheCassandra

Yifan Cai Mon, 11 Aug 2025 13:51:47 -0700

The reasoning about operator and LLM familiarity is spot on.
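
To make the PG shape concrete, here is a minimal Python sketch of what a
tool might emit right after DDL. It only builds statement strings in the
COMMENT ON / SECURITY LABEL form from the quoted message below. Cassandra
does not accept these statements today, and the helper names and the label
value 'true' are hypothetical, chosen purely for illustration.

```python
# Sketch only: Cassandra does not accept these statements today. The shapes
# follow PostgreSQL's COMMENT ON and SECURITY LABEL, as suggested in the
# quoted message. All function names here are hypothetical.

def quote_literal(text: str) -> str:
    """Quote a string literal SQL-style by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def comment_on_column(table: str, column: str, text: str) -> str:
    """Build a PG-shaped COMMENT ON statement for one column."""
    return f"COMMENT ON COLUMN {table}.{column} IS {quote_literal(text)};"


def security_label(provider: str, table: str, column: str, value: str) -> str:
    """Build a PG-shaped SECURITY LABEL statement.

    The provider is lowercased, following the "canonical lowercase"
    suggestion in the quoted message.
    """
    return (
        f"SECURITY LABEL FOR {provider.lower()} "
        f"ON COLUMN {table}.{column} IS {quote_literal(value)};"
    )


statements = [
    comment_on_column("test_ks.test_table", "col2", "this is column col2"),
    security_label("PII", "test_ks.test_table", "col2", "true"),
]
for statement in statements:
    print(statement)
```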

I have experimented with LLM-generated queries. The models typically do a
noticeably better job on SQL than on CQL.
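
As a rough illustration of that drift, a pre-flight check like the Python
sketch below could flag SQL-only constructs in generated CQL before it is
sent to the cluster. The keyword list is an assumption and is deliberately
incomplete: only JOIN is named in this thread, and UNION and HAVING are my
additions. The function names are hypothetical, and double-quoted
identifiers are not handled.

```python
import re

# Assumption: an illustrative, incomplete list. Only JOIN is named in this
# thread; UNION and HAVING are added here as further SQL constructs that
# CQL does not provide.
SQL_ONLY_KEYWORDS = ("JOIN", "UNION", "HAVING")


def strip_string_literals(query: str) -> str:
    """Blank out single-quoted literals so keywords inside text are ignored."""
    return re.sub(r"'(?:[^']|'')*'", "''", query)


def sql_only_keywords(query: str) -> list[str]:
    """Return the SQL-only keywords found in the query, in list order."""
    bare = strip_string_literals(query)
    return [
        keyword
        for keyword in SQL_ONLY_KEYWORDS
        if re.search(rf"\b{keyword}\b", bare, flags=re.IGNORECASE)
    ]


generated = "SELECT * FROM users JOIN orders ON users.id = orders.user_id"
print(sql_only_keywords(generated))
```

A non-empty result would be a hint to re-prompt or reject the query; it is
not a substitute for actually parsing the CQL.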


- Yifan

On Mon, Aug 11, 2025 at 1:44 PM Patrick McFadin <[email protected]> wrote:

> I really love this CEP.  +1 on the goal.
>
> As you've already seen, I've been advocating to improve our syntax
> ergonomics towards more mainstream SQL and avoiding new/custom syntax.  I
> would suggest the following changes towards that goal:
>  - Reuse PG-shaped DDL. Keep human text in COMMENT ON[1] (map existing
> table comments to that). For structured tags, mirror SECURITY LABEL[2]:
> SECURITY LABEL FOR <provider> ON <object> IS '<text>';
>
>  - Allow multiple providers per object. Store the value as text in v1 (JSON
> or key/val later if we want), which avoids inventing new inline @ syntax.
>
>  - Avoid new grammar in CREATE/ALTER. Skipping inline @PII keeps schemas
> readable and the grammar simple. Tools can issue COMMENT ON/SECURITY LABEL
> right after DDL, like PG users do today.
>
>  - Names & built-ins. Case-insensitive provider names with canonical
> lowercase. No separate @Description type. COMMENT ON already covers that
> use case cleanly.
>
>  - Introspection by query and by DESC. Keep annotations visible in
> DESCRIBE, but also expose a single system_schema.annotations view
> (provider, object_type, object_name, sub_name, value) so folks can get all
> annotations for a table. Example: “find all columns labeled PII,” etc.
>
> Why PG-like? Besides operator familiarity, there’s far more training data
> and tooling around COMMENT ON/SECURITY LABEL than around bespoke
> @annotation syntax. Sticking to that shape reduces LLM/tool friction and
> avoids teaching the world a new grammar. This has been a huge challenge for
> Cassandra work with LLMs, as models often drift towards PG SQL when writing
> CQL. (No, Claude, JOIN is not a keyword in Cassandra.)
>
> If this direction sounds good, happy to help update the CEP text and
> examples.
>
> Patrick
>
> 1: COMMENT ON docs
> https://www.postgresql.org/docs/current/sql-comment.html
> 2: SECURITY LABEL docs
> https://www.postgresql.org/docs/current/sql-security-label.html
>
>
> On Mon, Aug 11, 2025 at 10:18 AM Yifan Cai <[email protected]> wrote:
>
>> IMO, the full schema or table schema output already makes it possible to
>> filter the fields (not limited to columns) that are using certain
>> annotations, relatively easily. Grepping or parsing, whichever is more
>> suitable for the scenarios; consumers make the call.
>> Providing such a dedicated query adds little value, yet it adds quite a
>> lot of complexity to the design of this CEP. Please
>> correct me if I have misunderstood the queries.
>>
>> Another reason for preferring the existing "DESCRIBE" statements is the
>> gen-AI enrichment mentioned in the CEP. We most likely want to feed the LLM
>> the full (table) schema.
>>
>> The primary goal is to enrich the schema with annotations. Through the
>> discussion thread, we will find out whether there is enough motivation to
>> support such queries to filter by annotation. I appreciate that you brought
>> up the idea.
>>
>> We are not yet at the stage of discussing the implementation, but to
>> share my thoughts a bit: I am thinking of approach (1) that
>> Stefan mentioned.
>>
>> - Yifan
>>
>> On Mon, Aug 11, 2025 at 6:31 AM Francisco Guerrero <[email protected]>
>> wrote:
>>
>>> Another interesting query would be to retrieve all the fields annotated
>>> with PII
>>> for example.
>>>
>>> On 2025/08/11 01:01:21 Yifan Cai wrote:
>>> > >
>>> > > Will there be an option to do a SELECT query to read all the
>>> annotations
>>> > > of a table?
>>> >
>>> >
>>> > It is an interesting question! Would you mind sharing an example of the
>>> > output you'd expect from a query like *"SELECT * FROM
>>> > system_schema.annotations where keyspace_name=<> and table_name=<>"*?
>>> I am
>>> > curious how that might differ from what we get when running "DESC
>>> TABLE".
>>> >
>>> > - Yifan
>>> >
>>> > On Sat, Aug 9, 2025 at 9:43 AM Jaydeep Chovatia <
>>> [email protected]>
>>> > wrote:
>>> >
>>> > > >we could explore enriching the syntax with DESCRIBE
>>> > >
>>> > > Will there be an option to do a SELECT query to read all the
>>> annotations
>>> > > of a table? Something like *"SELECT * FROM system_schema.annotations
>>> > > where keyspace_name=<> and table_name=<>"*
>>> > > It would be helpful to have a structured CQL query on top of
>>> printing the
>>> > > annotations through DESC so that the information can be consumed
>>> easily.
>>> > >
>>> > > Jaydeep
>>> > >
>>> > > On Fri, Aug 8, 2025 at 11:03 AM Jyothsna Konisa <
>>> [email protected]>
>>> > > wrote:
>>> > >
>>> > >> Thanks, Joel, for the positive response.
>>> > >>
>>> > >> 1. User-defined vs. pre-defined annotation types
>>> > >>
>>> > >> We'd like to have one predefined annotation, Description, but also
>>> give
>>> > >> users the flexibility to create new ones. If a user feels that a
>>> custom
>>> > >> annotation like @Desc suits their use case, they should be allowed
>>> to use
>>> > >> it, as these elements are purely descriptive and have no actions
>>> associated
>>> > >> with them.
>>> > >>
>>> > >> 2. Syntactically, is it worth considering other alternatives?
>>> > >>
>>> > >> You're concerned that having several annotations on multiple columns
>>> > >> could make schemas difficult to read. For now, we can have
>>> annotations
>>> > >> printed as part of DESCRIBE statements. If there's a strong need to
>>> > >> suppress annotations for readability, we could explore enriching
>>> the syntax
>>> > >> with DESCRIBE [FULL] SCHEMA [WITH ANNOTATIONS], similar to the
>>> existing
>>> > >> DESCRIBE [FULL] SCHEMA.
>>> > >>
>>> > >> Thanks,
>>> > >> Jyothsna
>>> > >>
>>> > >> On Fri, Aug 8, 2025 at 10:56 AM Jyothsna Konisa <
>>> [email protected]>
>>> > >> wrote:
>>> > >>
>>> > >>> Thanks, Stefan, for your feedback!
>>> > >>>
>>> > >>> To answer your questions,
>>> > >>>
>>> > >>> 1. I agree; annotations can optionally take arguments, and if an
>>> > >>> annotation doesn't have an argument, we can skip the arguments in
>>> the
>>> > >>> "DESCRIBE" statement's output.
>>> > >>>
>>> > >>> 2. Good point. We originally considered using "ANNOTATED WITH" but
>>> found
>>> > >>> it too verbose. As an alternative, we proposed using "@" preceding
>>> the
>>> > >>> annotation to signal it to the parser. We are open to using an
>>> explicit
>>> > >>> phrase like "ANNOTATED WITH" if you think it would make the code
>>> more
>>> > >>> readable.
>>> > >>>
>>> > >>> A full example of annotations along with constraints and masking
>>> could
>>> > >>> be:
>>> > >>>
>>> > >>>
>>> > >>> CREATE TABLE test_ks.test_table (
>>> > >>>     id int PRIMARY KEY,
>>> > >>>     col2 int CHECK col2 > 0 ANNOTATED WITH @PII AND
>>> > >>>         @DESCRIPTION('this is column col2') MASKED WITH default()
>>> > >>> );
>>> > >>>
>>> > >>> OR
>>> > >>>
>>> > >>> CREATE TABLE test_ks.test_table (
>>> > >>>     id int PRIMARY KEY,
>>> > >>>     col2 int CHECK col2 > 0 @PII AND @DESCRIPTION('this is column col2')
>>> > >>>         MASKED WITH default()
>>> > >>> );
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> 3. We do not have a prototype yet, but I think we will have to
>>> introduce
>>> > >>> a new parsing branch for annotations at the table level.
>>> > >>>
>>> > >>> I hope I answered all your questions!
>>> > >>>
>>> > >>> - Jyothsna
>>> > >>>
>>> > >>> On Thu, Aug 7, 2025 at 11:36 AM Joel Shepherd <[email protected]
>>> >
>>> > >>> wrote:
>>> > >>>
>>> > >>>> I like the aim of the CEP. Completely onboard with the idea that
>>> GenAI
>>> > >>>> tooling works better when you can provide it useful context about
>>> the data
>>> > >>>> it is working with. An organization I worked with in the past had
>>> a lot of
>>> > >>>> good results with marking up API models (not DB schemas, but
>>> similar idea)
>>> > >>>> with authorization-related annotations and using those to drive
>>> policy
>>> > >>>> linters and end-user interfaces. So, sold on the value of the
>>> capability.
>>> > >>>>
>>> > >>>> Two things I'm less sure of:
>>> > >>>>
>>> > >>>> 1) User-defined vs pre-defined annotation types: I appreciate the
>>> > >>>> flexibility that user-defined annotations appear to give, but it
>>> adds
>>> > >>>> extra room for error. E.g. if annotation names are
>>> case-sensitive, do I
>>> > >>>> (the user) have to actively prevent creation of @description? Or,
>>> police
>>> > >>>> the accidental creation of alternative names like @Desc? If the
>>> community
>>> > >>>> settled on a small, fixed set of supported annotations, so
>>> Cassandra itself
>>> > >>>> was authoritative for valid annotation names, would that make the
>>> feature a lot
>>> > >>>> less valuable, or prevent offering user-defined annotations in
>>> the future?
>>> > >>>>
>>> > >>>> 2) Syntactically, is it worth considering other alternatives? I
>>> was
>>> > >>>> trying to imagine a CREATE TABLE statement marked up with two or
>>> three
>>> > >>>> types of column-level annotations, and my sense is that it could
>>> get hard
>>> > >>>> to read quickly. Is it worth considering Javadoc-style
>>> annotations in
>>> > >>>> schema comments instead? I think in today's world that means that
>>> they
>>> > >>>> would not be accessible via CQL/Cassandra (CQL comments are not
>>> persisted
>>> > >>>> as part of the schema, correct?) but they could be accessible to
>>> other
>>> > >>>> schema-processing tools and IMO be a more readable syntax. It'd
>>> be good to
>>> > >>>> work through a couple use-cases for actually using the data
>>> provided by the
>>> > >>>> annotations and get a sense of whether making them first-class
>>> entities in
>>> > >>>> CQL is necessary for getting most of the value from them.
>>> > >>>>
>>> > >>>> Thanks -- Joel.
>>> > >>>> On 8/6/2025 6:59 PM, Jyothsna Konisa wrote:
>>> > >>>>
>>> > >>>> Sorry for the incorrect editable link, here is the updated link
>>> to the CEP
>>> > >>>> 52: Schema Annotations for ApacheCassandra
>>> > >>>> <
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP+52%3A+Schema+Annotations+for+ApacheCassandra
>>> >
>>> > >>>>
>>> > >>>> On Wed, Aug 6, 2025 at 4:26 PM Jyothsna Konisa <
>>> [email protected]>
>>> > >>>> wrote:
>>> > >>>>
>>> > >>>>> Hello Everyone!
>>> > >>>>>
>>> > >>>>> We would like to propose CEP 52: Schema Annotations for
>>> > >>>>> ApacheCassandra
>>> > >>>>> <
>>> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=373887528&draftShareId=339b7f4e-9bc2-45bd-9a80-b0d4215e3f45&;
>>> >
>>> > >>>>>
>>> > >>>>> This CEP outlines a plan to introduce *Schema Annotations* as a
>>> way
>>> > >>>>> to add better context to schema elements. We're also proposing a
>>> set of new
>>> > >>>>> DDL statements to manage these annotations.
>>> > >>>>>
>>> > >>>>> We believe these annotations will be highly beneficial for
>>> several key
>>> > >>>>> areas:
>>> > >>>>>
>>> > >>>>>    -
>>> > >>>>>
>>> > >>>>>    GenAI Applications: Providing more context to LLMs could
>>> > >>>>>    significantly improve the accuracy and relevance of generated
>>> content.
>>> > >>>>>    -
>>> > >>>>>
>>> > >>>>>    Data Governance: Annotations can help in enforcing policies.
>>> > >>>>>    -
>>> > >>>>>
>>> > >>>>>    Compliance: They can be used to track and manage compliance
>>> > >>>>>    requirements directly within the schema.
>>> > >>>>>
>>> > >>>>> We're eager to hear your thoughts and feedback on this proposal.
>>> > >>>>> Please keep the discussion within this mailing thread.
>>> > >>>>>
>>> > >>>>> Thanks for your time and feedback in advance.
>>> > >>>>>
>>> > >>>>> Best regards,
>>> > >>>>>
>>> > >>>>> Jyothsna & Yifan
>>> > >>>>>
>>> > >>>>>
>>> > >>>>>
>>> > >>>>>
>>> >
>>>
>>
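
For the "find all columns labeled PII" case raised in the thread above, a
consumer of the suggested view could do something like the Python sketch
below. The row shape follows the columns listed for
system_schema.annotations (provider, object_type, object_name, sub_name,
value). The view does not exist yet, and the class name, function name,
"column"/"table" type strings, and sample rows are all made up for
illustration.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One row, shaped like the view columns suggested in this thread."""
    provider: str
    object_type: str
    object_name: str
    sub_name: str
    value: str


def columns_labeled(rows, provider):
    """Return (object_name, sub_name) pairs for columns tagged by provider."""
    wanted = provider.lower()  # provider names are compared case-insensitively
    return [
        (row.object_name, row.sub_name)
        for row in rows
        if row.object_type == "column" and row.provider.lower() == wanted
    ]


# Made-up sample rows; the values are illustrative only.
rows = [
    Annotation("pii", "column", "test_ks.test_table", "col2", "true"),
    Annotation("pii", "table", "test_ks.test_table", "", "true"),
    Annotation("owner", "column", "test_ks.test_table", "id", "team-a"),
]
print(columns_labeled(rows, "PII"))  # [('test_ks.test_table', 'col2')]
```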
