That sounds great to me, and it will indeed stop someone from inadvertently clobbering the header. It fits naturally with how licensing is noted in source files in any case, just piecemeal rather than at the granularity of whole files.
On Thu, Dec 18, 2025 at 11:33 AM Peeyush Gupta <[email protected]> wrote:
>
> The clarification makes sense. I would like to propose another option for
> discussion, which is to add Generated-By as a comment in the code itself.
> This will be more fine-grained and will allow committers to know which piece
> of code is LLM generated while updating/reviewing the code. This will also
> avoid missing the generated-by field in case of cherry picks etc.
>
> From: Ian Maxon <[email protected]>
> Date: Thursday, December 18, 2025 at 11:18 AM
> To: [email protected] <[email protected]>
> Subject: Re: [DISCUSS] Adding information about AI usage into commit messages
>
> Good points Peeyush. I can't really think of a perfect answer. The
> best one I can think of is simply to leave it at the committer's
> discretion based on some general guidelines.
>
> For example, determining if the output is simply a regurgitation of
> someone else's code is not easy. However, certainly sometimes it is
> obvious. Things like very specific and unrelated comments, or
> unrelated code that has method or variable names that give it away.
> For those cases, we can clearly say the code is not a synthesis of the
> prompt and training data as a whole. Basically, you know it when you
> see it, but if you don't, then it's fine.
>
> I think tab completions are also another case where judgement can
> apply. Certainly tab completion is an old tool, and technically
> completions done by non-LLM methods could fall into the same traps by
> using snippets and macros. However an obvious next line of code is not
> something most people would call an original or creative work. So in
> those cases, I would judge it's not worth mentioning. If it's
> generating entire methods and classes, then it's probably worth
> mentioning, because that is more substantive output from the tool, not
> just an obvious addition.
>
> Committers are already trusted to give proper attribution to code they
> commit, so overall I think this is just a corollary to that.
>
> On Thu, Dec 18, 2025 at 10:08 AM Peeyush Gupta
> <[email protected]> wrote:
> >
> > Sounds like a good idea to me but needs more clarification.
> >
> > 1. How to find out if the output of the LLM tool is part of its
> >    training data.
> > 2. I use LLM-based tools for almost all patches for code completion.
> >    Sometimes the code completion could be just a few words.
> >    Do we need to include “Generated-by” in such cases as well? If yes,
> >    won’t it make almost all commits have this field set?
> >
> > From: [email protected] <[email protected]>
> > Date: Thursday, December 18, 2025 at 10:01 AM
> > To: [email protected] <[email protected]>
> > Subject: Re: [DISCUSS] Adding information about AI usage into commit
> > messages
> >
> > I agree. Good idea.
> >
> > On Thu, Dec 18, 2025 at 8:20 AM Mike Carey <[email protected]> wrote:
> >
> > > Sounds like something we kinda need to do - brave new world...
> > >
> > > On 12/17/25 11:10 AM, Ian Maxon wrote:
> > > > Hey folks,
> > > >
> > > > I wanted to propose an addition to the usual commit message header
> > > > that we use today, which looks like this:
> > > >
> > > >> [ASTERIXDB-$ISSUE][$AREA] $COMMIT_SUMMARY
> > > >>
> > > >> - user model changes: yes/no
> > > >> - storage format changes: yes/no
> > > >> - interface changes: yes/no
> > > >>
> > > >> Details:
> > > > I think that we should add a field called "generatively assisted" and
> > > > if it is yes, there should be a footer in the commit message called
> > > > "Generated-by:" that lists the tool(s) used. We should also check
> > > > that this tool's output isn't restricted in some way that would be
> > > > incompatible with the guidance in
> > > > https://www.apache.org/legal/generative-tooling.html. I think
> > > > generally there aren't many tools out there that would run against
> > > > this. The main thing to be aware of is if it's regurgitating code
> > > > that's clearly part of the training data (and that code doesn't have a
> > > > clear and compatible license) or if the tool itself somehow says the
> > > > code it outputs is not yours and can't be licensed as you wish. The
> > > > idea about the footer itself isn't mine, it's from that document. It
> > > > seems like a fine one to me.
> > > >
> > > > Thoughts?
> > > >
> > > > -Ian
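
For illustration only, here is a rough sketch of how the proposed field and footer might be checked together, for example from a commit-msg hook. Nothing in the thread specifies an implementation. The exact field spelling ("- generatively assisted: yes/no", mirroring the existing yes/no fields), the function name check_commit_message, the issue number ASTERIXDB-0000, and the tool name ExampleTool are all assumptions made for this example.

```python
import re

# Assumed spelling of the new field, modeled on "- interface changes: yes/no".
FIELD_RE = re.compile(
    r"^- generatively assisted:[ \t]*(yes|no)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Footer named in the proposal; requiring a non-empty tool list is an assumption.
FOOTER_RE = re.compile(
    r"^Generated-by:[ \t]*(\S.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def check_commit_message(message: str) -> list[str]:
    """Return a list of problems; an empty list means the message passes."""
    field = FIELD_RE.search(message)
    footer = FOOTER_RE.search(message)
    if field is None:
        return ['missing "- generatively assisted: yes/no" field']

    problems = []
    assisted = field.group(1).lower() == "yes"
    if assisted and footer is None:
        problems.append('field is "yes" but there is no "Generated-by:" footer')
    if not assisted and footer is not None:
        problems.append('"Generated-by:" footer present but field is "no"')
    return problems


# Hypothetical commit message using the existing header plus the new field.
example = """[ASTERIXDB-0000][DOC] Example summary

- user model changes: no
- storage format changes: no
- interface changes: no
- generatively assisted: yes

Details:
Example body.

Generated-by: ExampleTool
"""

print(check_commit_message(example))  # []
```

A check like this only verifies that the field and footer agree with each other. Whether a given completion is substantive enough to mention remains a judgement call for the committer, as discussed above.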
