Good points, Peeyush. I can't think of a perfect answer. The best I can come up with is to leave it to the committer's discretion, based on some general guidelines.
For example, determining whether the output is simply a regurgitation of someone else's code is not easy. Sometimes, though, it is obvious: very specific and unrelated comments, or unrelated code whose method or variable names give it away. In those cases we can clearly say the code is not a synthesis of the prompt and the training data as a whole. Basically, you know it when you see it, and if you don't see it, then it's fine.

I think tab completion is another case where judgement applies. Tab completion is an old tool, and technically completions done by non-LLM methods could fall into the same traps by using snippets and macros. However, an obvious next line of code is not something most people would call an original or creative work, so in those cases I would judge it's not worth mentioning. If the tool is generating entire methods and classes, then it's probably worth mentioning, because that is more substantive output from the tool, not just an obvious addition.

Committers are already trusted to give proper attribution to the code they commit, so overall I think this is just a corollary to that.

On Thu, Dec 18, 2025 at 10:08 AM Peeyush Gupta <[email protected]> wrote:
>
> Sounds like a good idea to me, but it needs more clarification.
>
> 1. How do we find out if the output of the LLM tool is part of its
>    training data?
> 2. I use LLM-based tools for code completion in almost all patches.
>    Sometimes the completion is just a few words. Do we need to include
>    "Generated-by" in such cases as well? If yes, won't that make almost
>    all commits have this field set?
>
> From: [email protected] <[email protected]>
> Date: Thursday, December 18, 2025 at 10:01 AM
> To: [email protected] <[email protected]>
> Subject: Re: [DISCUSS] Adding information about AI usage into commit messages
>
> I agree. Good idea.
>
> On Thu, Dec 18, 2025 at 8:20 AM Mike Carey <[email protected]> wrote:
>
> > Sounds like something we kinda need to do - brave new world...
> >
> > On 12/17/25 11:10 AM, Ian Maxon wrote:
> > > Hey folks,
> > >
> > > I wanted to propose an addition to the usual commit message header
> > > that we use today, which looks like this:
> > >
> > >> [ASTERIXDB-$ISSUE][$AREA] $COMMIT_SUMMARY
> > >>
> > >> - user model changes: yes/no
> > >> - storage format changes: yes/no
> > >> - interface changes: yes/no
> > >>
> > >> Details:
> > > I think that we should add a field called "generatively assisted" and
> > > if it is yes, there should be a footer in the commit message called
> > > "Generated-by :" that lists the tool(s) used. We should also check
> > > that this tool's output isn't restricted in some way that would be
> > > incompatible with the guidance in
> > > https://www.apache.org/legal/generative-tooling.html. I think
> > > generally there aren't many tools out there that would run against
> > > this. The main thing to be aware of is if it's regurgitating code
> > > that's clearly part of the training data (and that code doesn't have a
> > > clear and compatible license) or if the tool itself somehow says the
> > > code it outputs is not yours and can't be licensed as you wish. The
> > > idea about the footer itself isn't mine, it's from that document. It
> > > seems like a fine one to me.
> > >
> > > Thoughts?
> > >
> > > -Ian
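
To make the proposed header field and footer concrete, here is a minimal sketch of a consistency check for a commit message. It is only an illustration under assumptions, not an agreed format: the field spelling "- generatively assisted: yes/no", the footer spelling "Generated-by:" (without the space before the colon), the function name check_commit_message, the issue number ASTERIXDB-0000, and the tool name ExampleTool are all hypothetical. It only checks that the field and the footer agree; it cannot tell whether a completion was substantive, which stays a judgement call for the committer.

```python
import re

# Assumed spellings of the proposed field and footer (hypothetical; the
# exact wording has not been settled in this thread).
FIELD_RE = re.compile(
    r"^- generatively assisted:[ \t]*(yes|no)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
FOOTER_RE = re.compile(r"^Generated-by:[ \t]*(\S.*)$", re.MULTILINE)

# Hypothetical commit message following the existing header layout plus
# the proposed field and footer.
EXAMPLE_MESSAGE = """\
[ASTERIXDB-0000][DOC] Example summary

- user model changes: no
- storage format changes: no
- interface changes: no
- generatively assisted: yes

Details:
Example body.

Generated-by: ExampleTool
"""


def check_commit_message(message: str) -> list[str]:
    """Return a list of problems; an empty list means the message is consistent."""
    problems = []
    field = FIELD_RE.search(message)
    footer = FOOTER_RE.search(message)
    if field is None:
        problems.append("missing '- generatively assisted: yes/no' field")
    elif field.group(1).lower() == "yes" and footer is None:
        problems.append("field is 'yes' but no 'Generated-by:' footer lists the tool(s)")
    elif field.group(1).lower() == "no" and footer is not None:
        problems.append("field is 'no' but a 'Generated-by:' footer is present")
    return problems


if __name__ == "__main__":
    print(check_commit_message(EXAMPLE_MESSAGE))
```

A check like this could run as a commit hook or in review tooling, but that choice is outside what has been proposed here.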
