Hi Tian,

Haha, point taken! Trust me, after contributing to Apache Spark since 2014, the last thing I’m looking for here is "promotion." If I wanted that, I’d be heading to Y Combinator, not the dev list!

Jokes aside, I understand why it might have come across that way. I open-sourced the aiv-integrity-gate logic simply to provide a working prototype so we could discuss a concrete technical solution rather than a theoretical one.

*To address your concerns:*

- Community Ownership: My proposal is to bring this logic directly into the Spark dev/ directory. It will be a native Spark utility, owned and maintained by the community, exactly like structured_logging_style.py.
- Maintenance: By using standard libraries like tree-sitter-scala, we minimize the "heavy lifting" of parsing. The only logic we would maintain is the set of Spark-specific architectural rules, which I believe is a worthwhile trade-off to save committers from manual "slop" reviews.
- The Data: You are 100% right that we need data. As outlined in the SIP, I am preparing an Offline Baseline run against 100+ historical Spark PRs. I will share those results with the list so we can see the actual false-positive rate before moving forward with any CI integration.

I’m looking to solve a problem I see growing in the ecosystem, and I’d value your help in tuning the prototype to make sure it’s a "win" for our maintainers.

Best regards,
Viquar Khan

On Wed, 18 Mar 2026 at 03:25, vaquar khan <[email protected]> wrote:

> Hi Jungtaek,
>
> Thank you for these points. Your concern regarding *accuracy and reviewer overhead* is perhaps the most impactful feedback I’ve received so far. I completely agree: if an automated tool has a high false-positive rate, it creates a "validation tax" that makes a reviewer's job harder, not easier.
>
> Because your questions get to the heart of the proposal’s viability, I have specifically documented the answers and data regarding accuracy and your "validate before integrate" suggestion directly in the SIP: *[Link to SIP: PR Quality & AI-Generated Content Policy]*.
>
> To summarize the strategy I've outlined there to address your concerns:
>
> 1. *The "Linter" Strategy:* We are not using subjective "guesses" to identify AI. We are looking for objective metadata violations, such as missing JIRA IDs, ignored PR templates, and specific automated signatures. These are "binary" failures with a near-zero false-positive rate, much like a code linter.
> 2. *Shadow Mode (Validation without Integration):* To your point about figuring out the value first, I propose we run this logic in *Shadow Mode*. It would run as a non-blocking background process to collect accurate data on Spark PRs for a set period (e.g., 4 weeks). This allows us to prove the value and measure the false-positive rate without adding a single second of overhead to your current review process.
> 3. *Proactive vs. Reactive:* While testing on other projects is possible, Spark’s unique standards mean we need Spark-specific data. This proactive approach ensures we have the tools ready before the volume of "AI slop" becomes a crisis.
>
> I’ve made sure the SIP now reflects that the goal of this tool is to act as a *shield* for committers, not a new hurdle. I’d value your thoughts on the "Shadow Mode" data collection as a way to provide the proof of accuracy you’re looking for.
>
> Please read the details in the SIP doc with your name.
>
> Best regards,
>
> Viquar Khan
>
> On Wed, 18 Mar 2026 at 03:17, vaquar khan <[email protected]> wrote:
>
>> Hi Holden,
>>
>> I appreciate the perspective on keeping a human in the loop. However, relying on "massive examples" as a lagging indicator means we only act once maintainers are already overwhelmed. Data across the ecosystem shows that the transition from a manageable queue to an unmanageable flood happens rapidly; if Spark is not heavily impacted today, the trajectory of sibling projects suggests we will be within 6 months.
>>
>> The "human in the loop" approach is already costing us time.
>> We are seeing drive-by AI contributions that bypass our soft controls and require manual intervention to close. For example:
>>
>> - *Large-Scale Noise:* PR #52218 <https://github.com/apache/spark/pull/52218> introduced 1,151 lines of a RabbitMQ connector explicitly marked as "Generated-by: ChatGPT-5," lacking tests and ignoring architectural standards.
>> - *Duplicate Overhead:* PR #54810 <https://github.com/apache/spark/pull/54810> and PR #54717 <https://github.com/apache/spark/pull/54717> are concrete instances of AI-driven duplicate PRs for the same JIRA ticket, showing a lack of context awareness.
>> - *Template Evasion:* PR #54150 <https://github.com/apache/spark/pull/54150> and PR #50400 <https://github.com/apache/spark/pull/50400> completely ignored JIRA IDs and PR templates without disclosing AI usage. This proves the voluntary checkbox is an unreliable metric for the true volume of AI code entering the repo.
>>
>> It is important to distinguish this "AI slop" from high-quality, productive AI use. As I mentioned, PR #54300 <https://github.com/apache/spark/pull/54300> from *Dongjoon Hyun* (using Gemini 3 Pro on Antigravity) is a perfect example of how AI should be used: with PMC-level oversight and intent.
>>
>> I have documented these emerging patterns in the SIP. If we look at the data, it is clear we are moving toward the same crisis seen in other projects. This proposal is a *proactive approach* to protect our committers’ bandwidth before the flood arrives, rather than a *reactive* one that forces us to scramble once the review queue is already broken.
>>
>> If a full "auto-close" feels too aggressive right now, could we at least implement *automated labeling* based on these SIP patterns to reduce "discovery time" for the PMC?
>>
>> Regards,
>>
>> Viquar Khan
>>
>> On Wed, 18 Mar 2026 at 03:08, vaquar khan <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> Thank you all for taking the time to review and respond to my email, especially on what I know is a busy Monday.
>>>
>>> Before diving into the specifics, I want to share a bit of my background. I am an AI developer building various AI products, which gives me a clear perspective on both its pros and cons. I am a strong advocate for using AI and rely on it heavily in my day-to-day life.
>>>
>>> On that note, I was happy to see our PMC member, Dongjoon Hyun, who requested evidence, is also actively utilizing AI. Specifically, PR #54300 uses "Gemini 3 Pro (High) on Antigravity" (GitHub Link <https://github.com/apache/spark/pull/54300>). I want to emphasize that this is perfectly acceptable; it is a great example of productive AI use rather than "AI slop."
>>>
>>> *Because there are many questions to cover, I won't overwhelm you by answering them all in a single thread. Instead, I will send multiple follow-up emails to ensure I address each point thoroughly. For a few of the more complex questions, the answers were quite long, so I have documented them directly in the SIP.*
>>>
>>> Thanks again for your time and feedback.
>>>
>>> Regards,
>>>
>>> Viquar Khan
>>>
>>> On Tue, 17 Mar 2026 at 18:10, Jungtaek Lim <[email protected]> wrote:
>>>
>>>> Personally I would love to ask Vaquar to run the idea against OSS projects and figure out the value, rather than trying to integrate first and validate. I do not see a limitation to run the idea without actual integration - the only issue is the cost, but I hope he can get some help from his employer if this is ever useful.
>>>> While it will take multiple months to collect the useful info from Apache Spark, it shouldn't need multiple months if it's expanded to so many OSS projects and it will be much more useful than trying to frame that Apache Spark project would need this.
>>>>
>>>> On Wed, Mar 18, 2026 at 7:32 AM Holden Karau <[email protected]> wrote:
>>>>
>>>>> I think for now we should probably avoid adding automated closing of possible AI PRs, I think we are not as badly impacted (knock on wood) as some projects and having a human in the loop for closing is reasonable. If we start getting a bunch of seemingly openclaw generated PRs then we can revisit this.
>>>>>
>>>>> On Tue, Mar 17, 2026 at 3:07 PM Jungtaek Lim <[email protected]> wrote:
>>>>>
>>>>>> Maybe my biggest worry for this kind of attempt is the accuracy. If this gives false positives, this will just add overhead on the review phase pushing the reviewer to check the validation manually, which is "additional" overhead. I wouldn't be happy with it if I get another phase in addition to the current review process.
>>>>>>
>>>>>> We get AI slop exactly because of the accuracy. How is this battle tested? Do you have a proof of the accuracy? Linter failures are almost obvious and there are really rare false positives (at least I haven't seen it), so I don't bother with linter checking. I would bother with an additional process if that does not guarantee (or at least has a sense of) the accuracy.
>>>>>>
>>>>>> On Wed, Mar 18, 2026 at 6:23 AM vaquar khan <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> Nowadays a really hot topic in all Apache Projects is AI and I wanted to kick off a discussion around a new SPIP I've been putting together.
>>>>>>> With the sheer volume of contributions we handle, relying entirely on PR templates and manual review to filter out AI-generated slop is just burning out maintainers. We've seen other projects like curl and Airflow get completely hammered by this stuff lately, and I think we need a hard technical defense.
>>>>>>>
>>>>>>> I'm proposing the Automated Integrity Validation (AIV) Gate. Basically, it's a local CI job that parses the AST of a PR (using Python, jAST, and tree-sitter-scala) to catch submissions that are mostly empty scaffolding or violate our specific design rules (like missing .stop() calls or using Await.result).
>>>>>>>
>>>>>>> To keep our pipeline completely secure from CI supply chain attacks, this runs 100% locally in our dev/ directory, with zero external API calls. If the tooling ever messes up or a committer needs to force a hotfix, you can just bypass it instantly with a GPG-signed commit containing '/aiv skip'.
>>>>>>>
>>>>>>> I think the safest way to roll this out without disrupting anyone's workflow is starting it in a non-blocking "Shadow Mode" just to gather data and tune the thresholds.
>>>>>>>
>>>>>>> I've attached the full SPIP draft below which dives into all the technical weeds, the rollout plan, and a FAQ. Would love to hear your thoughts!
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1-PCSq0PT_B45MbXVxkJ_E3GUHvK-8VV6WxQjKSGEh9o/edit?tab=t.0#heading=h.e8ahm4jtqclh
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Viquar Khan
>>>>>>> *Linkedin* - https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>> *Book* - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
>>>>>>> *GitBook* - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
>>>>>>> *Stack* - https://stackoverflow.com/users/4812170/vaquar-khan
>>>>>>> *github* - https://github.com/vaquarkhan/aiv-integrity-gate
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>> Pronouns: she/her
>>>>>
>>>>
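
For readers following the thread, the "linter"-style metadata checks and the non-blocking "Shadow Mode" discussed above could look roughly like the sketch below. It is an illustration only, not the aiv-integrity-gate code and not anything taken from the SPIP. The title pattern, the list of template headings, the signal names, and the `check_pr` / `exit_code` helpers are all assumptions made for this example.

```python
import re
from dataclasses import dataclass, field

# Assumption: PR titles are expected to start with a JIRA tag such as
# "[SPARK-12345]" (or "[MINOR]" for trivial changes).
JIRA_TAG = re.compile(r"^\[(SPARK-\d+|MINOR)\]")

# Assumption: an illustrative subset of headings from the PR template.
REQUIRED_HEADINGS = (
    "### What changes were proposed in this pull request?",
    "### Why are the changes needed?",
    "### How was this patch tested?",
)

# Assumption: an explicit trailer like the "Generated-by: ..." example
# cited in the thread.
GENERATED_BY = re.compile(r"^Generated-by:\s*\S+", re.IGNORECASE | re.MULTILINE)


@dataclass
class Report:
    """Collects the names of the signals that fired for one PR."""
    signals: list = field(default_factory=list)

    @property
    def clean(self):
        return not self.signals


def check_pr(title, body):
    """Run the three binary metadata checks on a PR title and description."""
    report = Report()
    if not JIRA_TAG.match(title.strip()):
        report.signals.append("missing-jira-id")
    if any(heading not in body for heading in REQUIRED_HEADINGS):
        report.signals.append("template-ignored")
    if GENERATED_BY.search(body):
        report.signals.append("generated-by-signature")
    return report


def exit_code(report, shadow_mode=True):
    """In shadow mode, signals are only recorded and the job never fails."""
    if shadow_mode or report.clean:
        return 0
    return 1
```

Each check is a plain string or regex match, which is what keeps the outcome binary. Whether the false-positive rate is actually near zero is the kind of thing an offline run against historical PRs would have to show.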
