Hi all, I want to share how Apache Airflow is handling this, since they're dealing with the same volume problem.
Rather than building detection for AI-generated PRs specifically, they've focused on raising the quality bar for all non-collaborator contributions and automating the enforcement. The discussion and tooling could provide inspiration:

https://lists.apache.org/thread/8tzwwwd7jmtmfo4j9pzg27704g10vpr4
https://github.com/apache/airflow/pull/62682

PRs from non-collaborators must pass all checks, follow the PR guidelines (LLM-verified), and include proper descriptions and tests before any maintainer looks at them. PRs that don't meet the bar get converted to drafts automatically, with actionable comments. A human triager reviews the automated output, but the responsibility sits entirely with the author. I don't see it as all that different from the goals of this SPIP.

Their Gen-AI disclosure policy layers on top of this:
https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions

Could be a useful model as the community weighs what levels of enforcement are available.

-- LNC

On Wed, Mar 18, 2026, 1:48 AM Jungtaek Lim <[email protected]> wrote:

> Hi Vaquar,
>
> I do not see value in coupling this with Apache Spark. If this is useful for Apache Spark, why would it be useful only for Apache Spark? It shouldn't be too hard for you to run the prototype against existing/new PRs across various OSS projects. The Apache Spark project is too restricted to prove your project on, because nowadays our code contributors are a largely fixed group - we are not running a new and shiny project trying to gain traction from random contributors. I don't feel we should take the shadow-mode approach while it is not really necessary. There is an existing way to prove the value: go with a faster loop on your own project first.
>
> There is no actual relation between this and Apache Spark "from the product point of view". You will be more successful once you prove the value with the project.
> Please incubate properly and in the right direction.
>
> On Wed, Mar 18, 2026 at 5:26 PM vaquar khan <[email protected]> wrote:
>
>> Hi Jungtaek,
>>
>> Thank you for these points. Your concern regarding *accuracy and reviewer overhead* is perhaps the most impactful feedback I’ve received so far. I completely agree: if an automated tool has a high false-positive rate, it creates a "validation tax" that makes a reviewer's job harder, not easier.
>>
>> Because your questions get to the heart of the proposal’s viability, I have documented the answers and data regarding accuracy, along with your "validate before integrate" suggestion, directly in the SIP: *[Link to SIP: PR Quality & AI-Generated Content Policy]*.
>>
>> To summarize the strategy I've outlined there to address your concerns:
>>
>> 1. *The "Linter" Strategy:* We are not using subjective "guesses" to identify AI. We are looking for objective metadata violations, such as missing JIRA IDs, ignored PR templates, and specific automated signatures. These are "binary" failures with a near-zero false-positive rate, much like a code linter.
>>
>> 2. *Shadow Mode (Validation without Integration):* To your point about figuring out the value first, I propose we run this logic in *Shadow Mode*. It would run as a non-blocking background process, collecting accurate data on Spark PRs for a set period (e.g., 4 weeks). This lets us prove the value and measure the false-positive rate without adding a single second of overhead to your current review process.
>>
>> 3. *Proactive vs. Reactive:* While testing on other projects is possible, Spark’s unique standards mean we need Spark-specific data. This proactive approach ensures we have the tools ready before the volume of "AI slop" becomes a crisis.
>>
>> I’ve made sure the SIP now reflects that the goal of this tool is to act as a *shield* for committers, not a new hurdle.
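To make the "Linter" strategy in point 1 above concrete, here is a minimal sketch of such a binary metadata check. This is an illustration only: the `[SPARK-XXXXX]` title convention and the template headings are assumptions about Spark's PR template, and the function name is made up, not taken from the SIP.

```python
import re

# Sketch of a "binary" metadata check, as described in point 1 above.
# Assumptions: Spark PR titles start with "[SPARK-NNNNN]" and the PR
# template contains the section headings below. Names are illustrative.

JIRA_RE = re.compile(r"^\[SPARK-\d+\]")
REQUIRED_SECTIONS = [
    "### What changes were proposed in this pull request?",
    "### Why are the changes needed?",
]

def metadata_violations(title: str, body: str) -> list[str]:
    """Return objective, linter-style failures; no subjective AI guessing."""
    problems = []
    if not JIRA_RE.match(title):
        problems.append("missing JIRA ID in title")
    for section in REQUIRED_SECTIONS:
        if section not in body:
            problems.append(f"PR template section missing: {section!r}")
    return problems
```

Each check either passes or fails deterministically, which is what would keep the false-positive rate near zero.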
>> I’d value your thoughts on the "Shadow Mode" data collection as a way to provide the proof of accuracy you’re looking for.
>>
>> Please read the details in the section of the SIP doc tagged with your name.
>>
>> Best regards,
>> Viquar Khan
>>
>> On Wed, 18 Mar 2026 at 03:17, vaquar khan <[email protected]> wrote:
>>
>>> Hi Holden,
>>>
>>> I appreciate the perspective on keeping a human in the loop. However, relying on "massive examples" as a lagging indicator means we only act once maintainers are already overwhelmed. Data across the ecosystem shows that the transition from a manageable queue to an unmanageable flood happens rapidly; if Spark is not heavily impacted today, the trajectory of sibling projects suggests we will be within 6 months.
>>>
>>> The "human in the loop" approach is already costing us time. We are seeing drive-by AI contributions that bypass our soft controls and require manual intervention to close. For example:
>>>
>>> - *Large-Scale Noise:* PR #52218 <https://github.com/apache/spark/pull/52218> introduced 1,151 lines of a RabbitMQ connector explicitly marked as "Generated-by: ChatGPT-5," lacking tests and ignoring architectural standards.
>>>
>>> - *Duplicate Overhead:* PR #54810 <https://github.com/apache/spark/pull/54810> and PR #54717 <https://github.com/apache/spark/pull/54717> are concrete instances of AI-driven duplicate PRs for the same JIRA ticket, showing a lack of context awareness.
>>>
>>> - *Template Evasion:* PR #54150 <https://github.com/apache/spark/pull/54150> and PR #50400 <https://github.com/apache/spark/pull/50400> completely ignored JIRA IDs and PR templates without disclosing AI usage. This shows the voluntary checkbox is an unreliable measure of the true volume of AI code entering the repo.
>>>
>>> It is important to distinguish this "AI slop" from high-quality, productive AI use.
>>> As I mentioned, PR #54300 <https://github.com/apache/spark/pull/54300> from *Dongjoon Hyun* (using Gemini 3 Pro on Antigravity) is a perfect example of how AI should be used: with PMC-level oversight and intent.
>>>
>>> I have documented these emerging patterns in the SIP. If we look at the data, it is clear we are moving toward the same crisis seen in other projects. This proposal is a *proactive approach* that protects our committers’ bandwidth before the flood arrives, rather than a *reactive* one that forces us to scramble once the review queue is already broken.
>>>
>>> If a full "auto-close" feels too aggressive right now, could we at least implement *automated labeling* based on these SIP patterns to reduce "discovery time" for the PMC?
>>>
>>> Regards,
>>> Viquar Khan
>>>
>>> On Wed, 18 Mar 2026 at 03:08, vaquar khan <[email protected]> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Thank you all for taking the time to review and respond to my email, especially on what I know is a busy Monday.
>>>>
>>>> Before diving into the specifics, I want to share a bit of my background. I am an AI developer building various AI products, which gives me a clear perspective on both the pros and cons. I am a strong advocate for using AI and rely on it heavily in my day-to-day work.
>>>>
>>>> On that note, I was happy to see that our PMC member Dongjoon Hyun, who requested evidence, is also actively using AI. Specifically, PR #54300 uses "Gemini 3 Pro (High) on Antigravity" (GitHub Link <https://github.com/apache/spark/pull/54300>). I want to emphasize that this is perfectly acceptable; it is a great example of productive AI use rather than "AI slop."
>>>>
>>>> *Because there are many questions to cover, I won't overwhelm you by answering them all in a single thread. Instead, I will send multiple follow-up emails to ensure I address each point thoroughly.
>>>> For a few of the more complex questions, the answers were quite long, so I have documented them directly in the SIP.*
>>>>
>>>> Thanks again for your time and feedback.
>>>>
>>>> Regards,
>>>> Viquar Khan
>>>>
>>>> On Tue, 17 Mar 2026 at 18:10, Jungtaek Lim <[email protected]> wrote:
>>>>
>>>>> Personally, I would love to ask Vaquar to run the idea against OSS projects and figure out the value, rather than trying to integrate first and validate later. I don't see any limitation on running the idea without actual integration - the only issue is the cost, but I hope he can get some help from his employer if this proves useful. While it would take multiple months to collect useful information from Apache Spark alone, it shouldn't take that long if the experiment is expanded to many OSS projects, and that would be much more useful than trying to frame this as something the Apache Spark project needs.
>>>>>
>>>>> On Wed, Mar 18, 2026 at 7:32 AM Holden Karau <[email protected]> wrote:
>>>>>
>>>>>> I think for now we should probably avoid adding automated closing of possible AI PRs. I think we are not as badly impacted (knock on wood) as some projects, and having a human in the loop for closing is reasonable. If we start getting a bunch of seemingly openclaw-generated PRs then we can revisit this.
>>>>>>
>>>>>> On Tue, Mar 17, 2026 at 3:07 PM Jungtaek Lim <[email protected]> wrote:
>>>>>>
>>>>>>> Maybe my biggest worry with this kind of attempt is the accuracy. If it gives false positives, it will just add overhead to the review phase, pushing the reviewer to check the validation manually, which is "additional" overhead. I wouldn't be happy to get another phase on top of the current review process.
>>>>>>>
>>>>>>> We get AI slop exactly because of the accuracy problem. How is this battle tested?
>>>>>>> Do you have proof of the accuracy? Linter failures are nearly unambiguous and false positives are really rare (at least I haven't seen one), so I don't mind linter checks. I would mind an additional process if it does not guarantee (or at least give a sense of) accuracy.
>>>>>>>
>>>>>>> On Wed, Mar 18, 2026 at 6:23 AM vaquar khan <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Team,
>>>>>>>>
>>>>>>>> AI is a really hot topic across all Apache projects right now, and I wanted to kick off a discussion around a new SPIP I've been putting together. With the sheer volume of contributions we handle, relying entirely on PR templates and manual review to filter out AI-generated slop is just burning out maintainers. We've seen other projects like curl and Airflow get completely hammered by this stuff lately, and I think we need a hard technical defense.
>>>>>>>>
>>>>>>>> I'm proposing the Automated Integrity Validation (AIV) Gate. Basically, it's a local CI job that parses the AST of a PR (using Python, jAST, and tree-sitter-scala) to catch submissions that are mostly empty scaffolding or that violate our specific design rules (like missing .stop() calls or using Await.result).
>>>>>>>>
>>>>>>>> To keep our pipeline completely secure from CI supply-chain attacks, this runs 100% locally in our dev/ directory; zero external API calls. If the tooling ever messes up or a committer needs to force a hotfix, you can bypass it instantly with a GPG-signed commit containing '/aiv skip'.
>>>>>>>>
>>>>>>>> I think the safest way to roll this out without disrupting anyone's workflow is to start it in a non-blocking "Shadow Mode", just to gather data and tune the thresholds.
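To illustrate the kind of local AST rule described above, here is a minimal sketch using Python's stdlib ast module (the Scala/Java side would go through tree-sitter-scala and jAST per the proposal). The heuristic and the function name are illustrative, not taken from the SPIP draft.

```python
import ast

# Sketch of a local AST design rule: flag names assigned from a
# *.getOrCreate() call (e.g. a SparkSession) that never have .stop()
# called on them. Heuristic and names are illustrative only.

def missing_stop_calls(source: str) -> list[str]:
    """Return variable names created via .getOrCreate() but never .stop()ed."""
    tree = ast.parse(source)
    created, stopped = set(), set()
    for node in ast.walk(tree):
        # x = SparkSession.builder...getOrCreate()
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Attribute)
                and node.value.func.attr == "getOrCreate"):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    created.add(target.id)
        # x.stop()
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "stop"
                and isinstance(node.func.value, ast.Name)):
            stopped.add(node.func.value.id)
    return sorted(created - stopped)
```

In Shadow Mode such a rule would only log its findings rather than block the PR.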
>>>>>>>> I've attached the full SPIP draft below, which dives into all the technical weeds, the rollout plan, and a FAQ. Would love to hear your thoughts!
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1-PCSq0PT_B45MbXVxkJ_E3GUHvK-8VV6WxQjKSGEh9o/edit?tab=t.0#heading=h.e8ahm4jtqclh
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>> Viquar Khan
>>>>>>>> *Linkedin* - https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>>> *Book* - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
>>>>>>>> *GitBook* - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
>>>>>>>> *Stack* - https://stackoverflow.com/users/4812170/vaquar-khan
>>>>>>>> *github* - https://github.com/vaquarkhan/aiv-integrity-gate
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>> Pronouns: she/her
