Re: SPIP: Automated Integrity Validation (AIV) Gate for Apache Spark

vaquar khan Wed, 18 Mar 2026 01:47:25 -0700

Hi Tian,

Haha, point taken! Trust me, after contributing to Apache Spark since 2014,
the last thing I’m looking for here is "promotion." If I wanted that, I’d
be heading to Y Combinator, not the dev list!


Jokes aside, I understand why it might have come across that way. I
open-sourced the aiv-integrity-gate logic simply to provide a working
prototype so we could discuss a concrete technical solution rather than a
theoretical one.


*To address your concerns:*
- Community Ownership: My proposal is to bring this logic directly into the
Spark dev/ directory. It will be a native Spark utility, owned and
maintained by the community, exactly like structured_logging_style.py.

- Maintenance: By using off-the-shelf libraries like tree-sitter-scala, we
minimize the "heavy lifting" of parsing. The only logic we would maintain is
the Spark-specific architectural rules, which I believe is a
worthwhile trade-off to save committers from manual "slop" reviews.

- The Data: You are 100% right that we need data. As outlined in the SPIP, I
am preparing an Offline Baseline run against 100+ historical Spark PRs. I
will share those results with the list so we can see the actual
false-positive rate before moving forward with any CI integration.
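
To make "false-positive rate" concrete, the figure from such a baseline
could be computed along the following lines. This is only a minimal Python
sketch, not the prototype's code: the BaselineRecord type, its field names,
and the sample values are hypothetical, and it assumes each historical PR
has been labeled by a human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineRecord:
    """One historical PR in a hypothetical offline baseline."""
    pr_number: int
    flagged_by_gate: bool    # verdict reported by the gate (assumed input)
    is_actually_slop: bool   # label assigned by a human reviewer (assumed input)


def false_positive_rate(records):
    """Return FP / (FP + TN): the share of acceptable PRs the gate flagged."""
    acceptable = [r for r in records if not r.is_actually_slop]
    if not acceptable:
        raise ValueError("baseline contains no acceptable PRs")
    false_positives = sum(1 for r in acceptable if r.flagged_by_gate)
    return false_positives / len(acceptable)


# Placeholder records for illustration only; not real Spark PRs or results.
sample = [
    BaselineRecord(1, flagged_by_gate=True, is_actually_slop=True),
    BaselineRecord(2, flagged_by_gate=False, is_actually_slop=False),
    BaselineRecord(3, flagged_by_gate=True, is_actually_slop=False),
    BaselineRecord(4, flagged_by_gate=False, is_actually_slop=False),
    BaselineRecord(5, flagged_by_gate=False, is_actually_slop=True),
]
rate = false_positive_rate(sample)
print(f"false-positive rate: {rate:.2%}")
```

The denominator counts only PRs a human judged acceptable, so the number
reflects how often a legitimate contribution would be flagged, which is the
overhead reviewers are worried about.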

I’m looking to solve a problem I see growing in the ecosystem, and I’d
value your help in tuning the prototype to make sure it’s a "win" for our
maintainers.

Best regards,
Viquar Khan

On Wed, 18 Mar 2026 at 03:25, vaquar khan <[email protected]> wrote:

> Hi Jungtaek,
>
> Thank you for these points. Your concern regarding *accuracy and reviewer
> overhead* is perhaps the most impactful feedback I’ve received so far. I
> completely agree: if an automated tool has a high false-positive rate, it
> creates a "validation tax" that makes a reviewer's job harder, not easier.
>
> Because your questions get to the heart of the proposal’s viability, I
> have documented the answers and data regarding accuracy and
> your "validate before integrate" suggestion directly in the SPIP: *[Link
> to SIP: PR Quality & AI-Generated Content Policy]*.
>
> To summarize the strategy I've outlined there to address your concerns:
>
>    1. *The "Linter" Strategy:* We are not using subjective "guesses" to
>       identify AI. We are looking for objective metadata violations, such as
>       missing JIRA IDs, ignored PR templates, and specific automated signatures.
>       These are "binary" failures with a near-zero false-positive rate, much like
>       a code linter.
>    2. *Shadow Mode (Validation without Integration):* To your point about
>       figuring out the value first, I propose we run this logic in *Shadow
>       Mode*. It would run as a non-blocking background process to collect
>       accurate data on Spark PRs for a set period (e.g., 4 weeks). This allows us
>       to prove the value and measure the false-positive rate without adding a
>       single second of overhead to your current review process.
>    3. *Proactive vs. Reactive:* While testing on other projects is possible,
>       Spark’s unique standards mean we need Spark-specific data. This proactive
>       approach ensures we have the tools ready before the volume of "AI slop"
>       becomes a crisis.
>
> I’ve made sure the SPIP now reflects that the goal of this tool is to act
> as a *shield* for committers, not a new hurdle. I’d value your thoughts
> on the "Shadow Mode" data collection as a way to provide the proof of
> accuracy you’re looking for.
>
> Please see the section of the SPIP doc addressed to you by name.
>
> Best regards,
>
> Viquar Khan
>
> On Wed, 18 Mar 2026 at 03:17, vaquar khan <[email protected]> wrote:
>
>> Hi Holden,
>>
>> I appreciate the perspective on keeping a human in the loop. However,
>> relying on "massive examples" as a lagging indicator means we only act once
>> maintainers are already overwhelmed. Data across the ecosystem shows that
>> the transition from a manageable queue to an unmanageable flood happens
>> rapidly; if Spark is not heavily impacted today, the trajectory of sibling
>> projects suggests we will be within 6 months.
>>
>> The "human in the loop" approach is already costing us time. We are
>> seeing drive-by AI contributions that bypass our soft controls and require
>> manual intervention to close. For example:
>>
>>    - *Large-Scale Noise:* PR #52218
>>      <https://github.com/apache/spark/pull/52218> introduced 1,151 lines
>>      of a RabbitMQ connector explicitly marked as "Generated-by: ChatGPT-5,"
>>      lacking tests and ignoring architectural standards.
>>    - *Duplicate Overhead:* PR #54810
>>      <https://github.com/apache/spark/pull/54810> and PR #54717
>>      <https://github.com/apache/spark/pull/54717> are concrete instances
>>      of AI-driven duplicate PRs for the same JIRA ticket, showing a lack of
>>      context awareness.
>>    - *Template Evasion:* PR #54150
>>      <https://github.com/apache/spark/pull/54150> and PR #50400
>>      <https://github.com/apache/spark/pull/50400> completely ignored JIRA
>>      IDs and PR templates without disclosing AI usage. This proves the
>>      voluntary checkbox is an unreliable metric for the true volume of AI
>>      code entering the repo.
>>
>> It is important to distinguish this "AI slop" from high-quality,
>> productive AI use. As I mentioned, PR #54300
>> <https://github.com/apache/spark/pull/54300> from *Dongjoon Hyun* (using
>> Gemini 3 Pro on Antigravity) is a perfect example of how AI should be
>> used—with PMC-level oversight and intent.
>>
>> I have documented these emerging patterns in the SPIP. If we look at the
>> data, it is clear we are moving toward the same crisis seen in other
>> projects. This proposal is a *proactive approach* to protect our
>> committers’ bandwidth before the flood arrives, rather than a *reactive*
>> one that forces us to scramble once the review queue is already broken.
>>
>> If a full "auto-close" feels too aggressive right now, could we at least
>> implement *automated labeling* based on these SPIP patterns to reduce
>> "discovery time" for the PMC?
>>
>> Regards,
>>
>> Viquar Khan
>>
>> On Wed, 18 Mar 2026 at 03:08, vaquar khan <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> Thank you all for taking the time to review and respond to my email,
>>> especially on what I know is a busy Monday.
>>>
>>> Before diving into the specifics, I want to share a bit of my
>>> background. I am an AI developer building various AI products, which gives
>>> me a clear perspective on both its pros and cons. I am a strong advocate
>>> for using AI and rely on it heavily in my day-to-day life.
>>>
>>> On that note, I was happy to see our PMC member, Dongjoon Hyun—who
>>> requested evidence—is also actively utilizing AI. Specifically, PR #54300
>>> uses "Gemini 3 Pro (High) on Antigravity" (GitHub Link
>>> <https://github.com/apache/spark/pull/54300>). I want to emphasize that
>>> this is perfectly acceptable; it is a great example of productive AI use
>>> rather than "AI slop."
>>>
>>> *Because there are many questions to cover, I won't overwhelm you by
>>> answering them all in a single thread. Instead, I will send multiple
>>> follow-up emails to ensure I address each point thoroughly. For a few of
>>> the more complex questions, the answers were quite long, so I have
>>> documented them directly in the SPIP.*
>>>
>>> Thanks again for your time and feedback.
>>>
>>> Regards,
>>>
>>> Viquar Khan
>>>
>>> On Tue, 17 Mar 2026 at 18:10, Jungtaek Lim <[email protected]>
>>> wrote:
>>>
>>>> Personally I would love to ask Vaquar to run the idea against OSS
>>>> projects and figure out the value, rather than trying to integrate first
>>>> and validate. I do not see a limitation to run the idea without actual
>>>> integration - the only issue is the cost, but I hope he can get some help
>>>> from his employer if this is ever useful. While it will take multiple
>>>> months to collect the useful info from Apache Spark, it shouldn't need
>>>> multiple months if it's expanded to so many OSS projects and it will be
>>>> much more useful than trying to frame that Apache Spark project would need
>>>> this.
>>>>
>>>> On Wed, Mar 18, 2026 at 7:32 AM Holden Karau <[email protected]>
>>>> wrote:
>>>>
>>>>> I think for now we should probably avoid adding automated closing of
>>>>> possible AI PRs, I think we are not as badly impacted (knock on wood) as
>>>>> some projects and having a human in the loop for closing is reasonable. If
>>>>> we start getting a bunch of seemingly openclaw generated PRs then we can
>>>>> revisit this.
>>>>>
>>>>> On Tue, Mar 17, 2026 at 3:07 PM Jungtaek Lim <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Maybe my biggest worry for this kind of attempt is the accuracy. If
>>>>>> this gives false positives, this will just add overhead on the review
>>>>>> phase pushing the reviewer to check the validation manually, which is
>>>>>> "additional" overhead. I wouldn't be happy with it if I get another phase
>>>>>> in addition to the current review process.
>>>>>>
>>>>>> We get AI slop exactly because of the accuracy. How is this battle
>>>>>> tested? Do you have a proof of the accuracy? Linter failures are almost
>>>>>> obvious and there are really rare false positives (at least I haven't
>>>>>> seen it), so I don't bother with linter checking. I would bother with an
>>>>>> additional process if that does not guarantee (or at least has a sense
>>>>>> of) the accuracy.
>>>>>>
>>>>>> On Wed, Mar 18, 2026 at 6:23 AM vaquar khan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> Nowadays a really hot topic in all Apache projects is AI, and I
>>>>>>> wanted to kick off a discussion around a new SPIP I've been putting
>>>>>>> together. With the sheer volume of contributions we handle, relying
>>>>>>> entirely on PR templates and manual review to filter out AI-generated
>>>>>>> slop is just burning out maintainers. We've seen other projects like curl
>>>>>>> and Airflow get completely hammered by this stuff lately, and I think we
>>>>>>> need a hard technical defense.
>>>>>>>
>>>>>>> I'm proposing the Automated Integrity Validation (AIV) Gate.
>>>>>>> Basically, it's a local CI job that parses the AST of a PR (using
>>>>>>> Python, jAST, and tree-sitter-scala) to catch submissions that are mostly
>>>>>>> empty scaffolding or violate our specific design rules (like missing
>>>>>>> .stop() calls or using Await.result).
>>>>>>>
>>>>>>> To keep our pipeline completely secure from CI supply chain attacks,
>>>>>>> this runs 100% locally in our dev/ directory, with zero external API
>>>>>>> calls. If the tooling ever messes up or a committer needs to force a
>>>>>>> hotfix, you can just bypass it instantly with a GPG-signed commit
>>>>>>> containing '/aiv skip'.
>>>>>>>
>>>>>>> I think the safest way to roll this out without disrupting anyone's
>>>>>>> workflow is starting it in a non-blocking "Shadow Mode" just to gather
>>>>>>> data and tune the thresholds.
>>>>>>>
>>>>>>> I've attached the full SPIP draft below which dives into all the
>>>>>>> technical weeds, the rollout plan, and a FAQ. Would love to hear your
>>>>>>> thoughts!
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1-PCSq0PT_B45MbXVxkJ_E3GUHvK-8VV6WxQjKSGEh9o/edit?tab=t.0#heading=h.e8ahm4jtqclh
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Viquar Khan
>>>>>>> *Linkedin *-https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>> *Book *-
>>>>>>> https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
>>>>>>> *GitBook*-
>>>>>>> https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
>>>>>>> *Stack *-https://stackoverflow.com/users/4812170/vaquar-khan
>>>>>>> *github*-https://github.com/vaquarkhan/aiv-integrity-gate
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>> Pronouns: she/her
>>>>>
>>>>
