This is an automated email from the ASF dual-hosted git repository.

wusheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/skywalking-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 7ab57b786c4 Add blog post: Agentic Vibe Coding in a Mature OSS Project
7ab57b786c4 is described below

commit 7ab57b786c403c3d174a05ec2fbbefe4a13dc2b0
Author: Wu Sheng <[email protected]>
AuthorDate: Sun Mar 8 23:40:18 2026 +0800

    Add blog post: Agentic Vibe Coding in a Mature OSS Project
---
 .../blog/2026-03-08-agentic-vibe-coding/index.md   | 81 ++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/content/blog/2026-03-08-agentic-vibe-coding/index.md 
b/content/blog/2026-03-08-agentic-vibe-coding/index.md
new file mode 100644
index 00000000000..cbf8f1e5bad
--- /dev/null
+++ b/content/blog/2026-03-08-agentic-vibe-coding/index.md
@@ -0,0 +1,81 @@
+---
+title: "Agentic Vibe Coding in a Mature OSS Project: What Worked, What Didn't"
+date: 2026-03-08
+author: Sheng Wu
+description: "What happens when you apply agentic AI coding to a mature 
open-source project with real users, real compatibility contracts, and real 
consequences? 77K lines changed in 5 weeks — here's what I learned."
+tags:
+- AI
+- Agentic Coding
+- Vibe Coding
+- TDD
+- Architecture
+- Engineering Productivity
+- Open Source
+endTime: 2026-03-08T23:00:00Z
+---
+
+Most "vibe coding" stories start with a greenfield project. This one doesn't.
+
+Apache SkyWalking is a 9-year-old observability platform with hundreds of 
production deployments, a complex DSL stack, and an external API surface that 
users have built dashboards, alerting rules, and automation scripts against. 
When I decided to replace the core scripting engine — purging the Groovy 
runtime from four DSL compilers — the constraint wasn't "can AI write the 
code?" It was: "can AI write the code without breaking anything for existing 
users?"
+
+The answer turned out to be yes — **~77,000 lines changed across 10 major PRs 
in about 5 weeks** — but only because the AI was tightly guided by a human who 
understood the project's architecture, its compatibility contracts, and its 
users. This post is about the methodology: what worked, what didn't, and what 
mature open-source maintainers should know before handing their codebase to AI 
agents.
+
+## The Project in Brief
+
+The task was to replace SkyWalking's Groovy-based scripting engines (MAL, LAL, 
Hierarchy) with a unified ANTLR4 + Javassist bytecode compilation pipeline, 
matching the architecture already proven by the OAL compiler. The internal tech 
stack was completely overhauled; the external interface had to remain identical.
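+
+Since the pipeline is the heart of the change, it helps to see the shape of 
"compile once, evaluate many times." Javassist emits real JVM bytecode at 
runtime; the stdlib-only sketch below keeps the same idea using a composed 
function instead of bytecode. Every name in it (`ScaleExprCompiler`, the toy 
operator syntax) is illustrative, not SkyWalking's actual API:

```java
import java.util.function.DoubleUnaryOperator;

// Illustrative only: a toy "compiler" that turns a scale expression such as
// "* 2 + 3" into a reusable function, instead of re-parsing it per data point.
public class ScaleExprCompiler {

    // Parse once, return an executable object -- the same shape as compiling
    // a DSL expression to bytecode, minus the bytecode.
    public static DoubleUnaryOperator compile(String expr) {
        DoubleUnaryOperator fn = DoubleUnaryOperator.identity();
        String[] tokens = expr.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i += 2) {
            String op = tokens[i];
            double operand = Double.parseDouble(tokens[i + 1]);
            DoubleUnaryOperator step = switch (op) {
                case "*" -> v -> v * operand;
                case "+" -> v -> v + operand;
                case "-" -> v -> v - operand;
                default -> throw new IllegalArgumentException("unknown op: " + op);
            };
            fn = fn.andThen(step);
        }
        return fn;
    }

    public static void main(String[] args) {
        DoubleUnaryOperator scale = compile("* 2 + 3"); // parsed once
        System.out.println(scale.applyAsDouble(10));    // reused per sample
        System.out.println(scale.applyAsDouble(0.5));
    }
}
```

+The payoff mirrors the real pipeline: the parsing cost is paid once at 
compile time, and the hot path only executes the prepared function.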
+
+Beyond the compiler rewrites, the scope included a new queue infrastructure 
(threads dropped from 36 to 15), virtual thread support for JDK 25+, and E2E 
test modernization. By conventional estimates, this was 5-8 months of senior 
engineer work.
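+
+The thread reduction rests on a simple property of virtual threads (standard 
since JDK 21): a consumer blocked on a queue parks cheaply, releasing its 
carrier platform thread instead of pinning it. A generic sketch of that 
pattern, not SkyWalking's actual queue code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: with virtual threads it is cheap to give every queue
// its own dedicated blocking consumer, because a parked virtual thread does
// not occupy an OS thread while it waits.
public class VirtualQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(64);
        AtomicInteger sum = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(10);

        // One virtual-thread consumer; the blocking take() parks cheaply.
        Thread consumer = Thread.ofVirtual().name("queue-consumer").start(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    sum.addAndGet(queue.take());
                    done.countDown();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        for (int i = 1; i <= 10; i++) {
            queue.put(i);
        }
        done.await();
        System.out.println("sum=" + sum.get() + " virtual=" + consumer.isVirtual());
    }
}
```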
+
+For the full technical details on the compiler architecture, see the [Groovy 
elimination discussion](https://github.com/apache/skywalking/discussions/13716).
+
+## What is Agentic Vibe Coding?
+
+"Vibe coding" — a term coined by Andrej Karpathy — describes a style of 
programming where you describe intent and let AI write the code. It's powerful 
for prototyping, but on its own, it's risky for production systems.
+
+**Agentic vibe coding** takes this further: instead of a single AI 
autocomplete, you orchestrate multiple AI agents — each with different 
strengths — under your architectural direction, with automated tests as the 
safety net. In my workflow:
+
+- **Claude Code (plan mode)**: Primary coding agent. Plan mode lets me review 
the approach before any code is generated. This is critical for architectural 
decisions — I steer the design, Claude handles the implementation.
+- **Gemini**: Code review, concurrency analysis, and verification reports. 
Gemini reviewed every major PR for thread-safety, feature parity, and edge 
cases.
+- **Codex**: Autonomous task execution for well-defined, bounded work items.
+
+The key insight: **AI writes the code, but the architect owns the design.** 
Without deep domain knowledge of SkyWalking's internals, no AI could have 
planned these changes. Without AI, I couldn't have executed them in 5 weeks.
+
+## How TDD Made AI Coding Safe
+
+The reason I could move this fast without breaking things comes down to one 
principle: **never let AI code without a test harness.**
+
+My workflow for each major change:
+
+1. **Plan mode first**: Describe the goal to Claude, review the plan, iterate 
on architecture before any code is written.
+2. **Write the test contract**: Define what "correct" means — for the compiler 
rewrites, this meant cross-version comparison tests that run every expression 
through both the old and new engines, asserting identical results across 1,290+ 
expressions.
+3. **Let AI implement**: With the test contract in place, Claude can write 
thousands of lines of implementation code. If it's wrong, the tests catch it 
immediately.
+4. **E2E as the final gate**: Every PR must pass the full E2E test suite — 
Docker-based integration tests that boot the entire server with real storage 
backends.
+5. **AI code review**: Gemini reviewed each PR for concurrency issues, 
thread-safety, and feature parity — catching things that unit tests alone 
wouldn't find.
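+
+Step 2 is the load-bearing one. A minimal sketch of the dual-path contract, 
with a hypothetical `assertParity` helper and stand-in engines (the real 
harness runs production DSL expressions through both the old and new 
compilers):

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a cross-version comparison harness: every expression
// is evaluated by both the legacy engine and its replacement, and any
// divergence fails loudly with the offending expression attached.
public class CrossVersionHarness {

    static void assertParity(List<String> expressions,
                             Function<String, Double> legacyEngine,
                             Function<String, Double> newEngine) {
        for (String expr : expressions) {
            double expected = legacyEngine.apply(expr);
            double actual = newEngine.apply(expr);
            if (Double.compare(expected, actual) != 0) {
                throw new AssertionError(
                    "divergence on \"" + expr + "\": legacy=" + expected
                        + " new=" + actual);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in engines: both just measure expression length here.
        Function<String, Double> legacy = s -> (double) s.length();
        Function<String, Double> rewritten = s -> (double) s.length();
        assertParity(List.of("cpm", "latency * 100", "avg(heap)"),
            legacy, rewritten);
        System.out.println("all expressions matched");
    }
}
```

+Because the contract is mechanical, the AI cannot "pass" it by massaging 
individual test cases: any behavioral difference between the engines surfaces 
immediately.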
+
+This is the opposite of "hope it works" vibe coding. The AI writes fast, the 
tests verify fast, and I steer the architecture. The feedback loop is tight 
enough that I can iterate on complex compiler code in minutes instead of days.
+
+## Lessons Learned
+
+**AI is a force multiplier, not a replacement.** Before any AI agent wrote a 
single line, a human had to define the replacement solution: *what* gets 
replaced, *how* it gets replaced, and — critically — *where the boundaries 
are*. Which APIs could break? The internal compilation pipeline was fair game 
for a complete overhaul. Which APIs must stay aligned? Every external-facing 
DSL syntax, every YAML configuration key, every metrics name and tag structure 
had to remain byte-for-byte ident [...]
+
+**Plan mode is non-negotiable for architectural work.** Letting AI jump 
straight to code on a compiler rewrite would be a disaster. Plan mode's 
strength is that it collects code context — scanning imports, tracing call 
chains, mapping class hierarchies — and uses that context to help me fill in 
implementation details I'd otherwise have to look up manually. But it can't 
tell you the design principles. That direction had to come from me, stated 
clearly upfront, so the AI's planning stayed  [...]
+
+**Know when to hit ESC.** Claude has a strong tendency to dive headlong into 
writing solution code once it starts — and it won't stop on its own when it 
encounters something that conflicts with the original plan's concept. Instead 
of pausing to flag the conflict, it will push forward, improvising around the 
obstacle in ways that silently violate the design intent. I had to learn to 
watch for this: when Claude's output started drifting from the plan, I'd 
manually cancel the task (ESC), call it [...]
+
+**Spec-driven testing is necessary but not sufficient — the logic workflow 
matters more.** It's tempting to think that if you define the input/output spec 
clearly enough, AI can fill in the implementation and tests will catch any 
mistakes. I tried this. It doesn't work for anything non-trivial. During the 
expression compiler rewrite, Claude would sometimes change code in unreasonable 
ways just to make the spec tests pass — the inputs went in, the expected 
outputs came out, and everything [...]
+
+**Testing at two levels kept the rewrite honest.** Cross-version testing was 
part of my design plan from the start — I architected the dual-path comparison 
framework so that every production DSL expression runs through both the old and 
new engines, asserting identical results across 1,290+ expressions. This gave 
me confidence no human review could match, and it was a deliberate planning 
decision: I knew AI-generated compiler code needed a mechanical proof of 
behavioral equivalence, not j [...]
+
+**Multiple AIs have different strengths.** Claude excels at large-scale code 
generation with plan mode. Gemini is exceptional at logic review — it can 
mentally trace code branches with given input data, simulating execution 
without actually running the code. This is significant for reviewing 
AI-generated code: Gemini would walk through a generated compiler method step 
by step, flagging where a null check was missing or where a branch would 
produce wrong output for a specific edge case. C [...]
+
+**The Mythical Man-Month still applies — and so does the Mythical 
Token-Month.** Brooks taught us that a task requiring 12 person-months does not 
mean 12 people can finish it in one month. The same law applies to AI: you 
cannot simply throw more tokens, more agents, or more parallel sessions at a 
problem and expect it to converge faster. Communication costs, coordination 
overhead, requirements analysis, and conceptual integrity — these software 
engineering fundamentals do not disappear j [...]
+
+## The Bigger Picture
+
+The agentic vibe coding approach worked because it combined AI's speed with 
human architectural judgment and automated test discipline. It's not magic — 
it's engineering, accelerated.
+
+Brooks also gave us "No Silver Bullet," and its core distinction matters more 
than ever: software complexity comes in two kinds. **Essential complexity** 
comes from the problem itself — the domain semantics, the behavioral contracts, 
the concurrency invariants. No tool can eliminate this; it must be understood, 
modeled, and reasoned about by someone who knows the domain. **Accidental 
complexity** comes from the tools and implementation — boilerplate code, manual 
refactoring across hundre [...]
+
+Qian Xuesen's (Tsien Hsue-shen's) *Engineering Cybernetics* offers another lens 
that proved surprisingly relevant. His core framework — **feedback**, 
**control**, **optimization** — describes how to keep complex systems running 
toward their target. AI vibe coding at full speed is like a hypersonic missile: 
extraordinarily fast, but without a guidance system it just creates a bigger 
crater in the wrong place. The feedback loop in my workflow was the test 
harness — cross-version tests and E2E [...]
+
+For more details or to share your own experience with agentic coding on 
production systems, feel free to reach me on 
[GitHub](https://github.com/wu-sheng).
