This is an automated email from the ASF dual-hosted git repository.
wusheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/skywalking-website.git
The following commit(s) were added to refs/heads/master by this push:
new 7ab57b786c4 Add blog post: Agentic Vibe Coding in a Mature OSS Project
7ab57b786c4 is described below
commit 7ab57b786c403c3d174a05ec2fbbefe4a13dc2b0
Author: Wu Sheng <[email protected]>
AuthorDate: Sun Mar 8 23:40:18 2026 +0800
Add blog post: Agentic Vibe Coding in a Mature OSS Project
---
.../blog/2026-03-08-agentic-vibe-coding/index.md | 81 ++++++++++++++++++++++
1 file changed, 81 insertions(+)
diff --git a/content/blog/2026-03-08-agentic-vibe-coding/index.md
b/content/blog/2026-03-08-agentic-vibe-coding/index.md
new file mode 100644
index 00000000000..cbf8f1e5bad
--- /dev/null
+++ b/content/blog/2026-03-08-agentic-vibe-coding/index.md
@@ -0,0 +1,81 @@
+---
+title: "Agentic Vibe Coding in a Mature OSS Project: What Worked, What Didn't"
+date: 2026-03-08
+author: Sheng Wu
+description: "What happens when you apply agentic AI coding to a mature
open-source project with real users, real compatibility contracts, and real
consequences? 77K lines changed in 5 weeks — here's what I learned."
+tags:
+- AI
+- Agentic Coding
+- Vibe Coding
+- TDD
+- Architecture
+- Engineering Productivity
+- Open Source
+endTime: 2026-03-08T23:00:00Z
+---
+
+Most "vibe coding" stories start with a greenfield project. This one doesn't.
+
+Apache SkyWalking is a 9-year-old observability platform with hundreds of
production deployments, a complex DSL stack, and an external API surface that
users have built dashboards, alerting rules, and automation scripts against.
When I decided to replace the core scripting engine — purging the Groovy
runtime from four DSL compilers — the constraint wasn't "can AI write the
code?" It was: "can AI write the code without breaking anything for existing
users?"
+
+The answer turned out to be yes — **~77,000 lines changed across 10 major PRs
in about 5 weeks** — but only because the AI was tightly guided by a human who
understood the project's architecture, its compatibility contracts, and its
users. This post is about the methodology: what worked, what didn't, and what
mature open-source maintainers should know before handing their codebase to AI
agents.
+
+## The Project in Brief
+
+The task was to replace SkyWalking's Groovy-based scripting engines (MAL, LAL,
Hierarchy) with a unified ANTLR4 + Javassist bytecode compilation pipeline,
matching the architecture already proven by the OAL compiler. The internal tech
stack was completely overhauled; the external interface had to remain identical.
+
+Beyond the compiler rewrites, the scope included a new queue infrastructure
(threads dropped from 36 to 15), virtual thread support for JDK 25+, and E2E
test modernization. By conventional estimates, this was 5-8 months of senior
engineer work.
+
+For the full technical details on the compiler architecture, see the [Groovy
elimination discussion](https://github.com/apache/skywalking/discussions/13716).
+
+## What is Agentic Vibe Coding?
+
+"Vibe coding" — a term coined by Andrej Karpathy — describes a style of
programming where you describe intent and let AI write the code. It's powerful
for prototyping, but on its own, it's risky for production systems.
+
+**Agentic vibe coding** takes this further: instead of a single AI
autocomplete, you orchestrate multiple AI agents — each with different
strengths — under your architectural direction, with automated tests as the
safety net. In my workflow:
+
+- **Claude Code (plan mode)**: Primary coding agent. Plan mode lets me review
the approach before any code is generated. This is critical for architectural
decisions — I steer the design, Claude handles the implementation.
+- **Gemini**: Code review, concurrency analysis, and verification reports.
Gemini reviewed every major PR for thread-safety, feature parity, and edge
cases.
+- **Codex**: Autonomous task execution for well-defined, bounded work items.
+
+The key insight: **AI writes the code, but the architect owns the design.**
Without deep domain knowledge of SkyWalking's internals, no AI could have
planned these changes. Without AI, I couldn't have executed them in 5 weeks.
+
+## How TDD Made AI Coding Safe
+
+The reason I could move this fast without breaking things comes down to one
principle: **never let AI code without a test harness.**
+
+My workflow for each major change:
+
+1. **Plan mode first**: Describe the goal to Claude, review the plan, iterate
on architecture before any code is written.
+2. **Write the test contract**: Define what "correct" means — for the compiler
rewrites, this meant cross-version comparison tests that run every expression
through both the old and new engines, asserting identical results across 1,290+
expressions.
+3. **Let AI implement**: With the test contract in place, Claude can write
thousands of lines of implementation code. If it's wrong, the tests catch it
immediately.
+4. **E2E as the final gate**: Every PR must pass the full E2E test suite —
Docker-based integration tests that boot the entire server with real storage
backends.
+5. **AI code review**: Gemini reviewed each PR for concurrency issues,
thread-safety, and feature parity — catching things that unit tests alone
wouldn't find.
+
+This is the opposite of "hope it works" vibe coding. The AI writes fast, the
tests verify fast, and I steer the architecture. The feedback loop is tight
enough that I can iterate on complex compiler code in minutes instead of days.
+
+## Lessons Learned
+
+**AI is a force multiplier, not a replacement.** Before any AI agent wrote a
single line, a human had to define the replacement solution: *what* gets
replaced, *how* it gets replaced, and — critically — *where the boundaries
are*. Which APIs could break? The internal compilation pipeline was fair game
for a complete overhaul. Which APIs must stay aligned? Every external-facing
DSL syntax, every YAML configuration key, every metrics name and tag structure
had to remain byte-for-byte ident [...]
+
+**Plan mode is non-negotiable for architectural work.** Letting AI jump
straight to code on a compiler rewrite would be a disaster. Plan mode's
strength is that it collects code context — scanning imports, tracing call
chains, mapping class hierarchies — and uses that context to help me fill in
implementation details I'd otherwise have to look up manually. But it can't
tell you the design principles. That direction had to come from me, stated
clearly upfront, so the AI's planning stayed [...]
+
+**Know when to hit ESC.** Claude has a clear tendency to dive deep into
solution code writing once it starts — and it won't stop on its own when it
encounters something that conflicts with the original plan's concept. Instead
of pausing to flag the conflict, it will push forward, improvising around the
obstacle in ways that silently violate the design intent. I had to learn to
watch for this: when Claude's output started drifting from the plan, I'd
manually cancel the task (ESC), call it [...]
+
+**Spec-driven testing is necessary but not sufficient — the logic workflow
matters more.** It's tempting to think that if you define the input/output spec
clearly enough, AI can fill in the implementation and tests will catch any
mistakes. I tried this. It doesn't work for anything non-trivial. During the
expression compiler rewrite, Claude would sometimes change code in unreasonable
ways just to make the spec tests pass — the inputs went in, the expected
outputs came out, and everything [...]
+
+**Testing at two levels kept the rewrite honest.** Cross-version testing was
part of my design plan from the start — I architected the dual-path comparison
framework so that every production DSL expression runs through both the old and
new engines, asserting identical results across 1,290+ expressions. This gave
me confidence no human review could match, and it was a deliberate planning
decision: I knew AI-generated compiler code needed a mechanical proof of
behavioral equivalence, not j [...]
+
+**Multiple AIs have different strengths.** Claude excels at large-scale code
generation with plan mode. Gemini is exceptional at logic review — it can
mentally trace code branches with given input data, simulating execution
without actually running the code. This is significant for reviewing
AI-generated code: Gemini would walk through a generated compiler method step
by step, flagging where a null check was missing or where a branch would
produce wrong output for a specific edge case. C [...]
+
+**The Mythical Man-Month still applies — and so does the Mythical
Token-Month.** Brooks taught us that a task requiring 12 person-months does not
mean 12 people can finish it in one month. The same law applies to AI: you
cannot simply throw more tokens, more agents, or more parallel sessions at a
problem and expect it to converge faster. Communication costs, coordination
overhead, requirements analysis, and conceptual integrity — these software
engineering fundamentals do not disappear j [...]
+
+## The Bigger Picture
+
+The agentic vibe coding approach worked because it combined AI's speed with
human architectural judgment and automated test discipline. It's not magic —
it's engineering, accelerated.
+
+Brooks also gave us "No Silver Bullet," and its core distinction matters more
than ever: software complexity comes in two kinds. **Essential complexity**
comes from the problem itself — the domain semantics, the behavioral contracts,
the concurrency invariants. No tool can eliminate this; it must be understood,
modeled, and reasoned about by someone who knows the domain. **Accidental
complexity** comes from the tools and implementation — boilerplate code, manual
refactoring across hundre [...]
+
+Qian Xuesen(Tsien Hsue-shen)'s *Engineering Cybernetics* offers another lens
that proved surprisingly relevant. His core framework — **feedback**,
**control**, **optimization** — describes how to keep complex systems running
toward their target. AI vibe coding at full speed is like a hypersonic missile:
extraordinarily fast, but without a guidance system it just creates a bigger
crater in the wrong place. The feedback loop in my workflow was the test
harness — cross-version tests and E2E [...]
+
+For more details or to share your own experience with agentic coding on
production systems, feel free to reach me on
[GitHub](https://github.com/wu-sheng).