Re: request for help: LLM-based quality assurance

Tim Rühsen Tue, 09 Jun 2026 02:01:10 -0700

Hi Bruno

Another experiment with Claude Code (closed-source agent) connected to
the same local LLM as before found the issue in commit
17dc60e624cd6fc3491f9cb002f760d60e66ce8b:


"Important caveat: The original commit 17dc60e624 introduced a bug in
mbrtowc.c — it replaced MBRTOWC_EMPTY_INPUT_BUG with
MBRTOC32_EMPTY_INPUT_BUG (wrong macro name) and added _GL_SMALL_WCHAR_T
(mbrtoc32-specific). This was fixed by follow-up commit 2ca51a77e6. Both
commits should be evaluated together."


Well, if it knows about the follow-up commit, we don't know whether it
found the regression independently of that.

Good point, it is likely as you say. To confirm, I ran git reset --hard to the commit under review, and Claude + Qwen3.6-35B no longer reported this bug. I'll (slowly) continue my testing and let you know if I find a usable OSS/open-weight agent+model combination.

It produced a longish result/summary explaining the issue in detail.


Too much detail is not advisable. After all, which is cheaper: reviewing
a 20-line patch or reading a 50-line analysis report?

This is a matter of what you ask for. LLMs even allow you to generate machine-readable output (as JSON).
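
A minimal sketch of the machine-readable route, assuming the agent is asked to reply with a JSON object of the shape used below. The field names (findings, file, line, severity, summary) and the example reply are made up for illustration; they are not the output format of any particular agent or model.

```python
import json

# Hypothetical schema: each finding must carry these keys with these types.
REQUIRED = {"file": str, "line": int, "severity": str, "summary": str}


def parse_findings(reply_text):
    """Parse an LLM reply that is expected to be JSON.

    Returns the list of findings, or raises ValueError if the reply
    does not match the (hypothetical) schema above.
    """
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("findings"), list):
        raise ValueError("expected an object with a 'findings' list")
    for item in data["findings"]:
        if not isinstance(item, dict):
            raise ValueError(f"finding is not an object: {item!r}")
        for key, typ in REQUIRED.items():
            if not isinstance(item.get(key), typ):
                raise ValueError(f"finding lacks a valid {key!r}: {item!r}")
    return data["findings"]


def one_line_report(findings):
    """Render each finding as a single compiler-style line."""
    return [
        f"{f['file']}:{f['line']}: [{f['severity']}] {f['summary']}"
        for f in findings
    ]


# Invented example reply; the line number is a placeholder, not a real location.
example_reply = json.dumps({
    "findings": [{
        "file": "mbrtowc.c",
        "line": 1,
        "severity": "high",
        "summary": "MBRTOC32_EMPTY_INPUT_BUG used instead of MBRTOWC_EMPTY_INPUT_BUG",
    }]
})
```

With something like this, a 50-line analysis collapses to one line per finding via one_line_report(parse_findings(example_reply)), and a reply that is not valid JSON is rejected instead of being read by a human.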

In general, we should strive for locally running open-weight models.


I agree.

If it's not locally running, it's SaaS, and what people report is that
the quality slowly gets worse over time, until a new version of the LLM
is released, at which moment the quality rises again - but with unpredictable
effects on the particular workflow.

Newer models get better in general, but the claim that older models get worse over time doesn't match my experience. They are sometimes re-tuned, and depending on your tasks, you may perceive that as better or worse. Even changes to the underlying hardware may change the results. That's also one of the reasons why you can't get reproducible results from SaaS models.

Occasionally it happens that the closed-source alternative is more usable
than the open-source one. For instance, I did not get a good experience
with GitLab CI (open-source), because I could not find out how to store
large log files in the case of failed builds. Whereas in GitHub the handling
of the log files is not perfect either, but at least reasonable.

For LLMs, I'd say that SaaS models in general are much better than (small) open-weight models. It's a matter of compute power / resources.

Regarding GitLab, did you try to tee the build log into a file and upload it as a job artifact? The SaaS limit is 5GB. Well, I definitely missed the discussion/decision; do you have a link at hand?
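
To illustrate the tee part, here is a small sketch in Python, assuming the build is started through a wrapper script. The function name run_and_tee and the log file name are my own choices. One reason to use a wrapper is that a plain "cmd | tee build.log" in a shell reports tee's exit status unless pipefail is set, so a failed build could look successful. The upload half would still be a job-level artifacts entry in .gitlab-ci.yml (as far as I remember, with paths and when: on_failure), which is not shown here.

```python
import subprocess
import sys


def run_and_tee(cmd, log_path):
    """Run cmd, echo its combined stdout/stderr, and copy it to log_path.

    Returns the exit status of cmd itself, so the CI job still fails
    when the build fails.
    """
    with open(log_path, "w", encoding="utf-8", errors="replace") as log:
        with subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the same stream
            text=True,
            errors="replace",          # do not crash on odd bytes in the log
        ) as proc:
            for line in proc.stdout:
                sys.stdout.write(line)  # keep the normal job log
                log.write(line)         # and keep a full copy on disk
            return proc.wait()
```

A job script could then end with sys.exit(run_and_tee(["make", "check"], "build.log")), where the make target is only an example.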

But here, in the LLM space, the situation is different: If you stick
to an SaaS model, you are forced to update the prompt or AGENTS.md file
every two months. Which may be unreasonably costly in the long run.

The AGENTS.md should be relatively vendor/model agnostic. It should describe the directory/file layout, or in general it should contain project-specific information that helps LLMs to reduce token costs (and thus give you answers faster and with higher quality).

The vendor/model-specific prompts are contained in the system prompt of the agent. For example, Claude Code works best with Anthropic models, because its system prompt is tuned for them. The Gemini agent works best with Google's Gemini models. Etc.

Still, you can use these agents with any model, also with local models.

They all read the same AGENTS.md file.

Tim
