Re: request for help: LLM-based quality assurance

Pavel Cahyna Mon, 08 Jun 2026 08:11:13 -0700

On Mon, Jun 08, 2026 at 04:35:56PM +0200, Bruno Haible wrote:
> Hi Pavel,
> 
> Thanks for the details.
> 
> > Another run with a similar prompt yielded this:
> > 
> > --->8---
> > Review of commit 17dc60e624cd6fc3491f9cb002f760d60e66ce8b
> > "mbrtoc32, mbrtowc: do not optimze for uClibc-ng"
> > 
> > VERDICT: REGRESSION INTRODUCED
> > 
> > The intended change — adding `&& !__UCLIBC__` to glibc-specific 
> > optimizations
> > in mbrtoc32.c and mbrtowc.c — is correct in principle. However, the `n == 0`
> > early-return guard in lib/mbrtowc.c (line 122) was replaced wholesale with 
> > the
> > condition from mbrtoc32.c instead of being surgically modified, introducing 
> > two
> > bugs.
> > 
> > Bug 1: Wrong macro name (HIGH severity, HIGH confidence)
> > ---------------------------------------------------------
> > The condition in mbrtowc.c changed from:
> > 
> >   # if MBRTOWC_EMPTY_INPUT_BUG || (GNULIB_WCHAR_SINGLE_LOCALE && __GLIBC__ 
> > >= 2)
> > 
> > to:
> > 
> >   # if (MBRTOC32_EMPTY_INPUT_BUG || _GL_SMALL_WCHAR_T \
> >         || (GNULIB_WCHAR_SINGLE_LOCALE && __GLIBC__ >= 2 && !__UCLIBC__))
> > 
> > MBRTOWC_EMPTY_INPUT_BUG was replaced with MBRTOC32_EMPTY_INPUT_BUG. These 
> > are
> > distinct autoconf macros defined in separate m4 files (m4/mbrtowc.m4 vs
> > m4/mbrtoc32.m4), testing different functions. On platforms where mbrtowc has
> > the empty-input bug but mbrtoc32 does not (e.g., CentOS 7, MSVC), the n == 0
> > guard is silently omitted, causing incorrect behavior in rpl_mbrtowc.
> > 
> > Bug 2: Spurious _GL_SMALL_WCHAR_T added (MEDIUM severity, HIGH confidence)
> > ---------------------------------------------------------------------------
> > _GL_SMALL_WCHAR_T was never used in mbrtowc.c before this commit. It is an
> > mbrtoc32.c-specific concern (defined in uchar.in.h, relevant to char32_t
> > conversions on platforms with 16-bit wchar_t). Adding it to mbrtowc.c causes
> > the n == 0 early return to fire on Windows/Cygwin platforms where it 
> > previously
> > did not — an unintended behavioral change.
> > 
> > Root cause
> > ----------
> > Copy-paste of the entire preprocessor condition from mbrtoc32.c line 137 
> > into
> > mbrtowc.c line 122, instead of only appending `&& !__UCLIBC__` to the 
> > existing
> > condition. The correct fix for mbrtowc.c should have been:
> > 
> >   # if (MBRTOWC_EMPTY_INPUT_BUG \
> >         || (GNULIB_WCHAR_SINGLE_LOCALE && __GLIBC__ >= 2 && !__UCLIBC__))
> 
> Good to see that it found the regression this time as well.


Sure. But it disagrees on whether the second finding is a problem or
not.

> > Note: This regression was subsequently fixed in commit 2ca51a77e6
> > ("mbrtowc: Fix handling of n==0 on CentOS 7, MSVC (regr. 2026-05-25)").
> 
> Hmm. Did it find the regression, ot did it look in newer commits...?

I realized as well that it could have cheated by looking into the newer
commits, so I checked the full conversation transcript and verified that
it looked at the log only after finding the bug. Of course, finding the
fix could still have influenced its judgement about the bug. For that
reason, in the next attempt I checked out the commit in question first.
This way, the subsequent fix is not reachable from the HEAD commit. That
was the method I used to produce the results in the first e-mail, and as
you saw, it still produced good results. But I don't know to what extent
the differences between the runs are explained by having access to the
newer commits.

> > The lack of reproducibility (obtaining quite different outputs for
> > prompts with no material difference, or no difference at all) is IMO a
> > major obstacle for using LLMs.
> 
> Well, we are used to this lack of reproducibility from human work (e.g.
> emails) anyway. So, IMO, for code reviews it does not matter much. The
> main area where lack of reproducibility is a problem is for people who do
> vibe coding: it forces the generated code to be considered as "source code".
> (Whereas when you use a compiler to transform C code to x86_64 machine code,
> it is reproducible and therefore the .o files don't need to be considered
> as "source code".)

The problem is when you want to tweak/debug your prompt, especially for
automated/mass use: you don't know whether the differences in results are due
to your tweaks or just the general irreproducibility.

> > My first attempt produced a different result:
> > --->8---
> > (...)
> >   5. MUSL_LIBC detection fragility (medium concern): This is the most 
> > notable
> >   issue. MUSL_LIBC is only defined via m4/musl.m4 through the mbszero 
> > module's
> >   configure machinery — musl provides no compiler-predefined macro.
> 
> Indeed, that is a problem that I fixed in a subsequenct commit
> dfc349cede4dcf4bb033420f61f70dd5e73b7631. But again, I don't know whether
> it found this issue by looking at the newer commits...

I looked at the complete transcript. The assistant has not looked at the
git log, so it has not found this by looking at newer commits (when it
does that, it says that the issue is fixed in a newer commit...)

Interestingly, the output from the review subagent is perhaps more
informative than the summary provided in the main session:

       (f) MUSL_LIBC detection fragility -- MEDIUM severity, conditional 
regression risk

       MUSL_LIBC is defined only through gl_MUSL_LIBC in m4/musl.m4, invoked by 
the mbszero module's configure.ac. musl deliberately provides no 
compiler-defined identifying macro.

       The commit changed the documented contract: the old comment listed 
mbszero as a module that USE_C_LOCALE consumers "might avoid"; the new comment 
removed mbszero from that avoidable list and added "they rely on the mbszero 
module defining MUSL_LIBC as needed."

       If an existing USE_C_LOCALE consumer on musl was previously omitting 
mbszero (as the old comment permitted), after this commit MUSL_LIBC would be 
undefined, C_LOCALE_MIGHT_BE_MULTIBYTE would be false, and the code would fall 
back to single-byte treatment -- exactly the bug the commit aims to fix. The fix
       silently fails to apply for such consumers without any compile-time 
diagnostic.

       In contrast, __ANDROID__ and __APPLE__ && __MACH__ are 
compiler-predefined macros that require no configure-time detection, making the 
fix robust on those platforms.

       This is not a regression in the strict sense (the old code was also 
broken on musl), but it is a fragility in the fix. A _Static_assert or #warning 
when USE_C_LOCALE && !defined MUSL_LIBC on a platform that might be musl could 
help catch this. However, since musl provides no compiler macro to test against,
        there's no perfect solution.

> > I have not provided any Gnulib specifics at all. It has figured out
> > everything by itself.
> 
> That is impressing. I would guess that with an explanation of the
> Gnulib conventions (the modules and the meaning of each part of the
> module descriptions [available in the documentation]) we would get
> even better results.

Perhaps. We can try that (I suppose that irreproducibility will
start becoming a problem at this point ...)

> > I found that even this simple prompt:
> > Review commit $commit in this repository.
> > Focus on possible functionality regressions, not on minor/style issues.
> > 
> > produced good results.
> 
> That's remarkable!

Indeed, LLMs present a surprising combination of advanced skills and
gross incompetence :)

Pavel

Re: request for help: LLM-based quality assurance

Reply via email to