Re: request for help: LLM-based quality assurance

Pavel Cahyna Mon, 08 Jun 2026 06:04:33 -0700

Hi Bruno,

On Fri, Jun 05, 2026 at 05:33:46PM +0200, Bruno Haible wrote:
> Hi Pavel,
> 
> Many thanks for having done this experiment for us!


You are welcome! I have automated it, so I can easily run it for more
commits if there is interest (but I won't have the capacity to analyze
the results).

> > Here are the findings:
> > 
> > > 6aae65332edc58970c0dc287b28c8f1f6250282a
> > 
> > Review: commit 6aae65332e — "quotearg: lessen dependencies"
> > 
> > Verdict: No functional regressions found.
> > 
> > This commit is a straightforward dead-dependency cleanup:
> > 
> > - c-strcaseeq.h removal — The STRCASEEQ macro was eliminated from quotearg.c
> >   in an earlier commit (751e42b0bc, Sep 2025) that rewrote gettext_quote to 
> > use
> >   mbrtoc32 instead of locale_charset() comparisons. The #include was left
> >   behind; this commit removes it.
> > 
> > - memcmp dependency removal — memcmp was replaced by memeq in another 
> > earlier
> >   commit (6ca831b019, Sep 2025). The module dependency was left behind; this
> >   commit removes it. The memeq dependency remains correctly listed.
> > 
> > Both removals are verified: zero occurrences of STRCASEEQ/CASEEQ/memcmp 
> > exist
> > in the post-commit quotearg.c. The remaining dependency list in
> > modules/quotearg correctly reflects actual usage.
> 
> Perfectly correct.
> 
> > > 17dc60e624cd6fc3491f9cb002f760d60e66ce8b
> > 
> > Review of commit 17dc60e624cd6fc3491f9cb002f760d60e66ce8b
> > "mbrtoc32, mbrtowc: do not optimze for uClibc-ng"
> > 
> > Summary: The commit adds `&& !__UCLIBC__` guards to prevent glibc-specific
> > optimizations from running on uClibc-ng, which defines __GLIBC__ but is not
> > actually glibc. The mbrtoc32.c changes are correct. However, mbrtowc.c
> > contains a copy-paste regression.
> > 
> > ---
> > 
> > BUG: Copy-paste error in lib/mbrtowc.c n==0 guard (lines 122-123)
> > 
> > The preprocessor guard for the `n == 0` early return was changed from:
> > 
> >   # if MBRTOWC_EMPTY_INPUT_BUG || (GNULIB_WCHAR_SINGLE_LOCALE && __GLIBC__ 
> > >= 2)
> > 
> > to:
> > 
> >   # if (MBRTOC32_EMPTY_INPUT_BUG || _GL_SMALL_WCHAR_T \
> >         || (GNULIB_WCHAR_SINGLE_LOCALE && __GLIBC__ >= 2 && !__UCLIBC__))
> > 
> > This line was copied verbatim from the corresponding guard in lib/mbrtoc32.c
> > (line 137), introducing two problems:
> > 
> > 1. MBRTOWC_EMPTY_INPUT_BUG was replaced by MBRTOC32_EMPTY_INPUT_BUG. These 
> > are
> >    separate macros defined by separate m4 tests (m4/mbrtowc.m4 vs
> >    m4/mbrtoc32.m4), probing different functions. On a system where mbrtowc()
> >    has the empty-input bug but mbrtoc32() does not, the guard will no longer
> >    trigger, and rpl_mbrtowc(pwc, s, 0, ps) will fall through to the buggy
> >    system mbrtowc() instead of returning (size_t) -2.
> > 
> > 2. _GL_SMALL_WCHAR_T was added, but this macro is defined only in
> >    lib/uchar.in.h, which mbrtowc.c does not include. This term is dead code
> >    in this compilation unit.
> > 
> > Note the inconsistency within the file itself: line 113 still correctly uses
> > MBRTOWC_EMPTY_INPUT_BUG in the s==NULL guard, while line 122 was changed to
> > the wrong macro.
> > 
> > The correct fix should have been:
> > 
> >   # if (MBRTOWC_EMPTY_INPUT_BUG \
> >         || (GNULIB_WCHAR_SINGLE_LOCALE && __GLIBC__ >= 2 && !__UCLIBC__))
> > 
> > Impact: Regression on systems where mbrtowc() has the empty-input bug but
> > mbrtoc32() does not. On such systems, calling rpl_mbrtowc() with n=0 will
> > no longer be handled correctly.
> 
> Perfectly correct as well! It detected the regression and explained it in
> understandable terms.
> 
> > Severity: Medium. The bug requires a specific combination of platform
> > characteristics (mbrtowc empty-input bug present, mbrtoc32 empty-input bug
> > absent), but on affected systems it silently produces incorrect behavior.
> 
> I'm not familiar with this "Severity" scale. Maybe other people can make use
> of it. Anyway, this paragraph is not wrong.

I don't think that this Severity scale is anything standardized, I would
say it is purely subjective (in less charitable terms: the model made it
up). Another run with a similar prompt yielded this:

--->8---
Review of commit 17dc60e624cd6fc3491f9cb002f760d60e66ce8b
"mbrtoc32, mbrtowc: do not optimze for uClibc-ng"

VERDICT: REGRESSION INTRODUCED

The intended change — adding `&& !__UCLIBC__` to glibc-specific optimizations
in mbrtoc32.c and mbrtowc.c — is correct in principle. However, the `n == 0`
early-return guard in lib/mbrtowc.c (line 122) was replaced wholesale with the
condition from mbrtoc32.c instead of being surgically modified, introducing two
bugs.

Bug 1: Wrong macro name (HIGH severity, HIGH confidence)
---------------------------------------------------------
The condition in mbrtowc.c changed from:

  # if MBRTOWC_EMPTY_INPUT_BUG || (GNULIB_WCHAR_SINGLE_LOCALE && __GLIBC__ >= 2)

to:

  # if (MBRTOC32_EMPTY_INPUT_BUG || _GL_SMALL_WCHAR_T \
        || (GNULIB_WCHAR_SINGLE_LOCALE && __GLIBC__ >= 2 && !__UCLIBC__))

MBRTOWC_EMPTY_INPUT_BUG was replaced with MBRTOC32_EMPTY_INPUT_BUG. These are
distinct autoconf macros defined in separate m4 files (m4/mbrtowc.m4 vs
m4/mbrtoc32.m4), testing different functions. On platforms where mbrtowc has
the empty-input bug but mbrtoc32 does not (e.g., CentOS 7, MSVC), the n == 0
guard is silently omitted, causing incorrect behavior in rpl_mbrtowc.

Bug 2: Spurious _GL_SMALL_WCHAR_T added (MEDIUM severity, HIGH confidence)
---------------------------------------------------------------------------
_GL_SMALL_WCHAR_T was never used in mbrtowc.c before this commit. It is an
mbrtoc32.c-specific concern (defined in uchar.in.h, relevant to char32_t
conversions on platforms with 16-bit wchar_t). Adding it to mbrtowc.c causes
the n == 0 early return to fire on Windows/Cygwin platforms where it previously
did not — an unintended behavioral change.

Root cause
----------
Copy-paste of the entire preprocessor condition from mbrtoc32.c line 137 into
mbrtowc.c line 122, instead of only appending `&& !__UCLIBC__` to the existing
condition. The correct fix for mbrtowc.c should have been:

  # if (MBRTOWC_EMPTY_INPUT_BUG \
        || (GNULIB_WCHAR_SINGLE_LOCALE && __GLIBC__ >= 2 && !__UCLIBC__))

Note: This regression was subsequently fixed in commit 2ca51a77e6
("mbrtowc: Fix handling of n==0 on CentOS 7, MSVC (regr. 2026-05-25)").
--->8---

See - now it claimed HIGH severity for the same problem. It also claimed
that the second problem is "an unintended behavioral change.", while
another attempt produced this result:

--->8---
2. _GL_SMALL_WCHAR_T was added, but this macro is only meaningful in mbrtoc32.c
   (where it gates a dedicated code path at line 294). In mbrtowc.c there is no
   _GL_SMALL_WCHAR_T code path; the macro is irrelevant. This is harmless in
   practice (it would only cause a redundant early return that happens to be
   correct), but it is unintended.
--->8---

So, now it claims that it is harmless in practice because the return
happens to be correct.

The lack of reproducibility (obtaining quite different outputs for
prompts with no material difference, or no difference at all) is IMO a
major obstacle for using LLMs.

> > > 53d4558960659ba7c4e9e2757bfb0977a5027fae
> > 
> > Review of commit 53d4558960 — "quotearg: be nicer on macOS etc"
> > 
> > Overall: No functional regressions found. The commit correctly addresses
> > macOS/Android/musl C locales that use UTF-8 by introducing a
> > C_LOCALE_MIGHT_BE_MULTIBYTE macro and falling back to <wchar.h> / <wctype.h>
> > functions.
> > 
> > Findings
> > --------
> > 
> > 1. ChangeLog entry: missing closing paren (Low / cosmetic)
> > 
> > The commit message has "(wch, mbstate, mbs_clear, mbrtowch, chisprint,
> > wchisprint" followed by "(quotearg_buffer_restyled):" — missing "):" after
> > "wchisprint". The same typo is in the ChangeLog file. Documentation-only,
> > no functional impact.
> > 
> > 2. Preprocessor indentation inconsistency (Low / cosmetic)
> > 
> > Line 73: "# include <wctype.h>" uses one space after #, but it sits at
> > nesting depth 2 (inside #else -> # if USE_C_LOCALE). The sibling at line 76,
> > "#  include <uchar.h>", correctly uses two spaces. Should be
> > "#  include <wctype.h>" per the file's convention. No functional impact.
> > 
> > 3. Dead (size_t)-3 guards when using mbrtowc (Informational)
> > 
> > When USE_C_LOCALE && C_LOCALE_MIGHT_BE_MULTIBYTE, mbrtowch calls mbrtowc
> > which never returns (size_t)-3. The #if !GNULIB_MBRTOC32_REGULAR guards that
> > handle -3 become dead code. The mbsinit loop-control guard is redundant but
> > correct (mbsinit always returns true after a complete mbrtowc conversion).
> > No incorrect behavior is possible — this is functionally harmless dead code.
> > 
> > No critical or medium-severity issues. The core logic — preprocessor
> > partitioning, type selection, unibyte_locale fix, and mbrtowc/iswprint
> > fallbacks — is correct.
> 
> I had not seen finding 1.
> Finding 2 I had seen and corrected in a follow-up commit.
> Finding 3: OK, it explains that it is "functionally harmless dead code".
> However, it did miss the regression: The function c_isprint is used
> if USE_C_LOCALE is true; however the include of <c-ctype.h> was changed
> from  USE_C_LOCALE  to  USE_C_LOCALE && !C_LOCALE_MIGHT_BE_MULTIBYTE.

My first attempt produced a different result:
--->8---
(...)
  5. MUSL_LIBC detection fragility (medium concern): This is the most notable
  issue. MUSL_LIBC is only defined via m4/musl.m4 through the mbszero module's
  configure machinery — musl provides no compiler-predefined macro. The old
  comment listed mbszero as a module that USE_C_LOCALE consumers "might avoid";
  the new comment explicitly says they must keep it. If an existing consumer on
  musl was previously omitting mbszero (as the old docs permitted), MUSL_LIBC
  would be undefined, C_LOCALE_MIGHT_BE_MULTIBYTE would be false, and the
  single-byte fallback would still be used — silently failing to apply the fix.
  There is no compile-time diagnostic for this case. This isn't a regression
  (the old code was also broken on musl), but it's a fragility in the fix that
  could leave musl consumers in the broken state without warning.
  6. <wctype.h> include scope (no issue): Correctly guarded by # if USE_C_LOCALE
   inside the #else block — only included on the new middle path, not on the
  !USE_C_LOCALE path. Minor indentation inconsistency (# include vs #  include)
  is cosmetic only.
 
  Verdict
 
  The commit is sound. The main thing worth flagging to the author is the musl
  detection fragility (finding 5) — a #warning or _Static_assert when
  USE_C_LOCALE is set but MUSL_LIBC is not defined on a potentially-musl system
  could help catch misconfigured consumers, though musl's deliberate lack of a
  predefined macro makes this hard to enforce perfectly.
--->8---
Not sure whether it gets any closer to the real regression.

> Finding one regression out of two with such a tool, is impressing.
> 
> Two questions:
> 
>   * Where did it get the information about the Gnulib modules from?
>     "The file already includes `<stdint.h>` (and depends on the gnulib
>      `stdint-h` module) ..."
>     Is it something that you explained in a prompt? Or did it "digest"
>     the Gnulib documentation first?
>     More generally, what's the size of the Gnulib specifics in the prompt
>     that you had to provide?

I have not provided any Gnulib specifics at all. It has figured out
everything by itself. The prompt I used was merely this:

Review commit $commit in this repository, the commit diff is above. The
currently checked out revision may be different, so always read files
using `git show`. We are already in the repository, so any argument to
git of type `-C <current directory>` is unnecessary. Focus on possible
functionality regressions, not on minor/style issues. First examine the
commit to get understanding and then spawn a review agent to do the deep
analysis. Then, fan out subagents to review the finding for correctness,
as the finding may contain both false negative and false positive
results. Use one subagent ber bug found, or just one subagent if the
review found no bugs. Report the final verdict. Finally, save an
identical copy of the final verdict as file review-commit-<8-letter
short commit id>.txt.

I found that even this simple prompt:
Review commit $commit in this repository.
Focus on possible functionality regressions, not on minor/style issues.

produced good results. I then added the part about using subagents to
review the findings, hoping that this would improve the accuracy of the
findings. I also added instructions to avoid some common pitfalls that
the assistant encountered from time to time, and some instructions to
help automate the process.

In addition to the irreproducibility issues one may observe other issues
here: the model has problems following basic instructions. I told it to
"Focus on possible functionality regressions, not on minor/style
issues." but it still reports minor issues as well. What you can't see
in the output is also that it does not respect the instruction "save an
identical copy of the final verdict as file" because the content of the
file is different and consistently more verbose than what it produced in
the terminal.

>   * What's an estimate for the cost that the analysis of these 4 commits
>     costed you or your employer (excluding your salary's cost)?
>     I'd like to compare it to the CI costs. The GitLab CI and GitHub CI
>     are zero-cost for us, but nonetheless I like to have an estimate based
>     on the (processing time * number of CPUs * electricity cost).

I can't provide this, unfortunately. I can only give you exact
instructions and you can try yourself. Pricing may differ between
customers, anyway.

Best regards, Pavel

Re: request for help: LLM-based quality assurance

Reply via email to