>>>>> Ivan Krylov via R-devel writes:

> Hello R-devel,
> I've been watching the development of automatic Rd bibliography
> generation with great interest and I'm looking forward to using
> \bibcitet{...} and \bibshow{*} in my packages. 

Thanks! :-)

> Currently, non-ASCII characters used in the citation keys prevent R
> from successfully compiling when the current locale encoding is unable
> to represent them:

> % touch src/library/stats/man/factanal.Rd && LC_ALL=C make
> ...
> installing parsed Rd
> make[3]: Entering directory '.../src/library'
>   base
> Error: factanal.Rd:99: (converted from warning) Could not find
> bibentries for the following keys: %s
>   'R:J<U+00F6>reskog:1963'
> Execution halted
> make[3]: *** [Makefile:76: stats.Rdts] Error 1

> But as long as the locale encoding can represent the key, it's fine:

> % touch src/library/stats/man/factanal.Rd && \
>  LC_ALL=en_GB.iso885915 luit make
> (works well without a UTF-8 locale)

Oh dear.  I thought we had coverage for this from building daily
snapshots with LC_ALL=C, but apparently not.  There have been 10
non-ASCII keys so far; for now I have changed them all to ASCII.

But clearly, when a package declares its Rd files to be in UTF-8 one
would expect that Sexpr macros can also take UTF-8, but that's not so
simple given that it involves calling the R parser.  Your suggested
change looks good to me: non-UTF-8 MBCS locales have a problem with
parse(encoding = "UTF-8"), but I don't think we have real coverage for
these.

(AFAICS, in principle it might be nice to make these "work" by writing
the code to a tempfile, parsing from there with re-encoding, and at the
end running enc2utf8() on all strings obtained, but that's not so
simple ...)

Anyway, we need to discuss this a bit more within R Core.  For now,
things "work" again with LC_ALL=C.

(My regular checks use C.UTF-8, but I am not sure how universally
available this is?)

Best


> I think this can be made to work by telling tools:::process_Rd() ->
> tools:::processRdChunk() to parse character strings in R code as UTF-8:

> Index: src/library/tools/R/RdConv2.R
> ===================================================================
> --- src/library/tools/R/RdConv2.R     (revision 88617)
> +++ src/library/tools/R/RdConv2.R     (working copy)
> @@ -229,8 +229,8 @@
>       code <- structure(code[tags != "COMMENT"],
>                         srcref = codesrcref) # retain for error locations
>       chunkexps <- tryCatch(
> -         parse(text = sub("\n$", "", as.character(code)),
> -               keep.source = options$keep.source),
> +         parse(text = sub("\n$", "", enc2utf8(as.character(code))),
> +               keep.source = options$keep.source, encoding = "UTF-8"),
>           error = function (e) stopRd(code, Rdfile, conditionMessage(e))
>       )
 
> That enc2utf8() may be extraneous, since tools::parse_Rd() is
> documented to convert text to UTF-8 while parsing. The downsides are,
> of course, parse(encoding=...) not working with MBCS locales and the
> ever-present danger of breaking some user code that depends on the
> current behaviour (this was tested using 'make check-devel', not on
> CRAN packages).

> Should R compile under LC_ALL=C? Maybe it's time for people whose
> builds are failing to switch the continuous integration containers from
> C to C.UTF-8?

> -- 
> Best regards,
> Ivan

> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
