On Sat, May 02, 2026 at 12:38:08AM +1000, raf <[email protected]> wrote:

> Hi,
> 
> I have a locale directory and an older locale-bak
> directory in a project's source directory. They both
> contain gettext .po files that are all utf8-encoded.
> 
> Why might "diff -durpN locale-bak locale" produce a
> non-utf8-encoded diff file?
> 
> The file(1) command reports all of the .po files like this:
> 
>   GNU gettext message catalogue, Unicode text, UTF-8 text
> 
> It reports the resulting diff file like this:
> 
>   unified diff output, Non-ISO extended-ASCII text, with LF, NEL line terminators
> 
> This is with diff (GNU diffutils) 3.8 on debian12 and
> diff (GNU diffutils) 3.12 on macos-10.14.
> 
> Any idea what I'm doing wrong to make this happen?
> I would expect the diff output to be utf8-encoded
> (and readable in vim).
> 
> Hmm, if I leave out the -p option, it works correctly,
> and the diff output is utf8-encoded.
> 
> Admittedly, I don't need the -p option for gettext .po
> files, but I always use the shell alias d='diff -durpN'
> and it usually does no harm with non-C files.
> 
> Using diff -p on two gettext .po files does produce
> valid utf8-encoded output, but diff -rp on two
> directories containing gettext .po files doesn't.
> 
> I tried again with a single language's translation in
> both directories, and it produced correct utf8. The
> original directories had 47 language directories each.
> So it doesn't always happen. I don't know how many
> files it takes for the problem to occur.
> 
> In case it's helpful, I've put a temporary copy of the
> two locale directories and the resulting diff output at
> raf.org/tmp/diffutils-rp-utf8.tar.gz (460K).
> 
> cheers,
> raf

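Running diff with -p on the directories in that tarball
produces "illegal bytes" (invalid UTF-8) in locale.diff on
the following lines. Each one is a hunk header whose -p
context text stops partway through a multi-byte character: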

  257: @@ -450,9 +471,11 @@ msgstr "пътят е твърде дъ�
  443: @@ -718,9 +737,25 @@ msgstr "কমান্ডটি খু�
  1271: @@ -450,9 +471,11 @@ msgstr "η διαδρομή είναι �
  1304: @@ -718,9 +738,25 @@ msgstr "η εντολή είναι πο�
  2443: @@ -505,7 +528,7 @@ msgstr "अमान्य -f विक�
  6444: @@ -718,9 +737,25 @@ msgstr "கட்டளை மிகப�
  7548: @@ -152,11 +169,11 @@ msgstr "  -i           - 包含 inode �
  7562: @@ -244,7 +261,7 @@ msgstr "                   @ 符号链�
  7660: @@ -528,7 +551,7 @@ msgstr "最终条件表达式后应为�
  7669: @@ -581,7 +604,7 @@ msgstr "函数参数列表中应为“)�
  7696: @@ -617,7 +640,7 @@ msgstr "预期纳秒数,实际测量�
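
A list like the one above can be produced by decoding the diff one line
at a time. The following is a minimal Python sketch, not the method
actually used for this message. The function name invalid_utf8_lines is
made up for illustration, and the sketch assumes lines end in LF only.

```python
def invalid_utf8_lines(data: bytes):
    """Return (line_number, raw_line) for every LF-separated line
    that cannot be decoded as UTF-8. Line numbers start at 1."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


# Small in-memory sample: line 2 ends with a lone lead byte (0xD0),
# i.e. the first half of a two-byte Cyrillic character.
sample = b"plain ascii line\n" + "дъ".encode("utf-8") + b"\xd0\n"

for number, line in invalid_utf8_lines(sample):
    # errors="replace" shows U+FFFD where the broken bytes are
    print(f"{number}: {line.decode('utf-8', errors='replace')}")
```

To check a real file, read it in binary mode, for example
open("locale.diff", "rb").read(), and pass the bytes to the function.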

Maybe the -p code truncates its output without making sure it
ends on a whole code point? Just guessing.
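
To illustrate the guess: the visible part of the context text on line
257 is 39 bytes long in UTF-8, so a cut after 40 bytes would land one
byte into the next two-byte character. The Python sketch below contrasts
a byte-count cut with a cut that backs up to a code point boundary. The
40-byte limit is inferred from that sample line only and has not been
checked against the diffutils source. The trailing "л" is a made-up
stand-in for the character that was cut off, and both function names
are hypothetical.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Naive cut: keep the first `limit` bytes of the UTF-8 encoding."""
    return text.encode("utf-8")[:limit]


def truncate_code_points(text: str, limit: int) -> bytes:
    """Careful cut: keep at most `limit` bytes, never splitting a code point."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return data
    end = limit
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the first dropped
    # byte is one, the cut falls inside a character, so back up to its start.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


visible = 'msgstr "пътят е твърде дъ'   # text shown on line 257 before the bad byte
line = visible + "л"                     # assumption: stand-in for the next character
LIMIT = 40                               # assumption: inferred from the sample line

naive = truncate_bytes(line, LIMIT)
careful = truncate_code_points(line, LIMIT)

print(len(visible.encode("utf-8")))            # byte length of the visible text
print(naive.decode("utf-8", errors="replace"))  # ends in U+FFFD
print(careful.decode("utf-8"))                  # decodes cleanly
```

The careful version gives up at most three bytes of context per line,
since a UTF-8 character has at most three continuation bytes.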

cheers,
raf



