On Sat, May 02, 2026 at 12:38:08AM +1000, raf <[email protected]> wrote:
> Hi,
>
> I have a locale directory and an older locale-bak
> directory in a project's source directory. They both
> contain gettext .po files that are all utf8-encoded.
>
> Why might "diff -durpN locale-bak locale" produce a
> non-utf8-encoded diff file?
>
> The file(1) command reports all of the .po files like this:
>
>   GNU gettext message catalogue, Unicode text, UTF-8 text
>
> It reports the resulting diff file like this:
>
>   unified diff output, Non-ISO extended-ASCII text, with LF, NEL line
>   terminators
>
> This is with diff (GNU diffutils) 3.8 on debian12 and
> diff (GNU diffutils) 3.12 on macos-10.14.
>
> Any idea what I'm doing wrong to make this happen?
> I would expect the diff output to be utf8-encoded
> (and readable in vim).
>
> Hmm, if I leave out the -p option, it works correctly,
> and the diff output is utf8-encoded.
>
> Admittedly, I don't need the -p option for gettext .po
> files, but I always use the shell alias d='diff -durpN'
> and it usually does no harm with non-C files.
>
> Using diff -p on two gettext .po files does produce
> valid utf8-encoded output, but diff -rp on two
> directories containing gettext .po files doesn't.
>
> I tried again with a single language's translation in
> both directories, and it produced correct utf8. The
> original directories had 47 language directories each.
> So it doesn't always happen. I don't know how many
> files it takes for the problem to occur.
>
> In case it's helpful, I've put a temporary copy of the
> two locale directories and the resulting diff output at
> raf.org/tmp/diffutils-rp-utf8.tar.gz (460K).
>
> cheers,
> raf

The attempt to do -p on the directories in that tarball
results in "illegal bytes" in locale.diff in the following
lines:

  257: @@ -450,9 +471,11 @@ msgstr "пътят е твърде дъ�
  443: @@ -718,9 +737,25 @@ msgstr "কমান্ডটি খু�
  1271: @@ -450,9 +471,11 @@ msgstr "η διαδρομή είναι �
  1304: @@ -718,9 +738,25 @@ msgstr "η εντολή είναι πο�
  2443: @@ -505,7 +528,7 @@ msgstr "अमान्य -f विक�
  6444: @@ -718,9 +737,25 @@ msgstr "கட்டளை மிகப�
  7548: @@ -152,11 +169,11 @@ msgstr " -i - 包含 inode �
  7562: @@ -244,7 +261,7 @@ msgstr " @ 符号链�
  7660: @@ -528,7 +551,7 @@ msgstr "最终条件表达式后应为�
  7669: @@ -581,7 +604,7 @@ msgstr "函数参数列表中应为“)�
  7696: @@ -617,7 +640,7 @@ msgstr "预期纳秒数,实际测量�

Maybe the -p code isn't careful enough to output whole
code points and is truncating? Just guessing.

cheers,
raf
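
To illustrate the guess above, here is a minimal Python sketch of what
byte-count truncation does to UTF-8 text, next to a boundary-aware
alternative. It is not diff's actual code. The 40-byte limit, the
function names and the sample string are all assumptions made up for
illustration; the sample is not taken from the tarball.

```python
def truncate_bytes(line: bytes, limit: int = 40) -> bytes:
    """Cut at a fixed byte count, ignoring character boundaries."""
    return line[:limit]


def truncate_utf8(line: bytes, limit: int = 40) -> bytes:
    """Cut at or before `limit` bytes without splitting a UTF-8 sequence."""
    if len(line) <= limit:
        return line
    end = limit
    # Continuation bytes look like 0b10xxxxxx. If the first dropped byte is
    # one, the cut falls inside a character, so back up to its lead byte.
    while end > 0 and (line[end] & 0xC0) == 0x80:
        end -= 1
    return line[:end]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample line: 9 ASCII bytes followed by two-byte Cyrillic
# letters, so the assumed 40-byte limit lands in the middle of a character.
sample = ('msgstr "a' + "я" * 20 + '"').encode("utf-8")

print(is_valid_utf8(truncate_bytes(sample)))  # False
print(is_valid_utf8(truncate_utf8(sample)))   # True
```

Whether a cut lands mid-character depends on how many bytes come before
it, which would fit some msgstr lines coming out broken while others
survive.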
