Hi Walter,

Walter Alejandro Iglesias wrote on Sun, Jun 01, 2025 at 07:41:48PM +0200:
> On Sat, May 31, 2025 at 10:45:17AM -0000, Stuart Henderson wrote:
>> ggrep does in this instance, but I don't know how reliable that is.

> I had already forgotten about a problem I encountered with GNU grep
> under Linux while writing a shell script to process mbox files long time
> ago.  Some of the messages in my mbox files were iso-latin encoded
> (Spanish,) since my locales were UTF-8, a grep command in a pipe at the
> end of my script printed the message "binary file matches" and removed
> from the output any line containing invalid UTF-8 sequences considering
> them garbage from a binary file.  This is what still happens under Linux
> (\xed is latin-1 iacute):
>
>   $ printf '\xedHello\n' > test
>   $ grep Hello test
>   grep: test: binary file matches
>   $ LANG=C grep Hello test
>   �Hello
>
> I mention this as a practical example of the trade-offs of using
> wide-character functions.

Indeed, I discussed that kind of problem in

  https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf

pages 4, 22, 26, 27, 30, 31, 39 using FreeBSD/NetBSD rev(1), cut(1),
and ul(1) as examples rather than grep(1), but the traps are similar.

In particular, when implementing UTF-8-only multibyte character
support, be very careful about how you handle invalid bytes and
invalid byte sequences.  Usually, it is possible to handle these
gracefully and recover because UTF-8 is self-synchronizing.
But there isn't a silver bullet that always hits the mark; every
single task needs individual consideration.

When you embark on the fool's errand of trying to support arbitrary
character sets, you are guaranteed to lose significant functionality
in the process, and on top of that, the end result is usually
insecure in multiple ways.

Yours,
  Ingo
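
To illustrate the "handle gracefully and recover" idea from the
message above, here is a minimal Python sketch.  It is not taken from
grep(1) or from any of the utilities named in this thread.  The names
utf8_len(), split_utf8() and render() are made up for this example,
and showing bad bytes as \xNN escapes is only one possible policy;
as the message says, every task needs individual consideration.

```python
def utf8_len(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start a character."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte (0x80-0xBF) or a byte that is never valid


def is_valid(chunk: bytes) -> bool:
    """Strict check: also rejects overlong forms, surrogates and bad continuations."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def split_utf8(data: bytes):
    """Yield (ok, chunk) pairs that cover data completely; no byte is dropped."""
    i = 0
    while i < len(data):
        n = utf8_len(data[i])
        chunk = data[i:i + n]
        if n and len(chunk) == n and is_valid(chunk):
            yield True, chunk          # one complete, valid character
            i += n
        else:
            yield False, data[i:i + 1]  # one invalid byte, then resynchronize
            i += 1


def render(data: bytes) -> str:
    """Illustrative policy: keep valid characters, show invalid bytes as \\xNN."""
    return "".join(
        chunk.decode("utf-8") if ok else "\\x%02x" % chunk[0]
        for ok, chunk in split_utf8(data)
    )
```

With the test file from the quoted example, render(b"\xedHello")
returns the eight-character string \xedHello: the stray Latin-1 byte
is reported on its own and the five ASCII characters after it survive.
Skipping only one byte after an error is what lets the loop pick up
the next valid character, which is the self-synchronizing property
in practice.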