Hi Walter,

Walter Alejandro Iglesias wrote on Sun, Jun 01, 2025 at 07:41:48PM +0200:
> On Sat, May 31, 2025 at 10:45:17AM -0000, Stuart Henderson wrote:

>> ggrep does in this instance, but I don't know how reliable that is.

> I had already forgotten about a problem I encountered with GNU grep
> under Linux while writing a shell script to process mbox files long time
> ago.  Some of the messages in my mbox files were iso-latin encoded
> (Spanish,) since my locales were UTF-8, a grep command in a pipe at the
> end of my script printed the message "binary file matches" and removed
> from the output any line containing invalid UTF-8 sequences considering
> them garbage from a binary file.  This is what still happens under Linux
> (\xed is latin-1 iacute):
> 
>   $ printf '\xedHello\n' > test
>   $ grep Hello test
>   grep: test: binary file matches
>   $ LANG=C grep Hello test
>   �Hello
> 
> I mention this as a practical example of the trade-offs of using
> wide-character functions.

Indeed, I discussed that kind of problem in

  https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf

pages 4, 22, 26, 27, 30, 31, 39
using FreeBSD/NetBSD rev(1), cut(1), and ul(1) as examples
rather than grep(1), but the traps are similar.

In particular, when implementing UTF-8-only multibyte character
support, be very careful about how you handle invalid bytes
and invalid byte sequences.  Usually, it is possible to handle
these gracefully and recover because UTF-8 is self-synchronizing.
But there is no silver bullet that always hits the mark;
every single task needs individual consideration.
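
As a rough illustration of such recovery, here is a minimal sketch
in Python.  It is not taken from any of the utilities mentioned
above, and the function name decode_lossless is hypothetical.  It
assumes that the right policy for the task at hand is to keep every
invalid byte visible as a \xNN escape and to resume decoding at the
very next byte; other tasks may well need a different policy.

```python
def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, escaping invalid bytes instead of dropping them.

    Illustrative policy only: each byte that does not start a valid
    UTF-8 sequence is rendered as a literal \\xNN escape, and decoding
    resumes at the following byte.
    """
    out = []
    i = 0
    while i < len(data):
        # A valid UTF-8 character occupies between 1 and 4 bytes.
        for width in (1, 2, 3, 4):
            try:
                out.append(data[i:i + width].decode("utf-8"))
            except UnicodeDecodeError:
                continue
            i += width
            break
        else:
            # No valid character starts here: escape one byte and
            # resynchronize at the next one.
            out.append("\\x%02x" % data[i])
            i += 1
    return "".join(out)


# The Latin-1 encoded line from the quoted example stays usable:
line = decode_lossless(b"\xedHello\n")
print("Hello" in line)
```

For this particular policy, Python's built-in
data.decode("utf-8", errors="backslashreplace") should give the
same result; the explicit loop only serves to show where the
resynchronization happens.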

When you embark on the fool's errand of trying to support arbitrary
character sets, you are guaranteed to lose significant functionality
in the process, and on top of that, the end result is usually
insecure in multiple ways.

Yours,
  Ingo
