bug#30326: grep not searching through a text file (thinking it binary)

L A Walsh Fri, 02 Feb 2018 12:10:32 -0800

Grep was around long before POSIX, as were most of the unix
utils.


Grep was able to find text strings in mboxes without a POSIX
definition telling it that it was "broken".

I don't want it displaying random binary that throws my
terminal into weird modes, which is why I skip binary
files. To have grep searching through some mailboxes
while skipping others, randomly based on what email
happens to be in the box at the time, is hardly a useful
utility.

I did not ask for POSIXLY_CORRECT -- if you need to have it be
POSIXLY Correct, then use the existing var, but grep is now
broken -- since POSIX doesn't define "text" files "out in the real
world", but only for files that adhere to the POSIX standard.

People don't write emails that adhere to the POSIX standard.

Also, FWIW, grep's manpage doesn't say it is limited to POSIX-only
files.  Its summary says:
      grep, egrep, fgrep - print lines matching a pattern

which it does not do.  It doesn't say "print lines matching
a pattern only from POSIX text files".



Eric Blake wrote:

tag 30326 notabug
thanks

On 02/02/2018 01:30 PM, L. A. Walsh wrote:

I've used grep to search through my mbox-format emails for decades, but
I've run into a case where it seems to be ignoring a text mailbox
because, I guess, it thinks it is "binary"


Yes, that's correct.

If I used "-Par" it finds it.


Yes, that's also correct.

It seems that grep believes the file to be binary and ignores it, though
"file" calls it "text".


The file is conditionally text.  The POSIX definition of a text file is
one whose lines consist of valid characters in the current locale - but
note this definition is locale-dependent!  So a file that is text under
one locale may be binary under another.  When you are grepping a file
encoded correctly for the current locale, you get the output you want;
when you are grepping a file that contains encoding errors for the
current locale, POSIX says behavior is undefined, so GNU grep warns you
that the file is binary (in the current locale); and your use of -a
tells grep to process it anyways.  As 'file' reported that your file was
using non-ISO extended-ASCII, it probably means the file was encoded for
an 8-bit single-byte locale; and my guess is that you were running grep
under a UTF-8 locale, and generally, UTF-8 treats 8-bit single-byte
inputs as encoding errors.  Hence the warning that your file is binary,
under the current locale.
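
To make the locale dependence concrete, here is a minimal Python sketch. It is
not how GNU grep is implemented; it only approximates "valid characters in the
current locale" by asking whether the bytes decode cleanly in a given encoding.
The function name decodes_cleanly and the sample line are hypothetical, and
"latin-1" merely stands in for some 8-bit single-byte charset (the actual
charset of the mailbox is not known from this thread).

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data contains no encoding errors for `encoding`.

    Rough stand-in for "this is a text file in the current locale";
    GNU grep's real check is more involved.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical mail line with an e-acute stored as the single byte 0xE9,
# the way an 8-bit single-byte charset would store it.
sample = b"caf\xe9 latte\n"

print(decodes_cleanly(sample, "utf-8"))    # False: 0xE9 starts a multibyte sequence that never completes
print(decodes_cleanly(sample, "latin-1"))  # True: every byte maps to a character
```

The same bytes count as text under one encoding and as an encoding error under
the other, which is the situation described above.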

You can also use 'LC_ALL=C grep' to force a locale where EVERY byte is a
valid character, and thus where you will never encounter encoding errors
(you may encounter OTHER things that make your file binary, such as
embedded NULs, but that's a different matter).

This behavior is documented and intentional, so I'm closing this as not
a bug in the tracker.  However, feel free to add further comments or
questions to the thread.

And perhaps we could tweak the grep diagnostics to clarify whether a
file is binary because NUL bytes were encountered, vs. a file is binary
because encoding errors were encountered.
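
A rough Python sketch of what such a distinction could look like follows. It
is only an illustration under assumptions: binary_reason and its return
strings are hypothetical, it is not grep's code or its actual diagnostic
text, and "latin-1" is used as a stand-in for a locale such as C where every
byte is a valid character.

```python
from typing import Optional


def binary_reason(data: bytes, encoding: str) -> Optional[str]:
    """Say why data would be treated as binary, or return None for text.

    Hypothetical helper: the two checks mirror the two causes named in
    the message (NUL bytes vs. encoding errors), nothing more.
    """
    if b"\x00" in data:
        return "NUL byte"
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return "encoding error"
    return None


print(binary_reason(b"caf\xe9\n", "utf-8"))    # encoding error
print(binary_reason(b"caf\xe9\n", "latin-1"))  # None
print(binary_reason(b"a\x00b\n", "latin-1"))   # NUL byte
```

Note that the NUL case is reported even when every byte is a valid character,
matching the remark that forcing the C locale removes encoding errors but not
the other causes.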
