tag 30326 notabug
thanks

On 02/02/2018 01:30 PM, L. A. Walsh wrote:
> I've used grep to search through my mbox-format emails for decades, but
> I've run into a case where it seems to be ignore a text mailbox
> because, I guess, it thinks it is "binary"

Yes, that's correct.

> If I used "-Par" it finds it.

Yes, that's also correct.

>
> It seems that grep believes the file to binary and ignores it, though
> "file" calls it "text".

The file is conditionally text. The POSIX definition of a text file is one whose lines consist of valid characters in the current locale - but note this definition is locale-dependent! So a file that is text under one locale may be binary under another.

When you are grepping a file encoded correctly for the current locale, you get the output you want; when you are grepping a file that contains encoding errors for the current locale, POSIX says behavior is undefined, so GNU grep warns you that the file is binary (in the current locale); and your use of -a tells grep to process it anyway.

As 'file' reported that your file was using non-ISO extended-ASCII, it probably means the file was encoded for an 8-bit single-byte locale; and my guess is that you were running grep under a UTF-8 locale. In UTF-8, high-bit bytes from a single-byte encoding generally do not form valid multibyte sequences, so they are treated as encoding errors. Hence the warning that your file is binary, under the current locale.

You can also use 'LC_ALL=C grep' to force a locale where EVERY byte is a valid character, and thus where you will never encounter encoding errors (you may encounter OTHER things that make your file binary, such as embedded NULs, but that's a different matter).

This behavior is documented and intentional, so I'm closing this as not a bug in the tracker. However, feel free to add further comments or questions to the thread. And perhaps we could tweak the grep diagnostics to clarify whether a file is binary because NUL bytes were encountered, vs. a file is binary because encoding errors were encountered.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
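
To illustrate the locale dependence described above, here is a minimal Python sketch. It is not how grep is implemented; the function name classify() is made up for this example, and Python's "latin-1" codec is used as a stand-in for an 8-bit single-byte locale in which every byte is a valid character. The same bytes come out as text under one encoding and binary under another, and an embedded NUL is reported separately from an encoding error.

```python
def classify(data: bytes, encoding: str) -> str:
    """Hypothetical helper: label data as text or binary under an encoding.

    Assumption for illustration only: "binary" means either an embedded
    NUL byte or a byte sequence that is invalid in the given encoding.
    """
    if b"\x00" in data:
        return "binary (NUL byte)"
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return "binary (encoding error)"
    return "text"


# 0xE9 is a complete character in Latin-1, but in UTF-8 it starts a
# multibyte sequence that the following newline cannot continue.
sample = b"caf\xe9\n"

print(classify(sample, "utf-8"))
print(classify(sample, "latin-1"))
print(classify(b"a\x00b\n", "latin-1"))
```

The first call reports an encoding error and the second reports text, which mirrors the difference between running grep under a UTF-8 locale and running it with LC_ALL=C. The third call shows the other cause of a "binary" verdict, an embedded NUL, which no choice of encoding removes.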