On Saturday, 29 September 2018 at 15:52:30 UTC, helxi wrote:
I'm writing a utility that checks for specific keyword(s) found in the files in a given directory recursively. What's the best strategy to avoid opening a bin file or some sort of garbage dump? Check encoding of the given file?

If so, what are the most popular encodings (in POSIX if that matters) and how do I detect them?

What I would do is read the first 512 bytes and the last 512 bytes, and if over 50% of those bytes are below 32 and are not 8, 9, 10, 11, 12 or 13 (backspace, tab, line feed, vertical tab, form feed, carriage return), then chances are you have a binary file. Nothing stops someone from writing "invalid" bytes into a text file, though: there are no limitations on what a file can hold, and generally the system treats all files the same.

The reason I recommend reading the first 512 and the last 512 bytes is that some binary files contain legitimate text strings, so by sampling two places, chances are you won't hit a text segment both times.
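
The heuristic described above could be sketched roughly like this. It is only an illustration in Python, not a tested utility: the names looks_binary, suspicious_ratio, sample_size and threshold are made up for the example, and the 50% cutoff is applied to the two samples combined, which is one possible reading of the rule.

```python
import os

# Control bytes that commonly appear in text files:
# backspace, tab, line feed, vertical tab, form feed, carriage return.
TEXT_CONTROL_BYTES = frozenset({8, 9, 10, 11, 12, 13})


def suspicious_ratio(chunk: bytes) -> float:
    """Fraction of bytes below 32 that are not common text control bytes."""
    if not chunk:
        return 0.0
    bad = sum(1 for b in chunk if b < 32 and b not in TEXT_CONTROL_BYTES)
    return bad / len(chunk)


def looks_binary(path: str, sample_size: int = 512, threshold: float = 0.5) -> bool:
    """Guess whether a file is binary by sampling its head and its tail.

    Hypothetical helper: sample_size and threshold default to the values
    suggested in the post (512 bytes, 50%).
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(sample_size)
        tail = b""
        if size > sample_size:
            # Start the tail sample no earlier than the end of the head
            # sample, so short files are not counted twice.
            f.seek(max(size - sample_size, sample_size))
            tail = f.read(sample_size)
    return suspicious_ratio(head + tail) > threshold
```

A directory walker would call looks_binary(path) on each file and skip the ones that return True. Bytes of 128 and above are not counted as suspicious here, so a file made mostly of such bytes would pass as text under this rule.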
