On Saturday, 29 September 2018 at 15:52:30 UTC, helxi wrote:
I'm writing a utility that checks for specific keyword(s) found
in the files in a given directory recursively. What's the best
strategy to avoid opening a bin file or some sort of garbage
dump? Check encoding of the given file?
If so, what are the most popular encodings (in POSIX if that
matters) and how do I detect them?
What I would do is read the first 512 bytes and the last 512
bytes. If over 50% of those bytes are below 32 and are not 8, 9,
10, 11, 12 or 13 (backspace, tab, line feed, vertical tab, form
feed, carriage return), then chances are you have a binary file.
This is only a heuristic, though: nothing stops someone from
writing "invalid" bytes into a text file. There are no
limitations on what a file can hold, and generally the system
treats all files the same.
The reason I recommend reading both the first 512 and the last
512 bytes is that some binary files contain legitimate text
strings. By sampling two places, chances are you won't land on
text in both segments.
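
A minimal sketch of that check, written in Python purely for illustration. The names `looks_binary`, `suspicious_ratio`, `SAMPLE_SIZE` and the `threshold` parameter are my own, and pooling the two samples into a single ratio is an assumption about how the 50% rule is meant to be applied.

```python
import os

# Control bytes that commonly appear in text files:
# 8 (BS), 9 (TAB), 10 (LF), 11 (VT), 12 (FF), 13 (CR).
TEXT_CONTROL_BYTES = frozenset(range(8, 14))
SAMPLE_SIZE = 512  # bytes sampled from each end of the file


def suspicious_ratio(sample: bytes) -> float:
    """Return the fraction of bytes that are below 32 and not in 8..13."""
    if not sample:
        return 0.0
    bad = sum(1 for b in sample if b < 32 and b not in TEXT_CONTROL_BYTES)
    return bad / len(sample)


def looks_binary(path: str, threshold: float = 0.5) -> bool:
    """Guess whether a file is binary from its first and last 512 bytes."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(SAMPLE_SIZE)
        if size > 2 * SAMPLE_SIZE:
            # Large file: jump to the final 512 bytes.
            f.seek(-SAMPLE_SIZE, os.SEEK_END)
        # Small file: keep reading from where the head ended,
        # so no byte is counted twice.
        tail = f.read(SAMPLE_SIZE)
    # Assumption: both samples are pooled into one ratio.
    return suspicious_ratio(head + tail) > threshold
```

When walking the directory, you would call `looks_binary(path)` before opening each file for the keyword search and skip the ones that return True. The 50% threshold is just the figure suggested above and may need tuning for your files.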