On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
> Hello,
>
> The Subversion console client tries to detect binary files with this
> algorithm:
>
>  1. A file is NOT BINARY if it contains only the UTF-8 BOM signature
>     (why not check whether the first N bytes are valid UTF-8?);
>  2. A file is BINARY if its first 1024 bytes contain a ZERO byte (with
>     a uniform distribution of bytes, the chance that no ZERO byte
>     appears is (1 - 1 / 256) ^ 1024 = ~1.8%);
>  3. A file is BINARY if over 85% of the characters in its first 1024
>     bytes are outside the ranges 0x07-0x0D and 0x20-0x7F (in total
>     there are 153 "binary" byte values, ~60%).
>
> This algorithm looks broken.
>
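For reference, the three steps as described above can be sketched in Python roughly as follows. This is only a reading of the quoted description, not the actual Subversion code: the names `looks_binary`, `is_text_byte`, `UTF8_BOM` and `SAMPLE_SIZE` are made up for illustration, and treating an empty file as text is an assumption.

```python
# Minimal sketch of the heuristic as described in the quoted mail.
# NOT Subversion's implementation; all names here are hypothetical.

UTF8_BOM = b"\xef\xbb\xbf"
SAMPLE_SIZE = 1024  # only the first 1024 bytes are inspected


def is_text_byte(b: int) -> bool:
    """True for byte values in the ranges 0x07-0x0D and 0x20-0x7F."""
    return 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F


def looks_binary(data: bytes) -> bool:
    sample = data[:SAMPLE_SIZE]
    if not sample:
        return False  # assumption: an empty file counts as text
    # Step 1: a file consisting only of the UTF-8 BOM is not binary.
    if sample == UTF8_BOM:
        return False
    # Step 2: any zero byte in the sample marks the file as binary.
    if 0 in sample:
        return True
    # Step 3: binary if over 85% of the sampled bytes are "binary" bytes.
    binary_count = sum(1 for b in sample if not is_text_byte(b))
    return binary_count * 100 > 85 * len(sample)


# The two figures from the quoted mail can be checked the same way.
p_no_zero = (1 - 1 / 256) ** SAMPLE_SIZE  # roughly 0.018
binary_values = sum(1 for b in range(256) if not is_text_byte(b))  # 153
```

Under this reading, a short UTF-8 text made up mostly of non-ASCII characters, for example `"привет мир".encode("utf-8")`, is classified as binary by step 3, because every byte of a multi-byte sequence is above 0x7F.
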
Can you suggest a better algorithm?