Hello,

03.10.2014 15:35, Stefan Sperling wrote:
On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
    Hello,



    The Subversion console client tries to detect binary files with this
    algorithm:

     1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why
        not check whether the first N bytes are correct UTF-8?);
     2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a
        uniform distribution of bytes, the chance that no ZERO byte is
        present is (1 - 1 / 256) ^ 1024 = ~1.8%);
     3. File is BINARY if over 85% of the first 1024 bytes are characters
        outside the ranges 0x07-0x0D and 0x20-0x7F (in total there are 153
        "binary" byte values, ~60% of all values).

    This algorithm looks broken.

Can you suggest a better algorithm?
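
For reference, the three checks quoted above can be sketched roughly as
follows. This only illustrates the description given in this thread; it is
not Subversion's actual source. The names looks_binary_svn, is_text_byte
and BLOCK_SIZE are made up for the example, and treating an empty file as
text is an assumption.

```python
# Sketch of the heuristic as described in this thread (hypothetical names,
# not the real Subversion code).
UTF8_BOM = b"\xef\xbb\xbf"
BLOCK_SIZE = 1024  # number of leading bytes that are examined


def is_text_byte(b: int) -> bool:
    """Bytes counted as "text": 0x07-0x0D and 0x20-0x7F."""
    return 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F


def looks_binary_svn(data: bytes) -> bool:
    block = data[:BLOCK_SIZE]
    if not block:
        return False  # assumption: an empty file is treated as text
    if block == UTF8_BOM:
        return False  # rule 1: the file holds only a UTF-8 BOM
    if 0 in block:
        return True  # rule 2: a ZERO byte in the first block
    # Rule 3: more than 85% of the block lies outside the text ranges.
    binary_count = sum(1 for b in block if not is_text_byte(b))
    return binary_count * 1000 // len(block) > 850


# Chance that 1024 uniformly random bytes contain no ZERO byte (rule 2).
p_no_zero = (1 - 1 / 256) ** 1024
```

Under this sketch, a short UTF-8 string of Cyrillic letters is classified
as binary, because every byte in 0x80-0xFF counts as a "binary" byte.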
About false positives:

1. If a text file is detected as binary:
     * with "svn:auto-props = '*.txt = svn:eol-style=native'" the svn
       client blocks adding the file: svn:eol-style and
       svn:mime-type=application/octet-stream can't be set
       simultaneously.
       There is a workaround:
         o create an empty file;
         o run svn add for the empty file;
         o replace the empty file with the real data;
         o commit.
     * you can't diff or merge the file (Cannot display: file marked
       as a binary type.).
       You can't fix this, because you can't remove the svn:mime-type
       property in the last modified revision.
2. If a binary file is detected as text:
     * svn diff and merge display unusable output.
       You can fix this in the current revision by setting the
       svn:mime-type property.

I think the false positive where a text file is detected as binary is
the more annoying one.


About file type detection:

1. The file detection algorithm must be as simple as possible.
2. If the first N bytes contain a ZERO byte, the file is binary.
3. If the file is valid UTF-8, the file is text.
4. If the file contains too many binary characters, the file is binary.
   I think the definitely binary characters are: 0x00-0x08, 0x0B, 0x0C,
   0x0E-0x1F, 0x7F (30 characters, ~11.7%).
   These characters are very rarely used in text files. Characters from
   the range 0x80-0xFF can be letters in some encodings.
   The comparison threshold should be significantly lower than the
   share these characters would have in uniformly distributed data.
   For example, if about 2.5% of the first N bytes are "binary"
   characters, the file is binary.
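
A minimal sketch of steps 2-4 in the order listed above. The names
is_binary, is_valid_utf8, PROBE_SIZE and THRESHOLD are hypothetical, N =
1024 is an assumption, and 2.5% is taken from the example. The UTF-8
check uses an incremental decoder so that a multi-byte character cut off
at the end of the probed block is not treated as invalid.

```python
import codecs

# "Definitely binary" bytes: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F.
BINARY_BYTES = (
    frozenset(range(0x00, 0x09))
    | frozenset(range(0x0E, 0x20))
    | {0x0B, 0x0C, 0x7F}
)
PROBE_SIZE = 1024  # N: assumed number of leading bytes to inspect
THRESHOLD = 0.025  # share of binary bytes that marks a file as binary


def is_valid_utf8(block: bytes, complete: bool) -> bool:
    """Check the block; tolerate a truncated tail unless it is complete."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(block, final=complete)
    except UnicodeDecodeError:
        return False
    return True


def is_binary(data: bytes) -> bool:
    block = data[:PROBE_SIZE]
    if not block:
        return False  # assumption: an empty file is treated as text
    if 0 in block:
        return True  # step 2: a ZERO byte means binary
    if is_valid_utf8(block, complete=len(data) <= PROBE_SIZE):
        return False  # step 3: valid UTF-8 means text
    # Step 4: too many definitely-binary bytes means binary.
    binary_count = sum(1 for b in block if b in BINARY_BYTES)
    return binary_count / len(block) >= THRESHOLD
```

With this ordering, Latin-1 text such as b"caf\xe9 au lait\n" fails the
UTF-8 check but still comes out as text, because it contains none of the
listed control bytes.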


Overall, the following implementations look workable to me:

1. Git autodetection: if the first 8000 bytes contain a ZERO byte, the
   file is binary.
   + As simple as possible;
   + Can't detect text files as binary;
   - Can detect some binary files as text;
2. Byte range autodetection: if the first N bytes contain a byte from
   the range 0x00-0x08 or 0x0E-0x1F, the file is binary.
   + Still simple;
   - Can detect some short binary files as text;
3. Byte share autodetection: if about P% of the first N bytes are from
   the set 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F, the file is binary.
   - Not so simple;
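
To compare the three options side by side, here is a rough sketch. The
function names, the default N = 1024 and the 2.5% threshold are
assumptions made for illustration; only the 8000-byte window comes from
the Git description above.

```python
GIT_PROBE = 8000  # window size from the Git option above

# Option 2 byte set: 0x00-0x08 and 0x0E-0x1F.
CONTROL_STRICT = frozenset(range(0x00, 0x09)) | frozenset(range(0x0E, 0x20))
# Option 3 byte set: option 2 plus 0x0B, 0x0C and 0x7F.
CONTROL_WIDE = CONTROL_STRICT | {0x0B, 0x0C, 0x7F}


def binary_by_zero(data: bytes, probe: int = GIT_PROBE) -> bool:
    """Option 1: any ZERO byte in the probed window."""
    return 0 in data[:probe]


def binary_by_range(data: bytes, probe: int = 1024) -> bool:
    """Option 2: any single byte from the strict control range."""
    return any(b in CONTROL_STRICT for b in data[:probe])


def binary_by_share(data: bytes, probe: int = 1024,
                    threshold: float = 0.025) -> bool:
    """Option 3: the share of control bytes reaches the threshold."""
    block = data[:probe]
    if not block:
        return False  # assumption: an empty file is treated as text
    hits = sum(1 for b in block if b in CONTROL_WIDE)
    return hits / len(block) >= threshold


# A text line with one ANSI escape byte (0x1B) separates options 2 and 3.
sample = b"\x1b[31m" + b"x" * 200
results = (binary_by_zero(sample), binary_by_range(sample),
           binary_by_share(sample))
```

For this sample, option 2 reports binary because of the single 0x1B byte,
while options 1 and 3 report text.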



Best regards,
Navrotskiy Artem.
