Re: Binary recognition is to narrow [new suggestion]

Shigio YAMAGUCHI Fri, 20 Nov 2009 22:31:04 -0800

> Instead of counting characters over 127, the only test is that the first
> 511 bytes don't contain any of the control characters 0-8, 14-31. No
> normal text file would contain these.
> 
> Assuming that binary data is random, the probability of an incorrectly
> tagged binary would be
> 
> ((256-8-18)/256)^511=.00000000000000000000000170726
> 
> Testing just 127 bytes would be a bit too little:
> 
> ((256-8-18)/256)^127=.00000123868


This is a very interesting idea.
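
To make the proposal concrete, here is a minimal sketch of the test and of the
probability estimate. It is only an illustration, not the gtags implementation:
the names CONTROL_BYTES, looks_binary and false_text_probability are made up
for this example. Note that the range 0-8 covers 9 byte values and 14-31
covers 18, so 27 values are rejected in total and the estimate comes out
slightly smaller than the quoted figures.

```python
# Bytes assumed never to occur in text: 0-8 and 14-31 (27 values in total).
# Tab, LF, VT, FF and CR (9-13) remain allowed.
CONTROL_BYTES = frozenset(range(0, 9)) | frozenset(range(14, 32))


def looks_binary(data: bytes, size: int = 512) -> bool:
    """Return True if any of the first `size` bytes is a rejected control byte."""
    return any(b in CONTROL_BYTES for b in data[:size])


def false_text_probability(size: int) -> float:
    """Chance that `size` uniformly random bytes contain no rejected byte."""
    allowed = 256 - len(CONTROL_BYTES)
    return (allowed / 256) ** size
```

For example, looks_binary(b"\x7fELF\x01\x01\x00") is true because of the
0x01 and 0x00 bytes, while ordinary source text encoded as UTF-8 passes.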

> One of the benefits is that this will correctly tag files in Unicode as
> text as well, since those control characters never appear in Unicode
> either.

This is a big advantage.
Most other multi-byte character sets are surely designed the same way.

I would like to make the 512 a customizable variable too.

$ gtags                         ... use conventional test

[File gtags.conf]
+----------------------------
|...
|       :binarytest_size=512:...  ----------------------------------+
|                                                                   |
                                                                    v
$ gtags                         ... use new test using the first n=512 bytes
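
As a rough sketch of how the switch could work, the function below looks for
the proposed variable in a colon-separated configuration entry. This is an
assumption for illustration only: binarytest_size is the variable proposed in
this message, not an existing gtags option, and the function is hypothetical.

```python
import re
from typing import Optional


def binarytest_size(conf_entry: str) -> Optional[int]:
    """Return the proposed binarytest_size value, or None if it is not set.

    None would mean "use the conventional test".
    """
    match = re.search(r":binarytest_size=(\d+):", conf_entry)
    return int(match.group(1)) if match else None
```

With an entry such as ":other=1:binarytest_size=512:" this returns 512, and
without the variable it returns None.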

After testing for a while, we can decide what we should do.
Thank you for your helpful suggestion.
--
Shigio YAMAGUCHI <[email protected]>
PGP fingerprint: D1CB 0B89 B346 4AB6 5663  C4B6 3CA5 BBB3 57BE DDA3


_______________________________________________
Bug-global mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/bug-global
