On 4/11/2016 6:36 PM, Stanislav Malyshev wrote:
> Hi!
> 
>>> As you say, it doesn't work properly. As a matter of fact, guessing 
>>> charsets, like timezones, is not possible. You need to know which 
>>> charset something is in. If not, you need to address *that* problem.
> 
> It is true that you can not detect charsets with 100% accuracy. It is,
> however, also true that many charsets can be distinguished with enough
> accuracy to make it useful, especially if you know the set of charsets
> you are dealing with. E.g., Russian had about 5 commonly used encodings
> before everybody started to use UTF-8, and several exotic ones. Being
> able to detect at least the major ones while dealing with a
> heterogeneous library of Russian-language texts is a great help. There
> may be other cases like this.
> 
> The point is that even imperfect detection may be useful in certain
> circumstances, and the detector being part of ICU hints that people find
> it useful enough to spend time implementing and supporting it. We should
> not ignore that.
> 

I agree with Stanislav completely here. Sebastian Bergmann has a quirky
userland detection in his own library, and I am sure there are many
others who have one too. Providing a single quirky implementation in
core at least allows us to improve it over time, and userland benefits
at the same time (although I doubt that this kind of detection can be
improved to the point where it really works).

On 4/11/2016 4:51 PM, Bishop Bettini wrote:
> What about forcing the consumer to stipulate minimal acceptable
> confidence?
> The API would internally filter any matches with confidence strictly lower
> than the given value. Along the lines of:
>
> ucsdet_detect(IntlCharsetDetector $det, int $minimum_confidence): array
> ucsdet_detect_all(IntlCharsetDetector $det, int $minimum_confidence): array
>
> So the relatively reliable UTF-8 test
> <https://tools.ietf.org/html/rfc3629#section-4> could be written:
>
> if ('UTF-8' === $detector->detect(100)) {
>     // ...
> }
>
> This exposes the heuristics available in ICU and leaves the API flexible,
> while forcing the consumer to consider the fact that this is statistical
> reasoning, not decision.
>

This is actually not a bad way to create awareness. It is at least
better than only documenting the caveat, which probably only good devs
would read (and understand).
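
To make the filtering idea concrete, here is a minimal userland sketch.
It assumes a detector that returns its candidates as arrays with
'charset' and 'confidence' (0 to 100) keys. The function name
filterByConfidence(), that array shape, and the sample values are all
hypothetical; none of this is ICU's or ext/intl's actual API.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: keep only the matches whose confidence is at
 * least $minimumConfidence, best guess first.
 *
 * Assumed match shape: ['charset' => string, 'confidence' => int 0..100].
 */
function filterByConfidence(array $matches, int $minimumConfidence): array
{
    if ($minimumConfidence < 0 || $minimumConfidence > 100) {
        throw new InvalidArgumentException('Confidence must be between 0 and 100.');
    }

    // Drop every match strictly below the requested confidence.
    $accepted = array_filter(
        $matches,
        static fn (array $match): bool => $match['confidence'] >= $minimumConfidence
    );

    // Sort descending by confidence; usort() also reindexes from 0.
    usort(
        $accepted,
        static fn (array $a, array $b): int => $b['confidence'] <=> $a['confidence']
    );

    return $accepted;
}

// Made-up sample data standing in for a detector's output.
$matches = [
    ['charset' => 'windows-1251', 'confidence' => 42],
    ['charset' => 'UTF-8',        'confidence' => 100],
    ['charset' => 'KOI8-R',       'confidence' => 17],
];

$confident = filterByConfidence($matches, 40);
```

Because the threshold is a required argument, the caller cannot obtain a
guess without first deciding how much uncertainty is acceptable.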

-- 
Richard "Fleshgrinder" Fussenegger
