On Wed, Apr 6, 2016 at 9:18 AM, Sebastian Bergmann <sebast...@php.net>
wrote:

> On 05.04.2016 at 11:05, Derick Rethans wrote:
> > I would advise against adding this.
> >
> > As you say, it doesn't work properly. As a matter of fact, guessing
> > charsets, like timezones, is not possible. You need to know which
> > charset something is in. If not, you need to address *that* problem.
>
>  Agreed.


The problem is, developers are going to write code to guess character sets.

Ironically, PHPUnit attempts to detect UTF-8
<https://github.com/sebastianbergmann/phpunit/blob/master/src/Util/String.php#L38-L70>.
There is also no shortage of SO posts explaining other approaches. My
favorite is using a preg_match trick
<http://stackoverflow.com/a/4407996/2908724>.
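
For illustration, a minimal sketch of what such a home-grown check tends to
look like. This is not the code from PHPUnit or from the linked answer, and
looksLikeUtf8() is a hypothetical name. It relies only on the documented
behaviour that preg_match() returns false when the /u modifier is used on a
subject that is not valid UTF-8:

```php
<?php
// Hypothetical helper for illustration only: reports whether $bytes is
// well-formed UTF-8. It validates one encoding; it does not detect a charset.
function looksLikeUtf8(string $bytes): bool
{
    // With the /u modifier, PCRE refuses a subject that is not valid UTF-8,
    // so preg_match() returns false instead of a match count. The empty
    // pattern matches any valid subject, so a valid string yields 1.
    return preg_match('//u', $bytes) === 1;
}

var_dump(looksLikeUtf8("caf\xC3\xA9")); // UTF-8 bytes for "café"
var_dump(looksLikeUtf8("caf\xE9"));     // ISO-8859-1 bytes for "café"
var_dump(looksLikeUtf8("cafe"));        // plain ASCII, valid in many charsets
```

Note the limitation: the last call passes even though the same bytes are
equally valid ISO-8859-1 or Windows-1252, so a yes/no check like this says
nothing about how likely the answer is.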

I'd rather we include the patch for a few reasons:

1. so that there's a modern "standard" method of doing so, and that
"standard" method has plenty of documentation that points people to the
limitations.
2. to completely expose the underlying ICU, rather than arbitrarily
deciding one part isn't good for developers to use.
3. to provide an alternative to mb_detect_encoding.

While I can't say whether this will cause more user confusion, I do
believe it adds value: ICU provides a confidence metric, which no other
built-in or buildable solution (to my knowledge) provides.
