Re: [PHP-DEV] IntlCharsetDetector

Derick Rethans Tue, 05 Apr 2016 02:06:13 -0700

On Mon, 4 Apr 2016, Sara Golemon wrote:

> The subject of character set detection (yes, I know, a hard problem to
> solve) came up on SO chat, and Niki noticed that we don't yet wrap the
> ICU UCharsetDetector API so I volunteered to put something together.
> 
> https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector
> 
> The trouble is, for the WIDE majority of my test cases so far, ICU is
> really bad at detecting character sets correctly (as I said, it's a
> tough problem).  In fact, the ICU manual admits that it doesn't even
> look at all of the corpus text, and the "language detection" is a
> byproduct not meant for actual language detection.
> 
> Given all that, I'm inclined to reject the idea of rolling this into
> PHP for fear of just confusing users without actually adding any
> value.
> 
> Thoughts?


I would advise against adding this.

As you say, it doesn't work properly. As a matter of fact, guessing 
charsets, like timezones, is not possible. You need to know which 
charset something is in. If not, you need to address *that* problem.

cheers,
Derick

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
