Re: [PHP-DEV] IntlCharsetDetector

2016-04-14 Thread Tom Worster

On 4/11/16 6:11 PM, Sara Golemon wrote:
> On Mon, Apr 11, 2016 at 9:36 AM, Stanislav Malyshev wrote:
>> The point is even imperfect detection may be useful in certain
>> circumstances, and detector being part of ICU hints that people find it
>> useful enough to spend time implementing and supporting it. We should
>> not ignore that.
>
> Well, Stas, your informal thumbs up to the idea means enough to me to
> at least formalize it into an RFC even though I was previously feeling
> negative on it.
>
> I may yet vote no on my own RFC after the discussion period, but as
> you say it's worth considering the fact that someone thought it
> reasonable enough to actually build into ICU...


The general problem is impossible. If you constrain the question, for 
example as Stas says by knowing the language and choosing between a 
given set of codes, then you may have success. And I'm sure I'm not 
alone in sometimes using a simple heuristic to choose between cp1252 and 
utf8.
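
A minimal sketch of such a constrained heuristic, assuming the input can only be UTF-8 or cp1252. The helper name utf8OrCp1252() is hypothetical, and the sketch relies on PCRE's /u validation and on iconv() knowing the name 'CP1252'; it validates and converts, it does not detect anything in general:

```php
<?php
/**
 * Hypothetical helper: return $bytes as UTF-8, assuming the input can only
 * be UTF-8 or Windows-1252 (cp1252). Returns null if neither fits.
 */
function utf8OrCp1252(string $bytes): ?string
{
    // With the /u modifier PCRE validates the subject as UTF-8 first;
    // preg_match() returns 1 for well-formed input and false otherwise.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }

    // Not valid UTF-8, so fall back to the only other candidate.
    // iconv() returns false for bytes the converter rejects.
    $converted = @iconv('CP1252', 'UTF-8', $bytes);

    return $converted === false ? null : $converted;
}
```

The order of the checks carries the heuristic: well-formed UTF-8 is taken at face value, even though some cp1252 byte sequences also happen to be well-formed UTF-8. That is acceptable only because the candidate set is this small.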


But this does not logically imply that ICU CharsetDetector is a suitable 
solution in such cases or that it's a good API or a decent 
implementation. Or that PHP should expose it. An SO chat doesn't 
necessarily count as a feature request.


I'd rather people engineered real solutions specific to their 
requirements than resort to any of the failed attempts to solve the 
general problem.


Tom


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] IntlCharsetDetector

2016-04-11 Thread Sara Golemon
On Mon, Apr 11, 2016 at 9:36 AM, Stanislav Malyshev wrote:
> The point is even imperfect detection may be useful in certain
> circumstances, and detector being part of ICU hints that people find it
> useful enough to spend time implementing and supporting it. We should
> not ignore that.
>
Well, Stas, your informal thumbs up to the idea means enough to me to
at least formalize it into an RFC even though I was previously feeling
negative on it.

I may yet vote no on my own RFC after the discussion period, but as
you say it's worth considering the fact that someone thought it
reasonable enough to actually build into ICU...

-Sara




Re: [PHP-DEV] IntlCharsetDetector

2016-04-11 Thread Fleshgrinder
On 4/11/2016 6:36 PM, Stanislav Malyshev wrote:
> Hi!
> 
>>> As you say, it doesn't work properly. As a matter of fact, guessing 
>>> charsets, like timezones, is not possible. You need to know which 
>>> charset something is in. If not, you need to address *that* problem.
> 
> It is true that you can not detect charsets with 100% accuracy. It is,
> however, also true that many charsets can be distinguished with enough
> accuracy to make it useful, especially if you know the set of charsets
> you are dealing with. E.g., Russian had about 5 commonly used encodings
> before everybody started to use UTF-8, and several exotic ones. Being
> able to detect at least the major ones while dealing with a
> heterogeneous library of Russian-language texts is a great help. There
> may be other cases like this.
> 
> The point is even imperfect detection may be useful in certain
> circumstances, and detector being part of ICU hints that people find it
> useful enough to spend time implementing and supporting it. We should
> not ignore that.
> 

I need to agree with Stanislav here completely. Sebastian Bergmann has a
quirky userland detection in his own library and I am sure there are
millions of others who have one. Providing one quirky implementation in
the core at least allows us to improve it over time, and userland
improves at the same time (although I doubt that it is possible to
improve this kind of detection to a point where it really works).

On 4/11/2016 4:51 PM, Bishop Bettini wrote:
> What about forcing the consumer to stipulate minimal acceptable confidence?
> The API would internally filter any matches with confidence strictly lower
> than the given value. Along the lines of:
>
> ucsdet_detect(IntlCharsetDetector $det, int $minimum_confidence): array
> ucsdet_detect_all(IntlCharsetDetector $det, int $minimum_confidence): array
>
> So the relatively reliable UTF-8 test could be written:
>
> if ('UTF-8' === $detector->detect(100)) {
>     // ...
> }
>
> This exposes the heuristics available in ICU and leaves the API flexible,
> while forcing the consumer to consider the fact that this is statistical
> reasoning, not decision.
>

This is actually not such a bad idea to create awareness. At least it is
better than only documenting it, which probably only good devs read (and
understand).

-- 
Richard "Fleshgrinder" Fussenegger





Re: [PHP-DEV] IntlCharsetDetector

2016-04-11 Thread Stanislav Malyshev
Hi!

>> As you say, it doesn't work properly. As a matter of fact, guessing 
>> charsets, like timezones, is not possible. You need to know which 
>> charset something is in. If not, you need to address *that* problem.

It is true that you can not detect charsets with 100% accuracy. It is,
however, also true that many charsets can be distinguished with enough
accuracy to make it useful, especially if you know the set of charsets
you are dealing with. E.g., Russian had about 5 commonly used encodings
before everybody started to use UTF-8, and several exotic ones. Being
able to detect at least the major ones while dealing with a
heterogeneous library of Russian-language texts is a great help. There
may be other cases like this.
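
As a hedged illustration of detection within a known set (not a method described in this thread): Windows-1251 and KOI8-R place uppercase and lowercase Cyrillic in opposite halves of the 0xC0-0xFF range, so decoding the same bytes both ways and counting lowercase letters usually separates the two for ordinary prose. The function name guessCyrillicEncoding() and the candidate list are assumptions, as is iconv() support for the names 'CP1251' and 'KOI8-R':

```php
<?php
/**
 * Hypothetical sketch: choose between legacy Cyrillic encodings by decoding
 * the bytes each way and counting lowercase Cyrillic letters. Assumes the
 * text is Russian prose known to be in one of $candidates.
 */
function guessCyrillicEncoding(string $bytes, array $candidates = ['CP1251', 'KOI8-R']): ?string
{
    $best = null;
    $bestScore = 0;

    foreach ($candidates as $encoding) {
        // Skip candidates in which the bytes cannot be decoded at all.
        $utf8 = @iconv($encoding, 'UTF-8', $bytes);
        if ($utf8 === false) {
            continue;
        }

        // Count U+0430..U+044F, the basic lowercase Russian letters.
        $score = (int) preg_match_all('/[\x{0430}-\x{044F}]/u', $utf8);

        if ($score > $bestScore) {
            $best = $encoding;
            $bestScore = $score;
        }
    }

    // null means no candidate produced any lowercase Cyrillic text.
    return $best;
}
```

This only works because the candidate set and the language are fixed in advance. Text in all capitals, or in an encoding outside the list, will be misjudged.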

The point is even imperfect detection may be useful in certain
circumstances, and detector being part of ICU hints that people find it
useful enough to spend time implementing and supporting it. We should
not ignore that.

-- 
Stas Malyshev
smalys...@gmail.com




Re: [PHP-DEV] IntlCharsetDetector

2016-04-11 Thread Bishop Bettini
On Fri, Apr 8, 2016 at 2:20 PM, Sara Golemon wrote:

> On Thu, Apr 7, 2016 at 9:36 AM, Bishop Bettini wrote:
> > The problem is, developers are going to write code to guess character sets.
> >
> True.  But they're going to put more faith in something in the
> standard distribution, assuming it's passed muster.
>
> > Ironically, PHPUnit attempts to detect UTF-8
> >
> Awkward
>
> > I'd rather we include the patch for a few reasons:
> >
> > 1. so that there's a modern "standard" method of doing so, and that
> > "standard" method has plenty of documentation that points people to the
> > limitations.
> >
> In that spirit, how about we put in some stub documentation under the
> intl extension with a paragraph or two on why UCharsetDetector *isn't*
> wrapped, and why it's such a bad idea to try to solve the problem from
> this end.
>
> > 2. to completely expose the underlying ICU, rather than arbitrarily
> > deciding one part isn't good for developers to use.
> >
> Is it arbitrary though?  The fact that coming up with test cases which
> produce reasonable/expected results is half crap-shoot makes this an
> evidence based decision, not a capricious one.
>
> > 3. to provide an alternative to mb_detect_encoding.
> >
> And again in that spirit, I think this is a good argument for going
> E_DEPRECATED on mb_detect_encoding().  The entire conversation which
> led to prototyping an IntlCharsetDetector extension came from the fact
> that mb_detect_encoding() wasn't doing its job well.  Rather than have
> two supported, bad solutions, I think it'd be better to have one
> deprecated (and thus unsupported) bad solution (which is only kept for
> BC).
>
> > While I can't say if this will or won't cause more user confusion, I do
> > believe this adds value: ICU provides a confidence metric, which no other
> > in-built or buildable solution (to my knowledge) provides.
> >
> The confidence metric is useful, but my spidey sense tells me that
> it'll simply be ignored.
>
> How about a compromise.  I'll reorder this patch to be a standalone
> extension and we PECLize it.  If someone REALLY wants to throw caution
> to the wind, they can, but they're on their own when it gives them
> fugly results.


What about forcing the consumer to stipulate minimal acceptable confidence?
The API would internally filter any matches with confidence strictly lower
than the given value. Along the lines of:

ucsdet_detect(IntlCharsetDetector $det, int $minimum_confidence): array
ucsdet_detect_all(IntlCharsetDetector $det, int $minimum_confidence): array

So the relatively reliable UTF-8 test could be written:

if ('UTF-8' === $detector->detect(100)) {
    // ...
}

This exposes the heuristics available in ICU and leaves the API flexible,
while forcing the consumer to consider the fact that this is statistical
reasoning, not decision.
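
The filtering step could be sketched in userland roughly as follows. This does not call ICU or the proposed extension: filterByConfidence(), the array shape of a match, and the sample data are all hypothetical, and the only assumption carried over is that confidence is an integer from 0 to 100:

```php
<?php
/**
 * Hypothetical: keep only matches whose confidence is at least
 * $minimumConfidence. Each match is assumed to look like
 * ['charset' => string, 'confidence' => int]; this shape is illustrative
 * and not taken from the patch.
 */
function filterByConfidence(array $matches, int $minimumConfidence): array
{
    if ($minimumConfidence < 0 || $minimumConfidence > 100) {
        throw new InvalidArgumentException('Confidence must be between 0 and 100.');
    }

    // array_values() reindexes so $result[0] is always the first survivor.
    return array_values(array_filter(
        $matches,
        static function (array $match) use ($minimumConfidence): bool {
            return $match['confidence'] >= $minimumConfidence;
        }
    ));
}

// Made-up sample data standing in for detector output.
$matches = [
    ['charset' => 'UTF-8', 'confidence' => 100],
    ['charset' => 'ISO-8859-1', 'confidence' => 42],
    ['charset' => 'windows-1252', 'confidence' => 10],
];

$accepted = filterByConfidence($matches, 50);
```

Making the threshold a required argument, with no default, is what forces the caller to pick a number.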


Re: [PHP-DEV] IntlCharsetDetector

2016-04-08 Thread Sara Golemon
On Thu, Apr 7, 2016 at 9:36 AM, Bishop Bettini wrote:
> The problem is, developers are going to write code to guess character sets.
>
True.  But they're going to put more faith in something in the
standard distribution, assuming it's passed muster.

> Ironically, PHPUnit attempts to detect UTF-8
>
Awkward

> I'd rather we include the patch for a few reasons:
>
> 1. so that there's a modern "standard" method of doing so, and that
> "standard" method has plenty of documentation that points people to the
> limitations.
>
In that spirit, how about we put in some stub documentation under the
intl extension with a paragraph or two on why UCharsetDetector *isn't*
wrapped, and why it's such a bad idea to try to solve the problem from
this end.

> 2. to completely expose the underlying ICU, rather than arbitrarily
> deciding one part isn't good for developers to use.
>
Is it arbitrary though?  The fact that coming up with test cases which
produce reasonable/expected results is half crap-shoot makes this an
evidence based decision, not a capricious one.

> 3. to provide an alternative to mb_detect_encoding.
>
And again in that spirit, I think this is a good argument for going
E_DEPRECATED on mb_detect_encoding().  The entire conversation which
led to prototyping an IntlCharsetDetector extension came from the fact
that mb_detect_encoding() wasn't doing its job well.  Rather than have
two supported, bad solutions, I think it'd be better to have one
deprecated (and thus unsupported) bad solution (which is only kept for
BC).

> While I can't say if this will or won't cause more user confusion, I do
> believe this adds value: ICU provides a confidence metric, which no other
> in-built or buildable solution (to my knowledge) provides.
>
The confidence metric is useful, but my spidey sense tells me that
it'll simply be ignored.

How about a compromise.  I'll reorder this patch to be a standalone
extension and we PECLize it.  If someone REALLY wants to throw caution
to the wind, they can, but they're on their own when it gives them
fugly results.

-Sara




Re: [PHP-DEV] IntlCharsetDetector

2016-04-07 Thread Andrea Faulds

Derick Rethans wrote:
> As you say, it doesn't work properly. As a matter of fact, guessing
> charsets, like timezones, is not possible. You need to know which
> charset something is in. If not, you need to address *that* problem.


Indeed, 畂桳栠摩琠敨映捡獴!

--
Andrea Faulds
https://ajf.me/

P.S. Google it.




Re: [PHP-DEV] IntlCharsetDetector

2016-04-07 Thread Bishop Bettini
On Wed, Apr 6, 2016 at 9:18 AM, Sebastian Bergmann wrote:

> On 05.04.2016 at 11:05, Derick Rethans wrote:
> > I would advise against adding this.
> >
> > As you say, it doesn't work properly. As a matter of fact, guessing
> > charsets, like timezones, is not possible. You need to know which
> > charset something is in. If not, you need to address *that* problem.
>
>  Agreed.


The problem is, developers are going to write code to guess character sets.

Ironically, PHPUnit attempts to detect UTF-8.
There is also no shortage of SO posts explaining other approaches. My
favorite is using a preg_match trick.
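
One common form of that trick, shown as an assumption since the linked post is not reproduced here, uses an empty pattern with the /u modifier. The wrapper name isValidUtf8() is made up for illustration:

```php
<?php
/**
 * Hypothetical wrapper: true when $bytes is well-formed UTF-8.
 * The empty pattern matches any string, so the only way preg_match()
 * can fail here is PCRE rejecting the subject as malformed UTF-8.
 */
function isValidUtf8(string $bytes): bool
{
    // Compare against 1 explicitly: preg_match() returns false on error,
    // and preg_last_error() then reports PREG_BAD_UTF8_ERROR.
    return preg_match('//u', $bytes) === 1;
}
```

Note that this checks well-formedness only. A string in another encoding can still pass if its bytes happen to form valid UTF-8 sequences.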

I'd rather we include the patch for a few reasons:

1. so that there's a modern "standard" method of doing so, and that
"standard" method has plenty of documentation that points people to the
limitations.
2. to completely expose the underlying ICU, rather than arbitrarily
deciding one part isn't good for developers to use.
3. to provide an alternative to mb_detect_encoding.

While I can't say if this will or won't cause more user confusion, I do
believe this adds value: ICU provides a confidence metric, which no other
in-built or buildable solution (to my knowledge) provides.


Re: [PHP-DEV] IntlCharsetDetector

2016-04-07 Thread Sebastian Bergmann
On 05.04.2016 at 11:05, Derick Rethans wrote:
> I would advise against adding this.
> 
> As you say, it doesn't work properly. As a matter of fact, guessing 
> charsets, like timezones, is not possible. You need to know which 
> charset something is in. If not, you need to address *that* problem.

 Agreed.




Re: [PHP-DEV] IntlCharsetDetector

2016-04-05 Thread Derick Rethans
On Mon, 4 Apr 2016, Sara Golemon wrote:

> The subject of character set detection (yes, I know, a hard problem to
> solve) came up on SO chat, and Niki noticed that we don't yet wrap the
> ICU UCharsetDetector API so I volunteered to put something together.
> 
> https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector
> 
> The trouble is, for the WIDE majority of my test cases so far, ICU is
> really bad at detecting character sets correctly (as I said, it's a
> tough problem).  In fact, the ICU manual admits that it doesn't even
> look at all of the corpus text, and the "language detection" is a
> byproduct not meant for actual language detection.
> 
> Given all that, I'm inclined to reject the idea of rolling this into
> PHP for fear of just confusing users without actually adding any
> value.
> 
> Thoughts?

I would advise against adding this.

As you say, it doesn't work properly. As a matter of fact, guessing 
charsets, like timezones, is not possible. You need to know which 
charset something is in. If not, you need to address *that* problem.

cheers,
Derick




[PHP-DEV] IntlCharsetDetector

2016-04-04 Thread Sara Golemon
The subject of character set detection (yes, I know, a hard problem to
solve) came up on SO chat, and Niki noticed that we don't yet wrap the
ICU UCharsetDetector API so I volunteered to put something together.

https://github.com/php/php-src/compare/master...sgolemon:intl.charsetdetector

The trouble is, for the WIDE majority of my test cases so far, ICU is
really bad at detecting character sets correctly (as I said, it's a
tough problem).  In fact, the ICU manual admits that it doesn't even
look at all of the corpus text, and the "language detection" is a
byproduct not meant for actual language detection.

Given all that, I'm inclined to reject the idea of rolling this into
PHP for fear of just confusing users without actually adding any
value.

Thoughts?

-Sara
