Re: Review of changes to Web compat-sensitive prefs in localizations

2013-08-28 Thread Anne van Kesteren
On Wednesday, February 27, 2013 12:28:43 PM UTC, Axel Hecht wrote:
 That's rather orthogonal to what you're currently trying to do, but it's 
 also indicating to me that we should remove all of those settings from 
 intl.properties, and just leave accept-lang, and deduce the rest.

So how about the parser just accepts a locale value and implements the
locale-to-fallback-encoding map itself? Given the numerous problems
discovered[1], the fact that the locale defaults are actually part of the
HTML Standard, and that exposing this as a changeable option encourages
people to tweak it, I think that would be a better way forward.
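
A minimal sketch of what "the parser owns the map" could look like. The
table is an illustrative excerpt only and should be checked against the
table in the HTML Standard; the function name and the windows-1252 default
are assumptions made for this example, not existing Gecko code:

```python
# Illustrative excerpt: a few locale -> fallback encoding pairs.
# Assumption: the complete, authoritative table would be taken from the
# HTML Standard rather than maintained by hand like this.
LOCALE_FALLBACKS = {
    "ja": "Shift_JIS",
    "ko": "EUC-KR",
    "ru": "windows-1251",
    "uk": "windows-1251",
    "tr": "windows-1254",
    "zh-CN": "gbk",
    "zh-TW": "big5",
}

# Assumed default for locales that have no entry of their own.
DEFAULT_FALLBACK = "windows-1252"


def fallback_encoding_for_locale(locale: str) -> str:
    """Return the fallback encoding for a locale such as "zh-TW" or "ru-RU".

    The full tag is tried first, then progressively shorter prefixes
    ("ja-JP-mac" -> "ja-JP" -> "ja"), and finally the default.
    """
    parts = locale.split("-")
    while parts:
        candidate = "-".join(parts).lower()
        for tag, encoding in LOCALE_FALLBACKS.items():
            if tag.lower() == candidate:
                return encoding
        parts.pop()
    return DEFAULT_FALLBACK
```

With a map like this living next to the parser, a localization would supply
only its locale code and could no longer set an arbitrary charset label.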

I wonder if there are similar settings that are in a sense too technical to 
leave up to localization teams.


[1] Recent issues discovered by hsivonen:
* https://bugzilla.mozilla.org/show_bug.cgi?id=910163
* https://bugzilla.mozilla.org/show_bug.cgi?id=910165
* https://bugzilla.mozilla.org/show_bug.cgi?id=910169 (bogus value, even)
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Review of changes to Web compat-sensitive prefs in localizations

2013-08-28 Thread Henri Sivonen
On Wed, Aug 28, 2013 at 3:33 PM, Henri Sivonen hsivo...@hsivonen.fi wrote:

  If I were starting such a research project, I'd start by testing
 hypotheses about TLD correlation with legacy encodings. The first thing I'd
 like to test would be whether it would be an improvement to make builds
 that have Traditional Chinese as the UI language use gbk (as opposed to
 big5) as the fallback encoding when browsing content loaded from a .cn
 domain.


To elaborate, we could first have a lookup table from country TLDs to
legacy encodings and only as a second step use the lookup from the UI
localization to legacy encodings, for TLDs that don't have a strong
country affiliation. So for example, we'd map .cn to gbk, .tw to big5, .ru
to windows-1251 and .de, .fr, .se, .nl, .fi etc. to windows-1252, but for
.com, .org and such we'd base the guess on the UI locale like today but
using a less brittle way of managing the mapping.
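
A rough sketch of that two-step lookup, for illustration only. The TLD
pairs are the ones named above; the UI-locale table, the function name and
the windows-1252 default are assumptions, not part of any patch:

```python
# Step 1 table: country TLD -> legacy encoding (pairs taken from the text).
TLD_TO_ENCODING = {
    "cn": "gbk",
    "tw": "big5",
    "ru": "windows-1251",
    "de": "windows-1252",
    "fr": "windows-1252",
    "se": "windows-1252",
    "nl": "windows-1252",
    "fi": "windows-1252",
}

# Step 2 table: hypothetical stand-in for the UI-locale mapping.
UI_LOCALE_TO_ENCODING = {
    "zh-TW": "big5",
    "zh-CN": "gbk",
    "ru": "windows-1251",
}

# Assumed default when neither table has an answer.
DEFAULT_ENCODING = "windows-1252"


def guess_fallback_encoding(hostname: str, ui_locale: str) -> str:
    """Guess a fallback encoding for unlabeled content from `hostname`."""
    # Take the last label of the host name, ignoring a trailing dot.
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    by_tld = TLD_TO_ENCODING.get(tld)
    if by_tld is not None:
        return by_tld
    # TLDs without a strong country affiliation (.com, .org, ...) fall
    # through to the UI-locale-based guess.
    return UI_LOCALE_TO_ENCODING.get(ui_locale, DEFAULT_ENCODING)
```

For a Traditional Chinese build, `guess_fallback_encoding("example.cn",
"zh-TW")` gives gbk while `guess_fallback_encoding("example.com", "zh-TW")`
still gives big5, which is the case I'd want to test first.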

But anyway, that would be improving the guessing instead of just fixing how
the current guessing mechanism is managed. I don't want better to be a
blocker for good here.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/


Re: Review of changes to Web compat-sensitive prefs in localizations

2013-08-28 Thread Henri Sivonen
On Wed, Aug 28, 2013 at 3:46 PM, Henri Sivonen hsivo...@hsivonen.fi wrote:

 On Wed, Aug 28, 2013 at 3:33 PM, Henri Sivonen hsivo...@hsivonen.fi wrote:

  If I were starting such a research project, I'd start by testing
 hypotheses about TLD correlation with legacy encodings. The first thing I'd
 like to test would be whether it would be an improvement to make builds
 that have Traditional Chinese as the UI language use gbk (as opposed to
 big5) as the fallback encoding when browsing content loaded from a .cn
 domain.


 To elaborate, we could first have a lookup table from country TLDs to
 legacy encodings and only as a second step use the lookup from the UI
 localization to legacy encodings, for TLDs that don't have a strong
 country affiliation. So for example, we'd map .cn to gbk, .tw to big5, .ru
 to windows-1251 and .de, .fr, .se, .nl, .fi etc. to windows-1252, but for
 .com, .org and such we'd base the guess on the UI locale like today but
 using a less brittle way of managing the mapping.


Filed as: https://bugzilla.mozilla.org/show_bug.cgi?id=910211

-- 
Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/


Re: Review of changes to Web compat-sensitive prefs in localizations

2013-02-27 Thread Henri Sivonen
On Fri, Feb 22, 2013 at 8:03 PM, Axel Hecht l...@mozilla.com wrote:
 On 22.02.13 18:41, Henri Sivonen wrote:

 On Feb 22, 2013 5:30 PM, Axel Hecht l...@mozilla.com wrote:

 There's just no other way than post-mortem work. That's one of the

 reasons why we're not taking arbitrary changesets to ship to any audience
 beyond aurora and nightly, for beta and release, we got to have technical
 checks in place.

 Where should I file bugs to add checks to this set of checks?


 Not sure which checks you're talking about, so I can't really tell what you
 want.

I meant checks like flagging attempts to go to beta with either of the
following:
 * Detector pref not being blank except for a specific white list of
particular values for the ru, uk, ja, ja-JP-Mac and zh-TW locales.
 * Fallback charset set to UTF-8 in any locale that doesn't already
have it set to UTF-8.
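
To make the two checks concrete, here is a hedged sketch. The pref names
are the real ones from intl.properties; the function, its arguments and
the detector names in the usage note are hypothetical, and this is not
how the existing sign-off tooling is written:

```python
def check_locale_prefs(locale, old_prefs, new_prefs, detector_whitelist):
    """Return a list of problems found in `new_prefs`; empty means OK.

    `old_prefs` / `new_prefs` map pref names to values before and after
    the change. `detector_whitelist` maps a locale code to the set of
    detector values that locale is allowed to use.
    """
    problems = []

    # Check 1: the detector pref must be blank unless the value is
    # whitelisted for this particular locale.
    detector = new_prefs.get("intl.charset.detector", "").strip()
    allowed = detector_whitelist.get(locale, set())
    if detector and detector not in allowed:
        problems.append(
            "%s: intl.charset.detector=%r is not whitelisted" % (locale, detector)
        )

    # Check 2: the fallback charset must not become UTF-8 unless it
    # already was UTF-8 before the change.
    new_default = new_prefs.get("intl.charset.default", "").strip().upper()
    old_default = old_prefs.get("intl.charset.default", "").strip().upper()
    if new_default == "UTF-8" and old_default != "UTF-8":
        problems.append(
            "%s: intl.charset.default newly set to UTF-8" % locale
        )

    return problems
```

The whitelist would be something like `{"ru": {"detector-for-ru"}, ...}`
for ru, uk, ja, ja-JP-Mac and zh-TW, with the actual detector names filled
in; "detector-for-ru" is a placeholder, not a real value.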

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: Review of changes to Web compat-sensitive prefs in localizations

2013-02-27 Thread Axel Hecht

On 27.02.13 09:30, Henri Sivonen wrote:

On Fri, Feb 22, 2013 at 8:03 PM, Axel Hecht l...@mozilla.com wrote:

On 22.02.13 18:41, Henri Sivonen wrote:


On Feb 22, 2013 5:30 PM, Axel Hecht l...@mozilla.com wrote:


There's just no other way than post-mortem work. That's one of the


reasons why we're not taking arbitrary changesets to ship to any audience
beyond aurora and nightly, for beta and release, we got to have technical
checks in place.

Where should I file bugs to add checks to this set of checks?



Not sure which checks you're talking about, so I can't really tell what you
want.


I meant checks like flagging attempts to go to beta with either of the
following:
  * Detector pref not being blank except for a specific white list of
particular values for the ru, uk, ja, ja-JP-Mac and zh-TW locales.
  * Fallback charset set to UTF-8 in any locale that doesn't already
have it set to UTF-8.



I'm doing a source-based review, which at least catches regressions to 
those settings.


And I think we're doing charset detector settings wrong. Let me see if 
I've got right what we're doing:


- most content should be labeled for charset
- if not, let's see if we can guess the encoding
-- if we assume the language of the content, we can guess better
-- many languages really only have one option
-- ru, uk, ja, zh-TW do have options, so we use a charset detector

Now, I don't think it's right to use the UI language to guess content 
language. We have a list of user-preferred languages (with good defaults 
based on UI language). We should go through that list, and pick charsets 
to try for unlabeled content from there.
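
Roughly what I have in mind, as a sketch only. The language-to-charset
table, the function name and the input format (a comma-separated language
list, optional ";q=" parts ignored) are assumptions for illustration:

```python
from typing import List

# Hypothetical table: language tag -> legacy charsets worth trying, in
# order of preference. Languages with several entries are the ones where
# a detector has real work to do.
LANGUAGE_TO_ENCODINGS = {
    "ja": ["Shift_JIS", "EUC-JP", "ISO-2022-JP"],
    "ru": ["windows-1251", "KOI8-R"],
    "uk": ["windows-1251", "KOI8-U"],
    "zh-tw": ["big5"],
    "zh-cn": ["gbk"],
    "en": ["windows-1252"],
    "de": ["windows-1252"],
}


def candidate_encodings(accept_languages: str) -> List[str]:
    """Return charsets to try for unlabeled content, in preference order."""
    candidates: List[str] = []
    for item in accept_languages.split(","):
        # Drop any ";q=..." weight; the list order is used as the ranking.
        tag = item.split(";", 1)[0].strip().lower()
        if not tag:
            continue
        encodings = LANGUAGE_TO_ENCODINGS.get(tag)
        if encodings is None:
            # Fall back from e.g. "ru-ru" to the bare language "ru".
            encodings = LANGUAGE_TO_ENCODINGS.get(tag.split("-", 1)[0], [])
        for encoding in encodings:
            if encoding not in candidates:
                candidates.append(encoding)
    return candidates
```

A user with "ru-RU, ru, en-US, en" would get windows-1251 and KOI8-R tried
before windows-1252, independent of which UI language the build ships with.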


That's rather orthogonal to what you're currently trying to do, but it's 
also indicating to me that we should remove all of those settings from 
intl.properties, and just leave accept-lang, and deduce the rest.


You also mentioned in the bug that you didn't get the OK to use 
telemetry to gather further data. I think if we just collect the data 
about the charset optimization and how well it's doing, we should be OK. 
I.e., at the point where the locale doesn't matter and we record just 
cp-1252 etc., the entropy goes up a good deal, in particular for small 
locales. I'd argue that this might even make sense as part of the health 
report.


Axel


Re: Review of changes to Web compat-sensitive prefs in localizations

2013-02-22 Thread Axel Hecht

On 22.02.13 15:37, Henri Sivonen wrote:

I've been finding and, to a lesser extent, reporting and writing
patches for bugs where a localization sets the fallback encoding to a
value that doesn't suit the purpose of the fallback. In some cases,
there is such bogosity in the intl.properties file (e.g. translation of
the word windows as part of a charset label) that I suspect that
changes to intl.properties have been landing without review.

I propose we adopt a rule that says that localizations need review
from the HTML parser module owner (i.e. me) to change the values of
preferences that modify the behavior of the HTML parser. (In practice,
this means the localizable properties intl.charset.default and
intl.charset.detector.)

Opinions?



I don't think that .platform is the right group to discuss policies for 
l10n, tbh.


Anyway, I don't think that it requires your review. For one, these rules 
just don't work in practice. We're facing the very same problem with 
search engines. There's just no other way than post-mortem work. That's 
one of the reasons why we're not taking arbitrary changesets to ship to 
any audience beyond aurora and nightly; for beta and release, we've got 
to have technical checks in place.


I usually catch regressions to intl.properties when reviewing requests 
for updates to those changesets.


That said, I don't know what intl.charset.detector should be set to, 
aside from nothing. Looking at your patch, the comment doesn't make that 
any clearer either; I'll follow up there.


Axel


Re: Review of changes to Web compat-sensitive prefs in localizations

2013-02-22 Thread Henri Sivonen
On Feb 22, 2013 5:30 PM, Axel Hecht l...@mozilla.com wrote:
 There's just no other way than post-mortem work. That's one of the
reasons why we're not taking arbitrary changesets to ship to any audience
beyond aurora and nightly, for beta and release, we got to have technical
checks in place.

Where should I file bugs to add checks to this set of checks?


Re: Review of changes to Web compat-sensitive prefs in localizations

2013-02-22 Thread L. David Baron
On Friday 2013-02-22 16:37 +0200, Henri Sivonen wrote:
 I've been finding and, to a lesser extent, reporting and writing
 patches for bugs where a localization sets the fallback encoding to a
 value that doesn't suit the purpose of the fallback. In some cases,
 there is such bogosity in the intl.properties file (e.g. translation of
 the word windows as part of a charset label) that I suspect that
 changes to intl.properties have been landing without review.

It might not be a bad idea to have a better explanation in
http://mxr.mozilla.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/intl.properties
of why one would want to change intl.charset.default and
intl.charset.detector, explaining clearly that they should only be
set to interesting values to deal with a substantial body of
legacy content that requires those values, and then saying what they
should be in the absence of such legacy content (the latter should
clearly be empty; I'm not sure whether the former should be UTF-8 or
ISO-8859-1, but we should have a consistent policy).

That said, I don't actually know whether the tools localizers use to
do localization lead them to read the text.

The reality is that I suspect it may be important for you to do
occasional audits of these values; it could also be valuable to have
a tool that exposes all of them in a single place (perhaps even a
place with history, like an automatically-generated wiki page).
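
A sketch of what such a tool might do, under stated assumptions: the
localizations are checked out side by side under one root directory, the
first path component below that root is the locale code, and the file is
plain key=value lines with # comments. The function names are made up for
this example:

```python
from pathlib import Path

# The two localizable prefs under discussion.
PREFS = ("intl.charset.default", "intl.charset.detector")


def parse_prefs(text):
    """Extract the two prefs from the text of an intl.properties file."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() in PREFS:
            values[key.strip()] = value.strip()
    return values


def collect(root):
    """Map locale code -> prefs for every intl.properties under `root`."""
    root = Path(root)
    table = {}
    for path in sorted(root.rglob("intl.properties")):
        # Assumption: <root>/<locale>/.../intl.properties
        locale = path.relative_to(root).parts[0]
        table[locale] = parse_prefs(path.read_text(encoding="utf-8"))
    return table


def format_table(table):
    """Render one tab-separated line per locale, sorted by locale code."""
    lines = []
    for locale in sorted(table):
        prefs = table[locale]
        lines.append("\t".join([locale] + [prefs.get(p, "") for p in PREFS]))
    return "\n".join(lines)
```

Running `print(format_table(collect(...)))` on a schedule and committing
the output somewhere would give the single place with history.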

-David

-- 
L. David Baron http://dbaron.org/
Mozilla   http://www.mozilla.org/


Re: Review of changes to Web compat-sensitive prefs in localizations

2013-02-22 Thread Axel Hecht

On 22.02.13 20:02, L. David Baron wrote:

On Friday 2013-02-22 16:37 +0200, Henri Sivonen wrote:

I've been finding and, to a lesser extent, reporting and writing
patches for bugs where a localization sets the fallback encoding to a
value that doesn't suit the purpose of the fallback. In some cases,
there is such bogosity in the intl.properties file (e.g. translation of
the word windows as part of a charset label) that I suspect that
changes to intl.properties have been landing without review.


It might not be a bad idea to have a better explanation in
http://mxr.mozilla.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/intl.properties
of why one would want to change intl.charset.default and
intl.charset.detector, explaining clearly that they should only be
set to interesting values to deal with a substantial body of
legacy content that requires those values, and then saying what they
should be in the absence of such legacy content (the latter should
clearly be empty; I'm not sure whether the former should be UTF-8 or
ISO-8859-1, but we should have a consistent policy).

That said, I don't actually know whether the tools localizers use to
do localization lead them to read the text.

The reality is that I suspect it may be important for you to do
occasional audits of these values; it could also be valuable to have
a tool that exposes all of them in a single place (perhaps even a
place with history, like an automatically-generated wiki page).

-David



Henri filed https://bugzilla.mozilla.org/show_bug.cgi?id=844042 before 
posting here (or at least around the same time).


Axel