Character set cluelessness

2012-10-02 Thread Doug Ewell
The United Nations Economic Commission for Europe (UNECE) has released a
new version of UN/LOCODE, and their Secretariat Note document is just as
clueless as ever about character set usage in international standards:

Place names in UN/LOCODE are given in their national language versions
as expressed in the Roman alphabet using the 26 characters of the
character set adopted for international trade data interchange, with
diacritic signs, when practicable (cf. Paragraph 3.2.2 [sic; should be
3.3.2] of the UN/LOCODE Manual). International ISO Standard character
sets are laid down in ISO 8859-1 (1987) and ISO10646-1 (1993). (The
standard United States character set (437), which conforms to these ISO
standards, is also widely used in trade data interchange).

It's 2012. How does one get through to folks like this? I tried writing
to them a few years ago, but I don't think they were impressed by an
individual contribution.

http://www.unece.org/cefact/locode/welcome.html

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell





Re: Character set cluelessness

2012-10-02 Thread john knightley
Sad to say, this seems to be close to the norm for all too many large
organizations, where if it isn't in the 1990s version of the Times Roman
font then it's out.

John






RE: Character set cluelessness

2012-10-02 Thread Jonathan Rosenne
I don't agree with the criticism. These place names are there to be readable
by a wide audience, rather than writable by locals and specialists. They
require the lowest common denominator.

 

Jony

 







RE: Character set cluelessness

2012-10-02 Thread Doug Ewell
Jonathan Rosenne jonathan dot rosenne at gmail dot com wrote:

 I don't agree with the criticism. These place name are there to be
 readable by a wide audience, rather than writable by locals and
 specialists. They require the lowest common denominator.

I don't mind so much if they have to maintain an ASCII-only name field,
or a Latin-1-only field. But referencing the 1993 version of ISO
10646-1, or claiming that MS-DOS code page 437 is the standard United
States character set in 2012 and that it conforms to 8859-1 and
10646, helps nobody.
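
The mismatch is easy to check. A minimal sketch, assuming Python's
built-in cp437 and latin-1 codecs as stand-ins for code page 437 and
ISO 8859-1 (the helper name differing_bytes is made up for
illustration): the two agree on the ASCII range but assign different
characters to byte values above 0x7F.

```python
def differing_bytes(codec_a: str, codec_b: str) -> list[int]:
    """Return the byte values that decode to different characters."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        if raw.decode(codec_a) != raw.decode(codec_b):
            diffs.append(value)
    return diffs


diffs = differing_bytes("cp437", "latin-1")

# Every disagreement lies outside the 7-bit ASCII range.
print(all(value >= 0x80 for value in diffs))

# The same byte yields unrelated characters in the two sets.
print(bytes([0xE9]).decode("latin-1"))  # é
print(bytes([0xE9]).decode("cp437"))    # Θ
```

So a file written in one of these sets and read as the other comes out
wrong for any character beyond ASCII.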

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell





Re: Character set cluelessness

2012-10-02 Thread Michael Everson
On 2 Oct 2012, at 21:40, Doug Ewell wrote:

 But referencing the 1993 version of ISO 10646-1, or claiming that MS-DOS code 
 page 437 is the standard United States character set in 2012 and that it 
 conforms to 8859-1 and 10646, helps nobody.

It's not that it helps nobody. It's just that it's WRONG.

Michael Everson * http://www.evertype.com/





Re: Character set cluelessness

2012-10-02 Thread Mark Davis ☕
I tend to agree. What would be useful is to have one column for the city in
the local language (or more columns for multilingual cities), but it is
extremely useful to have an ASCII version as well.

Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —





 



RE: Character set cluelessness

2012-10-02 Thread Doug Ewell
Mark Davis  mark at macchiato dot com wrote:

 I tend to agree. What would be useful is to have one column for the
 city in the local language (or more columns for multilingual cities),
 but it is extremely useful to have an ASCII version as well.

They have two name fields, one (Name) for the name transliterated into
Latin, and a second (NameWoDiacritics) which is an ASCII-smashed
version of the first. Again, that's fine as long as I am free to ignore
the ASCII version. They don't attempt to represent names in non-Latin
scripts, which is not my beef here.
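
For illustration, one way such an ASCII-smashed field could be derived
from the Name field. This is only a sketch under stated assumptions, not
UNECE's actual procedure: the function name ascii_smash and the small
FALLBACKS table are hypothetical, and the approach (Unicode
decomposition, then dropping combining marks) is just one common
technique.

```python
import unicodedata

# Hypothetical fallback table for letters that have no canonical
# decomposition into base letter + combining mark.
FALLBACKS = {
    "œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE",
    "ø": "o", "Ø": "O", "ł": "l", "Ł": "L", "ß": "ss",
}


def ascii_smash(name: str) -> str:
    """Reduce a Latin-script name to plain ASCII."""
    replaced = "".join(FALLBACKS.get(ch, ch) for ch in name)
    decomposed = unicodedata.normalize("NFD", replaced)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII (e.g. non-Latin scripts) becomes "?".
    return stripped.encode("ascii", errors="replace").decode("ascii")


print(ascii_smash("Göteborg"))  # Goteborg
print(ascii_smash("Łódź"))      # Lodz
```

The conversion only goes one way: the diacritics cannot be recovered
from the smashed form, which is why the full Name field still matters.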

There are many names in the Name (i.e. beyond ASCII) field that
include characters beyond 8859-1, such as œ and ̆z, and certainly many
beyond CP437. This is a good thing (although there are some errors, not
as many as in past years), but they need to fix their documentation to
reflect what they actually do, and not make these irrelevant,
misleading, and/or inaccurate references to 437 and to a 19-year-old
version of 10646.
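
A quick way to see which character sets can hold a given name
character. A minimal sketch, assuming Python's built-in codecs; the
helper name encodable is made up for illustration.

```python
def encodable(text: str, codec: str) -> bool:
    """Report whether every character of text exists in the codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


for codec in ("ascii", "cp437", "latin-1", "utf-8"):
    print(codec, encodable("œ", codec))
# ascii False
# cp437 False
# latin-1 False
# utf-8 True
```

A name containing œ therefore cannot be stored in either of the
8-bit sets the documentation cites.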

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell




Re: Character set cluelessness

2012-10-02 Thread Richard Wordingham
On Tue, 2 Oct 2012 13:49:57 -0700
Mark Davis ☕ m...@macchiato.com wrote:

 I tend to agree. What would be useful is to have one column for the
 city in the local language (or more columns for multilingual cities),
 but it is extremely useful to have an ASCII version as well.

Academic Romanisation would be useful as well!

Richard.




Re: Character set cluelessness

2012-10-02 Thread Mark Davis ☕
E.g., in http://www.unece.org/fileadmin/DAM/cefact/locode/gr.htm

Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —





 





Re: Character set cluelessness

2012-10-02 Thread Mark Davis ☕
And just to be clear, I do agree that their documentation of the standards
usage, well, needs improvement. I'm just talking about the actual data, and
for that as a practical matter it is valuable to have both the native
language version(s) of a name, and a Latin equivalent.

Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —





 






Re: Character set cluelessness

2012-10-02 Thread Richard Wordingham
On Tue, 02 Oct 2012 09:14:08 -0700
Doug Ewell d...@ewellic.org wrote:

 It's 2012. How does one get through to folks like this?

Even people who should know better can get confused about character
sets.  Does anyone know what 'a complex script Unicode range' is?  It's
a term that occurs in the Office Open XML specification, but I
can't find a definition for it.

It's just possible that it means a range where hypothetically unassigned
characters would not be left-to-right, but I've a feeling it ought to
include Vietnamese characters for all that they're Latin script.

Possibly the definitions have not been provided because the concept
ought to involve the tricky task of breaking text runs into script
runs.  (Lots of people feel one should be able to add script-specific
combining marks to U+25CC DOTTED CIRCLE, U+2013 EN DASH and U+00D7
MULTIPLICATION SIGN or perhaps even U+0078 LATIN SMALL LETTER X.
U+0964 DEVANAGARI DANDA is used with the Latin, Devanagari and Tamil
scripts, to name but a few.)
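
To make the difficulty concrete, here is a deliberately naive
script-run splitter. It is a sketch under loud assumptions: Python's
standard unicodedata module does not appear to expose the Script
property, so the first word of the character name is used as a crude
stand-in, and the names crude_script, script_runs and KNOWN_SCRIPTS are
invented for this example.

```python
import unicodedata

# Assumption: only these name prefixes are treated as scripts;
# everything else is lumped together as "COMMON".
KNOWN_SCRIPTS = {"LATIN", "GREEK", "CYRILLIC", "DEVANAGARI", "TAMIL"}


def crude_script(ch: str) -> str:
    """Guess a script from the first word of the character name."""
    first_word = unicodedata.name(ch, "").split(" ", 1)[0]
    return first_word if first_word in KNOWN_SCRIPTS else "COMMON"


def script_runs(text: str) -> list[tuple[str, str]]:
    """Split text into (script, substring) runs."""
    runs: list[list[str]] = []
    for ch in text:
        label = crude_script(ch)
        is_mark = unicodedata.category(ch).startswith("M")
        if runs and (is_mark or label in ("COMMON", runs[-1][0])):
            # Marks and script-neutral characters join the current run.
            runs[-1][1] += ch
        elif runs and runs[-1][0] == "COMMON":
            # A neutral run adopts the first real script that follows it.
            runs[-1] = [label, runs[-1][1] + ch]
        else:
            runs.append([label, ch])
    return [(label, chunk) for label, chunk in runs]


print(script_runs("abc\u0964"))
# [('LATIN', 'abc'), ('DEVANAGARI', '।')]
```

The danda is split away from the Latin text it punctuates, which is
exactly the kind of case a per-character notion of script gets wrong.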

Richard.




Re: Character set cluelessness

2012-10-02 Thread Martin J. Dürst
Richard - Complex script usually refers to scripts where rendering isn't 
just simply putting glyphs side by side. That includes stuff with 
combining marks, ligatures, reordering, stacking, and the like.


Regards,   Martin.








Re: Character set cluelessness

2012-10-02 Thread Martin J. Dürst
So in order to get something going here, why doesn't Doug draft a letter 
to these guys (possibly based on the one from a few years ago) and then 
Mark sends it off in his position at Unicode, which hopefully will 
impress them more than just a personal contribution.


Being upset on this list (which I am too, of course) doesn't change anything.

Regards,   Martin.








The term complex script in OOXML spec (Re: [unicode] Re: Character set cluelessness)

2012-10-02 Thread suzuki toshiya

Dear Richard,

There has been a long discussion of OOXML's "complex script" in JTC1/SC34
since 2009.

https://skydrive.live.com/view.aspx/Public%20Documents/2009/DR-09-0040.docx?cid=c8ba0861dc5e4adcsc=documents
I expect the next corrigendum or amendment will say more about it.
Unfortunately, what the implementation means by "complex script" may
differ from what Unicode experts understand by the term.

Regards,
mpsuzuki


Richard.