Re: [indic] Indian Rupee symbol

2010-07-16 Thread Martin J. Dürst



On 2010/07/16 16:34, Michael Everson wrote:

A proposal to add the character to the Unicode Standard and ISO/IEC 10646 was 
published yesterday. See http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3862.pdf



The shape of the currency sign has been specified as “an amalgam” of the 
DEVANAGARI LETTER RA, and

the LATIN CAPITAL LETTER RA


LATIN CAPITAL LETTER RA? Shouldn't that be LATIN CAPITAL LETTER R?

Regards,   Martin.


--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Reasonable to propose stability policy on numeric type = decimal

2010-07-25 Thread Martin J. Dürst



On 2010/07/26 4:37, Asmus Freytag wrote:


PPS: a very hypothetical tough case would be a script where letters
serve both as letters and as decimal place-value digits, and with modern
living practice.


Well, there actually is such a script, namely Han. The digits (一、二、 
三、四、五、六、七、八、九、〇) are used both as letters and as decimal 
place-value digits, and they are scattered widely, and of course there 
is a lot of modern living practice.
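
To make the place-value usage concrete, here is a quick Ruby sketch 
(mine, not from the original mail; the digit table is just the ten 
characters listed above):

# encoding: utf-8
HAN_DIGITS = { '〇'=>0, '一'=>1, '二'=>2, '三'=>3, '四'=>4,
               '五'=>5, '六'=>6, '七'=>7, '八'=>8, '九'=>9 }
# Read a run of Han digits as a decimal place-value number,
# e.g. the year notation 二〇一〇 for 2010.
puts '二〇一〇'.chars.map { |c| HAN_DIGITS.fetch(c) }.join.to_i   # => 2010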


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Reasonable to propose stability policy on numeric type = decimal

2010-07-28 Thread Martin J. Dürst



On 2010/07/28 0:36, John Dlugosz wrote:


I can imagine supporting national representations for numbers for outputting 
reports,
but I don't imagine anyone writing in a programming language would be compelled 
to
type 四佰六十 instead of 560.


Well, indeed, I hope nobody would do that. 四佰六十 would be 460, and 
560 would be 五佰六十 :-).
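
For what it's worth, the non-place-value reading, where 十/佰(百)/千 act 
as powers-of-ten multipliers, can be sketched in a few lines of Ruby 
(an illustration of mine, not from the original mail; it only handles 
these few multipliers):

# encoding: utf-8
DIGITS = { '一'=>1, '二'=>2, '三'=>3, '四'=>4, '五'=>5,
           '六'=>6, '七'=>7, '八'=>8, '九'=>9 }
MULTIPLIERS = { '十'=>10, '百'=>100, '佰'=>100, '千'=>1000 }

def han_to_i(s)           # e.g. 五佰六十 -> 5*100 + 6*10
  total, pending = 0, 0
  s.each_char do |c|
    if DIGITS[c]
      pending = DIGITS[c]
    elsif MULTIPLIERS[c]
      total += (pending.zero? ? 1 : pending) * MULTIPLIERS[c]
      pending = 0
    end
  end
  total + pending
end

puts han_to_i('五佰六十')   # => 560
puts han_to_i('四佰六十')   # => 460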


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Reasonable to propose stability policy on numeric type = decimal

2010-07-28 Thread Martin J. Dürst



On 2010/07/29 13:33, karl williamson wrote:

Asmus Freytag wrote:

On 7/25/2010 6:05 PM, Martin J. Dürst wrote:



Well, there actually is such a script, namely Han. The digits (一、
二、三、四、五、六、七、八、九、〇) are used both as letters and as
decimal place-value digits, and they are scattered widely, and of
course there is a lot of modern living practice.



The situation is worse than you indicate, because the same characters
are also used as elements in a system that doesn't use place-value,
but uses special characters to show powers of 10.


Is it the case that a sequence of just these characters, without any
intervening characters, and not adjacent to the special characters you
mention, always means a place-value decimal number?


No. Sequences of numeric Kanji are also used in names and word-plays, 
and as sequences of individual small numbers.


But the same applies to our digits. A very simple example is to use them 
as a ruler in plain text:


         1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: High dot/dot above punctuation?

2010-07-29 Thread Martin J. Dürst



On 2010/07/29 19:51, Juanma Barranquero wrote:

On Thu, Jul 29, 2010 at 10:15, Khaled Hosny <khaledho...@eglug.org> wrote:


Also, I don't buy into the Unicode idea of
encoding different sets of decimal digits separately; they are all
different graphical presentations of the same thing.


Not in a document where the author is discussing the differences
between them, for example.


The fact that the author is discussing the differences doesn't help in 
deciding whether to encode one or two characters. A document may discuss 
the roman and italic versions of a character, or the Times and Palatino 
versions of a character, or different versions of Times fonts for the 
same character, and so on. It's very clear that we would get nowhere if 
we wanted to encode all these.


In simpler words, you cannot use the needs of discussions about encoding 
(the meta-level) to determine encodings.


Regards,   Martin.


--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: High dot/dot above punctuation?

2010-07-29 Thread Martin J. Dürst

Hello Juanma,

On 2010/07/30 12:05, Juanma Barranquero wrote:

On Fri, Jul 30, 2010 at 04:52, Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:


It's very clear that we would get nowhere if we wanted to encode
all these.


The comment I responded to talked about characters that are already encoded.


Sorry, I didn't get that.


In simpler words, you cannot use the needs of discussions about encoding
(the meta-level) to determine encodings.


Discussing Arabic versus Latin numerals is not more meta-level than
talking about upper- vs. lowercase.


Yes indeed. If these distinctions were only necessary when talking 
*about* these characters (meta-level) rather than when just using them 
(non-meta), then I would indeed agree that there is no reason to encode 
them separately.


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Most complete (free) Chinese font?

2010-08-02 Thread Martin J. Dürst

Hello Michael,

I hope you still remember that I am one of the (apparently very few) 
people who paid for Everson Mono. That was more than ten years ago.


On 2010/08/03 1:02, Michael Everson wrote:

On 2 Aug 2010, at 13:10, Leonardo Boiko wrote:


When did I say there was something shameful about non-freeness? I only said, 
and I quote, that it’s not my thing.


I find the term non-free to smack of élitism and a view that commerce is 
undesirable. And I'm not even very good at being a merchant.


Instead of criticising a term, would you mind proposing a different term?



It’s much simpler, for me, to stick to an automated system that guarantees 
freedom.


Indeed? Let us weep for those benighted folks who shackled themselves to the 
world of pecuniary transaction by choosing to render a shareware fee for 
Everson Mono


Nobody has to weep for me. I actually haven't used Everson Mono much, 
I'm not even sure whether I ever used it, but at the time I found the 
idea that somebody was working on a font that covered Unicode really 
worthy of support.


Regards,   Martin.


--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Results of public Review Issues (in particular #121)

2010-08-03 Thread Martin J. Dürst

Dear Unicode Experts,

In a discussion about a new protocol, there was some issue about how to 
replace illegal bytes in UTF-8 with U+FFFD. That reminded me that 
there was once a Public Review Issue about this, and that as a result, I 
added something to the Ruby (programming language) codebase. I traced 
this back to the method test_public_review_issue_121 added at 
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?r1=18291&r2=18290&pathrev=18291, 
and from there to http://www.unicode.org/review/pr-121.html.


What I now would like to know is what became of the UTC's tentative 
preference for option #2, and where this is documented, and if 
possible, which other programming languages and libraries use or don't 
use this preference.
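
For what it's worth, here is the kind of quick check I have in mind, as 
a Ruby sketch (mine, not part of the original report): feed a transcoder 
an ill-formed UTF-8 sequence and look at how many U+FFFDs come out, 
which is exactly the question PRI #121 was about.

# encoding: utf-8
# Convert an ill-formed UTF-8 string to UTF-16BE with replacement,
# convert it back, and list the resulting code points.
bad = "a\xE0\x80b".force_encoding('UTF-8')      # truncated 3-byte sequence
clean = bad.encode('UTF-16BE', invalid: :replace).encode('UTF-8')
puts clean.codepoints.map { |cp| format('U+%04X', cp) }.join(' ')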


On a higher level, this also suggests that it would be very good to add 
a bit more (meta)data to these review issues, such as date opened, 
date closed, and resolution.


After manipulating the URI a bit, I got to 
http://www.unicode.org/review/ and from there to 
http://www.unicode.org/review/resolved-pri-100.html, where I can find:


Resolution: Closed 2008-08-29. The UTC decided to adopt option 2 of the PRI.

This should be directly linked from 
http://www.unicode.org/review/pr-121.html (or just put that information 
on that page). Also, I'm still interested in where the result of this 
resolution is nailed down (a new version of the standard, with chapter 
and verse, or a TR or some such).


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-06 Thread Martin J. Dürst



On 2010/08/05 2:56, Asmus Freytag wrote:

On 8/2/2010 5:04 PM, Karl Pentzlin wrote:

I have compiled a draft proposal:
Proposal to add Variation Sequences for Latin and Cyrillic letters
The draft can be downloaded at:
http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB).
The final proposal is intended to be submitted for the next UTC
starting next Monday (August 9).

Any comments are welcome.

- Karl Pentzlin


This is an interesting proposal to deal with the glyph selection problem
caused by the unification process inherent in character encoding.

When Unicode was first contemplated, the web did not exist and the
expectation was that it would nearly always be possible to specify the
font to be used for a given text and that selecting a font would give
the correct glyph.


The Web may finally get to solve this problem, although it may still 
take some time to be fully deployed. Please see http://www.w3.org/Fonts/ 
for more details and pointers.


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Proposed Update Unicode Technical Standard #46 (Unicode IDNA Compatibility Processing)

2010-09-24 Thread Martin J. Dürst

On 2010/09/23 5:10, Markus Scherer wrote:


No mistake here: The 63-octet limitation only applies when generating a
string for the DNS lookup, that is, in the ToASCII operation. It makes no
sense to count DNS octets in a ToUnicode result.
The test file has the appropriate error code for the ToASCII result, and the
normal string result for ToUnicode.


Yes indeed. For some actual examples of very long URIs (which actually 
resolve), see tests 121 (single long label) and 122 at

http://www.w3.org/2004/04/uri-rel-test.html.

Also, for a discussion of potential length limits in IDNAbis on Unicode 
strings (which, for many good reasons, were ultimately rejected), please 
see the discussion around

http://lists.w3.org/Archives/Public/public-iri/2009Sep/0064.html.

Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: First posting to list: Unicode.org: unicode - punycode converter tool?

2010-10-30 Thread Martin J. Dürst

On 2010/10/30 9:17, Markus Scherer wrote:

On Fri, Oct 29, 2010 at 3:57 PM, JP Blankert (thuis & PC based)
<jpblank...@zonnet.nl> wrote:


Dear unicode.org interested,

I discovered at least 1 flaw in the converter tools I used so far (such as
Verisign's IDN to punycode converter): none of the ones I checked recognises
the German character

ß

(the sz, as from 'Straße' )

correctly, the sign is always dissolved in ss.



This is standard IDNA2003 behavior.


Yes.


It is usually desirable


It is desirable in searching, but it wasn't desirable in domain names. 
The reason it got into IDNA2003 is that the IETF was looking for data 
to do case mapping beyond ASCII, and the data available from the Unicode 
Consortium included the 'ß'-to-'ss' mapping; the IETF didn't want to 
change it because they feared that might start all kinds of discussions 
on all kinds of (essentially unrelated) issues.



because a) many
German speakers are unsure about when exactly to use ß vs. ss,


Yes, but for many names, it's either one or the other. Essentially, no 
rules.



b) the
spelling reform a few years ago changed the rules,


Yes. They got way easier and more straightforward.


and c) Switzerland does
not use ß at all in German.


Yes. But that's no reason to take it away from those who use it.
(At least, being Swiss myself, I don't think so.)


This means that for most purposes it is
counter-productive (and can be a security risk) to distinguish ß and ss.


Well, it can be a security risk to distinguish between 'i' and 'l' and 
'1', and so on, and nevertheless, it's being done for good reasons all 
the time.



IDNA2008, an incompatible update, by itself does not map characters.


What's more important, IDNA2008 allows the 'ß' as is.


UTS #46
provides a compatibility bridge for both IDNA2003 and IDNA2008, and the ß
behavior is an option there.


Yes. The basic idea in TR #46 is that in a first phase, 'ß' is mapped to 
'ss' for lookup, to give registries with German clients a chance to 
allow their clients to register a true 'ß' where necessary. After that, 
the mapping can be dropped, so that in the (somewhat distant) future a 
name with 'ß' and a name with 'ss' can be resolved differently.


Regards,   Martin.


--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-04 Thread Martin J. Dürst
There is charlint (http://www.w3.org/International/charlint/), which is 
based on UTF-8. It may be possible to adapt it to UTF-16/32.


Regards,   Martin.

On 2010/11/04 4:37, Jim Monty wrote:

Is there a utility, preferably open source and written in C, that inspects
UTF-16/UTF-16BE/UTF-16LE text and identifies broken surrogate pairs and illegal
characters? Ideally, the utility can both report illegal code units and repair
them by replacing them with U+FFFD.

Jim Monty






--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Martin J. Dürst

On 2010/11/05 2:46, Markus Scherer wrote:


16-bit Unicode is convenient in that when you find an unpaired surrogate
(that is, it's not well-formed UTF-16) you can usually just treat it like a
surrogate code point which normally has default properties much like an
unassigned code point or noncharacter. It case-maps to itself, normalizes to
itself, has default Unicode property values (except for the general
category), etc.


Well, yes, you can handle it that way, but that's pretty much GIGO 
(garbage in, garbage out) and dumps the problem on the next person or 
piece of software downstream. Also, while some things might still work, 
much stuff won't, e.g. when you try to find a word (with a lone 
surrogate hidden in one place) by searching for the same word (with the 
lone surrogate hidden in another place, or with no such surrogate at 
all).
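
As a small illustration (a Ruby sketch of mine, not from the original 
mail): a lone surrogate makes a UTF-16 string ill-formed, so it can at 
least be detected before it is handed further downstream.

# A UTF-16BE string with a lone lead surrogate (0xD800) in the middle.
s = "\x00a\xD8\x00\x00b".force_encoding('UTF-16BE')
p s.valid_encoding?    # => false: the lone surrogate makes it ill-formed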



In other words, when you process 16-bit Unicode text it takes no effort to
handle unpaired surrogates, other than making sure that you only assemble a
supplementary code point when a lead surrogate is really followed by a trail
surrogate. Hence little need for cleanup functions -- but if you need one,
it's trivial to write one for UTF-16.


For some processing this is true, but it's rather short-sighted.

Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Martin J. Dürst

On 2010/11/05 8:30, Markus Scherer wrote:


If the conversion libraries you are using do not support this (I don't
know), then you could ask for such options. Or use conversion libraries that
do support such options (like ICU and Java).


The encoding conversion library in Ruby 1.9 also supports this. Here's 
an example:



utf16_borken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')
utf8_clean = utf16_borken.encode('UTF-8',
                                 invalid: :replace, replace: '')
puts utf8_clean   # prints "abcd"


In general, and in particular for Unicode Encoding Forms, it's a bad 
idea to just replace with nothing, because of the security 
implications this might have. I guess that's the reason Perl doesn't 
allow this. But if you are sure there are no security implications, then 
there is no reason to not remove lone surrogates.
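
A variant of the snippet above (same Ruby 1.9 String#encode options) 
that marks the problem with U+FFFD instead of silently dropping it:

utf16_borken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')
utf8_marked = utf16_borken.encode('UTF-8',
                                  invalid: :replace, replace: "\uFFFD")
puts utf8_marked   # prints "ab\uFFFDcd": the lone surrogate becomes U+FFFD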


Regards,   Martin.


P.S.: Why would you use Ruby for conversion when programming in Perl? 
You could just as well program in Ruby, it's much more fun!



--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Fwd: RFC 6082 on Deprecating Unicode Language Tag Characters: RFC 2482 is Historic

2010-11-08 Thread Martin J. Dürst

FYI.   Regards,   Martin.

 Original Message 
Subject: RFC 6082 on Deprecating Unicode Language Tag Characters: RFC 2482 is Historic

Date: Sun,  7 Nov 2010 21:50:44 -0800 (PST)
From: rfc-edi...@rfc-editor.org
To: ietf-annou...@ietf.org, rfc-d...@rfc-editor.org
CC: rfc-edi...@rfc-editor.org


A new Request for Comments is now available in online RFC libraries.


RFC 6082

Title:  Deprecating Unicode Language Tag Characters:
RFC 2482 is Historic
Author: K. Whistler, G. Adams,
M. Duerst, R. Presuhn, Ed.,
J. Klensin
Status: Informational
Stream: IETF
Date:   November 2010
Mailbox:k...@sybase.com,
gl...@skynav.com,
due...@it.aoyama.ac.jp,
randy_pres...@mindspring.com,
john+i...@jck.com
Pages:  4
Characters: 6633
Obsoletes:  RFC2482

I-D Tag:draft-presuhn-rfc2482-historic-02.txt

URL:http://www.rfc-editor.org/rfc/rfc6082.txt

RFC 2482, "Language Tagging in Unicode Plain Text", describes a
mechanism for using special Unicode language tag characters to
identify languages when needed without more general markup such as
that provided by XML.  The Unicode Consortium has deprecated that
facility and strongly recommends against its use.  RFC 2482 has been
moved to Historic status to reduce the possibility that Internet
implementers would consider that system an appropriate mechanism for
identifying languages.  This document is not an Internet Standards Track
specification; it is published for informational purposes.


INFORMATIONAL: This memo provides information for the Internet community.
It does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.

This announcement is sent to the IETF-Announce and rfc-dist lists.
To subscribe or unsubscribe, see
  http://www.ietf.org/mailman/listinfo/ietf-announce
  http://mailman.rfc-editor.org/mailman/listinfo/rfc-dist

For searching the RFC series, see http://www.rfc-editor.org/rfcsearch.html.
For downloading RFCs, see http://www.rfc-editor.org/rfc.html.

Requests for special distribution should be addressed to either the
author of the RFC in question, or to rfc-edi...@rfc-editor.org.  Unless
specifically noted otherwise on the RFC itself, all RFCs are for
unlimited distribution.


The RFC Editor Team
Association Management Solutions, LLC


___
IETF-Announce mailing list
ietf-annou...@ietf.org
https://www.ietf.org/mailman/listinfo/ietf-announce





Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Martin J. Dürst

On 2010/11/11 6:28, Mark Davis ☕ wrote:


That is actually not the case. There are superset relations among some of
the CJK character sets, and also -- practically speaking -- between some of
the windows and ISO-8859 sets. I say practically speaking because in general
environments, the C1 controls are really unused, so where a non-ISO-8859 set
is the same except for 80..9F you can treat it pragmatically as a superset.


Yes, except that the terms superset/subset (and set in general) 
shouldn't be used unless you really strictly speak about the repertoire 
of characters, and not the encoding itself. So e.g. the repertoire of 
iso-8859-1 is a subset of the repertoire of UTF-8. However, iso-8859-1 
is not a subset of UTF-8 (not because you can't label some text encoded 
as iso-8859-1, but because subset relationships among the encodings 
themselves don't make sense).
Also, US-ASCII is not a subset of UTF-8, because when you just use the 
names of the character encodings, you mean the character encodings, and 
character encodings don't have subset relationships.


It may well be possible to use (create?) the term sub-encoding, 
saying that an encoding A is a sub-encoding of encoding B if all (legal) 
byte sequences in encoding A are also legal byte sequences in encoding B 
and are interpreted as the same characters in both cases. In this sense, 
US-ASCII is clearly a sub-encoding of UTF-8, as well as a sub-encoding 
of many other encodings. You can also say that iso-8859-1 is a 
sub-encoding of windows-1252 if the former is interpreted as not 
including the C1 range.
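
As a concrete, if trivial, illustration, here is a Ruby sketch (mine, 
not part of the original mail) that checks the single-byte case of the 
sub-encoding relationship between US-ASCII and UTF-8:

# Every legal US-ASCII byte (0x00..0x7F) is also legal UTF-8 and is
# interpreted as the same character; longer sequences follow trivially.
(0x00..0x7F).all? do |b|
  ascii = [b].pack('C').force_encoding('US-ASCII')
  utf8  = [b].pack('C').force_encoding('UTF-8')
  ascii.valid_encoding? && utf8.valid_encoding? &&
    ascii.encode('UTF-8') == utf8
end   # => true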


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread Martin J. Dürst



On 2011/07/15 18:51, Michael Everson wrote:

On 15 Jul 2011, at 09:47, Andrew West wrote:



If you want a font to display a visible glyph for a format or space character 
then you should just map the glyph to its character in the font, as many fonts 
already do for certain format characters.


Sometimes I might want to show a dotted box for NBSP and sometimes a real NBSP. 
Or many other characters. Or show a RTL and LTR override character without 
actually overriding the text. You'd need a picture for that, because just 
putting in a glyph for it would also override the text.


I understand the need. But then what happens is that we need a picture 
in the standard for the character that depicts an RLO (but isn't 
actually one). And then you need another character to show that 
picture, and so on ad infinitum. This doesn't scale.


If we take the needs of character encoding experts when they write 
*about* characters to decide what to make a character, then we get far 
too many characters encoded. That's similar to the needs of typographers 
when they talk about different character shapes. If we had encoded a 
Roman 'a' and an Italic 'a' separately just because the distinction 
shows up explicitly in some texts on typography, that would have been a 
mistake (the separation is now available for IPA, but that's a separate 
issue).


Regards,   Martin.



Re: [bidi] Re: PRI 185 Revision of UBA for improved display of URL/IRIs

2011-07-29 Thread Martin J. Dürst

Hello Mark, others,

On 2011/07/28 5:01, Mark Davis ☕ wrote:

Just to remind people: posting to this list does *not* mean submitting to
the UTC. If you want to discuss a proposal here, not a problem, but just
remember that if you want any action you have to submit to the UTC.

Unicode members via: http://www.unicode.org/members/docsubmit.html
Others via: http://www.unicode.org/reporting.html


[I'll copy this text to the i...@ietf.org mailing list (mailing list of 
the EAI (Email Address Internationalization) WG), to have a public 
record, because that's the mailing list where most of the discussion 
about this draft in the IETF happened, as far as I'm aware.]



Context
===

I'm an individual Unicode member, but I'll paste this in to the 
reporting form because that's easier. Please make a 'document' out of it 
(or more than one, if that helps to better address the issues raised 
here). I apologize for being late with my comments.



Substantive Comments


On substance, while I don't agree with every detail of what Jonathan Rosenne, 
Behdad Esfahbod, Aharon Lanin, and others have said, I agree with them in 
general. If their documents/messages are not properly submitted, I 
include them herewith by reference.


The proposal is an enormous change in the Bidi algorithm, changing its 
nature in huge ways. Whatever the details eventually may look like, it 
won't be possible to get everything right in one step, and probably 
countless tweaks will follow (not that they necessarily will make things 
better, though). Also, dealing with IRIs will increase the 
appetite/pressure for dealing with various other syntactical constructs 
in texts.


The introduction of the new algorithm will create numerous compatibility 
issues (and attack surfaces for phishing, the main thing the proposal 
tries to address) for a long period of time. Given that the Unicode 
Consortium has been working hard to address (compared to this issue) 
even extremely minor compatibility issues re. IDNs in TR46, it's 
difficult for me to see how this fits together.



Taking One Step Back


As one of the first people involved with what's now called IDNs and 
IRIs, I know that the problem of such Bidi identifiers is extremely 
hard. The IETF, as the standards organization responsible for 
(Internationalized) Domain Names and for URIs/IRIs, has taken some steps 
to address it (there's a Bidi section in RFC 3987 
(http://tools.ietf.org/html/rfc3987#section-4), and for IDNs, there is 
http://tools.ietf.org/html/rfc5893).


I don't think these are necessarily sufficient or anything. And I don't 
think that the proposal at hand is completely useless. However, the 
proposal touches many aspects (e.g. recognizing IRIs in plain text,...) 
that are vastly more adequate for definition in another standards 
organization or where a high-bandwidth coordination with such an 
organization is crucial (roughly speaking, first on feasibility of 
various approaches, then on how to split up the work between the 
relevant organizations, then on coordination in details.) Without such a 
step back and high-bandwidth coordination, there is a strong chance of 
producing something highly suboptimal.


(Side comment on a detail: It would be better for the document to use 
something like
http://tools.ietf.org/html/rfc3987#section-2.2 rather than the totally 
obscure and no longer maintained 
http://rfc-ref.org/RFC-TEXTS/3987/chapter2.html, in the same way the 
Unicode Consortium would probably prefer to have its own Web site 
referenced for its work rather than some third-party Web site.)



Taking Another Step Back


I mention 'high-bandwidth' above. The Unicode Public Review process is 
definitely not suited for this. It has various problems:

- The announcements are often very short, formalistic, and cryptic
  (I can dig up examples if needed.)
- The announcements go to a single list; forwarding them to other
  relevant places is mostly a matter of chance. This should be improved
  by identifying the relevant parties and contacting them directly.
- To find the Web form, one has to traverse several links.
- The submission is via a Web form, without any confirmation that the
  comment has been received.
- The space for comments on the form is very small.
- There is no way to make a comment public (except for publishing it
  separately).
- There is no official response to a comment submitted to the Web form.
  One finds out about what happened by chance or not at all.
  (compare to W3C process, where WGs are required to address each
   comment formally, and most of them including the responses are
   public)
- The turnaround is slow. Decisions get made (or postponed) at UTCs
  only.
Overall, from an outsider's point of view, the review process and the 
review form feel like the eye of a needle connected to a black hole.


[I very much understand that part of the reason the UTC works the way it 
works is because of 

Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?

2011-09-09 Thread Martin J. Dürst

On 2011/09/10 9:32, Stephan Stiller wrote:

Actually, I *was* talking about purely typographic/aesthetic ligatures
as well. I'm aware that which di-/trigraphs need to be considered from a
font design perspective is language-dependent.


And this language-dependence is not only a question of letter 
combination frequency, but also of aesthetic preference.


What I have heard very often is that French has a preference for using 
many ligatures, whereas Italian uses almost none.



But the point is that I
observe that:
(a) aesthetic ligatures are not frequently seen in modern German print and
(b) the absence of such ligatures doesn't offend me (in modern German
print).


I think part of that comes from the fact that with modern DTP, lots of 
fonts are used across languages without any particular adjustments with 
respect to ligatures. (This may not be the case for high-end custom-made 
fonts used by publishing houses, but it's certainly true for the 
run-of-the-mill Times Roman, Helvetica, and so on used on PCs.)


Typography is always an interplay between designer, reader, and 
technology. So what probably happened is that the technology-induced use 
of the same fonts across languages led to designs with fewer 
language-specific ligatures (essentially a lowest common denominator in 
terms of ligatures) and to an adjustment of the designs so that this 
infrequency of ligatures would be less visible. Also, you and other 
readers got used to these designs.


Regards,   Martin.


It could be - and a quick visual check confirms this - that the fonts
used for printing of {novels, school textbooks, tech/science books, ...}
and the associated kerning tables don't necessitate ligatures or have
traditionally (fwiw) not been seen as necessitating them. Enough
professional publishing houses I _think_ don't use aesthetic ligatures,
so that, whenever I do see them in German text, they stand out to me. So
/de facto/ usage of aesthetic ligatures seems a bit like a locale
parameter to me.

That said - if I'm really factually wrong (and ligatures in modern
German text are just so subtle and pervasive that I never took notice),
people on the list please feel free to correct me.

Stephan

On 9/9/2011 4:14 PM, Kent Karlsson wrote:

I was talking about purely typographic ligatures, in particular
ligatures used because the glyphs (normally spaced) would otherwise
overlap in an unpleasing manner. If the glyphs don't overlap (or
there is extra spacing, which is quite ugly in itself if used in
normal text), no need to use a (purely typographic) ligature.
So it is a font design issue. (And then there are also ornamental
typographic ligatures, like the st ligature, but those are outside
of what I was talking about here.) But of course, which pairs of
letters (or indeed also punctuation) are likely to occur adjacently
is language dependent.

/Kent K


On 2011-09-09 23:45, Stephan Stiller <sstil...@stanford.edu> wrote:


Pardon my asking, as this is not my specialty:


There are several other ligatures
that *should* be formed (automatically) by run of the mill fonts:
for instance the fj ligature, just to mention one that I find
particularly important (and that does not have a compatibility code
point).

About the should - isn't this language-dependent? For example I recall
that ordinary German print literature barely uses any ligatures at all
these days (ie: I'm not talking about historical texts). And, has anyone
ever attempted to catalogue such ligature practices? (Is this suitable
for CLDR?)

(I also recall being taken aback by the odd look of ligatures in many
LaTeX-typeset English scientific documents, but I suspect that's rather
because some of the commonly used fonts there are lacking in aesthetic
design.)

Stephan










Re: continue: Glaring Mistake in nomenclature

2011-09-14 Thread Martin J. Dürst

Hello Delex,

On 2011/09/14 15:55, delex r wrote:


The “Dark age of the Assamese language” ran for about 37 years in this region, 
when vested interests, with the help of British political powers, tried to kill 
the language by imposing Bengali as the medium of instruction in schools and 
colleges and for all official purposes.


That sounds like a very sad story, but a long time ago. Please think 
about how you can affect the future, because you can't change the past.



I think now naming the script as “Bengali”, that too by stealing two unique 
letters from the Assamese alphabet list and coloring them with a Bengali hue, is 
part of that notorious linguistic invasion.


No, these letters clearly belong to the same script. That the script was 
named Bengali in the standard may be unfortunate, in particular from 
your viewpoint, but as far as the official standards are concerned, it 
can't be changed (as many others already have told you). Please note 
that you (and anybody else) can call this script whatever you think is 
most appropriate.


What I think you might be able to ask for is to have some annotation for 
the two letters in question, in the same way as e.g. the Arabic block 
has lots of annotations for what language uses which character for those 
characters that are not part of the base Arabic alphabet.


But why don't you look for things you can change? That would be much 
more productive in helping your goal of furthering the Assamese 
language. For example:
a) Check what problems (if any) there are with technologies such as 
CSS for styling,... to be able to use Assamese without problems on the 
Internet, the Web, and elsewhere. (If you find something, please 
direct any comments to the relevant mailing lists, and not to this one.)
b) (this one is easier, but requires more manpower): Contribute to the 
Assamese language by publishing content, contributing to Web sites such 
as Wikipedia, and so on. As an example, it looks as if the Wikipedia 
article on the Assamese language in the Assamese language 
(http://as.wikipedia.org/wiki/অসমীয়া_ভাষা) is still quite incomplete.


Regards,   Martin.



Re: Civil suit; ftp shutdown; mailing list shutdown

2011-10-06 Thread Martin J. Dürst
[By accident, I sent this only to Ken first; he recommended I send it to 
both Unicode and Unicore.]


I have sent a mail to a relevant IETF list (apps-disc...@ietf.org); the 
IETF was looking into taking this over, with 
http://tools.ietf.org/html/draft-lear-iana-timezone-database-04, but 
apparently, Unicode got alerted first.


In terms of practical matters, two points seem important to me:

First, to ask the judge for a temporary permission (there's a better 
legal term, but IANAL) to keep the database up until the law suit is 
settled (because the database is probably down now due to a temporary 
order from the judge to that effect) because of its high practical 
importance.


Second, what seems to be in dispute is data about old history. While 
this is important for some applications, in most applications, present 
and new data is much more important, so one way to avoid problems would 
be to publish only new data at some new place until the case is settled. 
That would mean that applications would have to be checked for whether 
they need the old data or not. Or to only publish diffs (which would be 
about new, present-day data not from the source under litigation).


Regards,   Martin.

On 2011/10/07 4:45, Ken Lunde wrote:

Arle and others,

The URL for the following blog post was tweeted a few minutes ago:

   http://blog.joda.org/2011/10/today-time-zone-database-was-closed.html

-- Ken

On Oct 6, 2011, at 9:45 AM, Arle Lommel wrote:


Is there any public information about the lawsuit? I was stunned to see the 
forwarded mail and want to understand the implications of this lawsuit, but I 
can't find any news about it other than Arthur’s rather telegraphic note. I 
understand that he may not be able to comment given pending litigation, but if 
we had any information at all about what the suit is, it might help clarify if 
there is any need for concern.

-Arle


It would be nice, but I don't think the Consortium can do that without first 
understanding if it gets exposed to its own lawsuit.

Eric.












Re: Civil suit; ftp shutdown; mailing list shutdown

2011-10-07 Thread Martin J. Dürst

Unicode people:

To follow this subject, I recommend looking through
http://mm.icann.org/pipermail/tz/ or subscribe to that mailing list at 
https://mm.icann.org/mailman/listinfo/tz.
In addition, please see 
http://www.ietf.org/mail-archive/web/apps-discuss/current/msg03374.html.


Regards,   Martin.

On 2011/10/07 14:14, Martin J. Dürst wrote:

[By accident, I sent this only to Ken first; he recommended I send it to
both Unicode and Unicore.]

I have sent a mail to a relevant IETF list (apps-disc...@ietf.org); the
IETF was looking into taking this over, with
http://tools.ietf.org/html/draft-lear-iana-timezone-database-04, but
apparently, Unicode got alerted first.

In terms of practical matters, two points seem important to me:

First, to ask the judge for a temporary permission (there's a better
legal term, but IANAL) to keep the database up until the law suit is
settled (because the database is probably down now due to a temporary
order from the judge to that effect) because of its high practical
importance.

Second, what seems to be in dispute is data about old history. While
this is important for some applications, in most applications, present
and new data is much more important, so one way to avoid problems would
be to publish only new data at some new place until the case is settled.
That would mean that applications would have to be checked for whether
they need the old data or not. Or to only publish diffs (which would be
about new, present-day data not from the source under litigation).

Regards, Martin.

On 2011/10/07 4:45, Ken Lunde wrote:

Arle and others,

The URL for the following blog post was tweeted a few minutes ago:

http://blog.joda.org/2011/10/today-time-zone-database-was-closed.html

-- Ken

On Oct 6, 2011, at 9:45 AM, Arle Lommel wrote:


Is there any public information about the lawsuit? I was stunned to
see the forwarded mail and want to understand the implications of
this lawsuit, but I can't find any news about it other than Arthur’s
rather telegraphic note. I understand that he may not be able to
comment given pending litigation, but if we had any information at
all about what the suit is, it might help clarify if there is any
need for concern.

-Arle


It would be nice, but I don't think the Consortium can do that
without first understanding if it gets exposed to its own lawsuit.

Eric.















Re: about P1 part of BIDI alogrithm

2011-10-10 Thread Martin J. Dürst



On 2011/10/10 21:10, Eli Zaretskii wrote:

Date: Mon, 10 Oct 2011 17:47:21 +0800
From: li bolibo@gmail.com



From section 3:


   Paragraphs are divided by the Paragraph Separator or appropriate
   Newline Function (for guidelines on the handling of CR, LF, and CRLF,
   see Section 4.4, Directionality, and Section 5.8, Newline Guidelines
   of [Unicode]). Paragraphs may also be determined by higher-level
   protocols: for example, the text in two different cells of a table
   will be in different paragraphs.



I think only 'Enter' and '*Paragraph separator*' can do paragraph breaking.


In addition to the Paragraph Separator, _any_ newline function (LF,
CR+LF, CR, or NEL) can end a paragraph.  Also U+2028, the LS
character.  See section 5.8 of the Unicode Standard cited above.


No, U+2028 (LS) is explicitly *not* a Paragraph Separator. It just 
indicates where to break a line (rather than leaving that to the 
implementation), but doesn't restart the Bidi algorithm.
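
To make this concrete, here is a small Ruby sketch (mine, not from 
UAX#9) of the paragraph-splitting part of rule P1: paragraph separators 
and the newline functions end a paragraph, but U+2028 LINE SEPARATOR 
does not.

# encoding: utf-8
# Split on CRLF, LF, CR, NEL and U+2029 PARAGRAPH SEPARATOR,
# but not on U+2028 LINE SEPARATOR.
PARA_BREAK = /\r\n|[\n\r\u0085\u2029]/
text = "one\u2029two\u2028still two\nthree"
p text.split(PARA_BREAK)   # => ["one", "two\u2028still two", "three"]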


Regards,   Martin.



Re: Solidus variations

2011-10-10 Thread Martin J. Dürst

On 2011/10/11 7:35, Philippe Verdy wrote:


I've seen various interpretations, but the ASCII solidus is
unambiguously used with a strong left-to-right associativity, and the
same occurs in classical mathematics notations (the horizontal bar is
another notation but even where it is used, it also has the equivalent
top-to-bottom associatity).


Horizontal bars surely work by using bars of differing length, with 
shorter bars having higher priority. Horizontal bars of equal length 
would be very weird.


Regards,   Martin.



Re: about P1 part of BIDI alogrithm

2011-10-10 Thread Martin J. Dürst

On 2011/10/11 10:29, Martin J. Dürst wrote:



On 2011/10/10 21:10, Eli Zaretskii wrote:

Date: Mon, 10 Oct 2011 17:47:21 +0800



In addition to the Paragraph Separator, _any_ newline function (LF,
CR+LF, CR, or NEL) can end a paragraph. Also U+2028, the LS
character. See section 5.8 of the Unicode Standard cited above.


No, U+2028 (LS) is explicitly *not* a Paragraph Separator. It just
indicates where to break a line (rather than leaving that to the
implementation), but doesn't restart the Bidi algorithm.


I might add here that 'break a line' in the Bidi algorithm is done 
before actual reordering (which is done line-by-line), but after 
calculating all the levels.


This is different from what you did in Emacs, which I'd call 
line-folding, i.e. cutting the line after a paragraph has been laid out and 
reordered completely as a single (potentially very long) line. This 
makes some sense in Emacs, where the basic assumption is that lines 
should fit into the width of the view.


Regards,   Martin.



Re: about P1 part of BIDI alogrithm

2011-10-10 Thread Martin J. Dürst

On 2011/10/11 13:07, Eli Zaretskii wrote:

Date: Tue, 11 Oct 2011 10:53:39 +0900
From: Martin J. Dürstdue...@it.aoyama.ac.jp
CC: li bolibo@gmail.com, unicode@unicode.org

This is different from what you did in Emacs, which I'd call
line-folding, i.e. cut the line after a paragraph is laid out and
reordered completely as a single (potentially very long) line. This
makes some sense in Emacs, where the basic assumption is that lines
should fit into the width of the view.


Sorry, I don't follow you.  There's no such line-folding in the
Emacs implementation of the UBA.  A line that doesn't fit the window
width is reordered as a whole.  Conceptually, reordering is done
before breaking a long line into continuation lines.


This is exactly what I meant. In Emacs, reordering is done before 
breaking a long line into smaller segments to fit into the width of the 
display window. I called this line-folding, you call it continuation 
lines.


But in the bidi algorithm itself, line breaking (be it automatic due to 
a layout algorithm or explicit due to LS or something similar) is 
applied *before* reordering. This is very important, because otherwise, 
content that is logically earlier may appear on later lines, which would 
be very confusing for readers.


Regards,   Martin.



Re: about P1 part of BIDI alogrithm

2011-10-11 Thread Martin J. Dürst

Hello Eli,

There is absolutely no problem in treating the algorithm in UAX#9 as a set 
of requirements and coming up with a totally different implementation 
that produces the same results. I think UAX#9 actually says so somewhere.


But what is, strictly speaking, not allowed is to change the 
requirements. One requirement of the algorithm is that when lines are 
broken, logically earlier characters stay on earlier lines, and 
logically later characters move to later lines.


In this respect, your implementation doesn't conform to UAX#9. There's 
an external reason for this, and an internal one. The external reason is 
that continuation lines in Emacs are in general just an overflow device; 
text in Emacs isn't supposed to be broken into lines in the same way as 
e.g. word processors break lines to form paragraphs. I'm not sure how 
much of this is true (line breaks often interfere with formatting, e.g. 
in Japanese and other languages that don't use spaces between words and 
don't work well with the convention of converting a line break in the 
source to a space in the output), but I think to some extent it is.


The internal reason is the one you describe below. It may indeed be a 
strong reason from an implementation perspective, but from a user 
perspective, it's a very weak reason. Also, I don't understand it fully. 
You say that the Emacs display engine examines each character in turn. 
Assuming these are in logical order, you would just examine them up to 
the point where you have about one line of glyphs. There would indeed 
be a bit of back and forth there because of the interaction between bidi 
algorithm and glyph selection (but as far as I know, mirrored glyphs 
mostly have the same width as their originals). Anyway, that bit of back 
and forth seems to be much less of a problem than the back and forth 
that you get when you have to reorder over much larger distances because 
you're essentially considering a whole paragraph as a single line. But 
I'm not an expert in Emacs display engine details, so I can't say for sure.


Regards,   Martin.

On 2011/10/11 16:43, Eli Zaretskii wrote:

Date: Tue, 11 Oct 2011 10:53:39 +0900
From: Martin J. Dürstdue...@it.aoyama.ac.jp
CC: li bolibo@gmail.com, unicode@unicode.org

I might add here that 'break a line' in the Bidi algorithm is done
before actual reordering (which is done line-by-line), but after
calculating all the levels.


Please be aware that this separation of the UBA into phases makes no
sense at all in the context of Emacs display engine.  The UBA is
written from the POV of batch processing of a block of text -- you
pass in a string in logical order, and receive a reordered string in
return.  The UBA describes the processing as a series of phases, each
one of which is completed for all the characters in the block of text
before the next phase begins.

By contrast, the Emacs display engine examines the text to display one
character at a time.  For each character, it loads the necessary
display and typeface information, and then decides whether it will fit
the display line.  Then it examines the next character, and so on.  It
should be clear that processing characters one by one completely
disrupts the subdivision of the UBA into the phases that include
examination of more than that single character, let alone decisions of
where to break the line, because reordering can no longer be done
line by line.

Let me give you just one example: if the character should be mirrored,
you cannot decide whether it fits the display line until _after_ you
know what its mirrored glyph looks like.  But mirroring is only
resolved at a very late stage of reordering, so if you want to reorder
_after_ breaking into display lines, you will have to back up and
reconsider that decision after reordering, which will slow you down.

Given these considerations, it is a small wonder that the UBA
implementation inside Emacs is _very_ different from the description
in UAX#9.  Therefore, the subdivision into phases that are on the line
and higher levels makes very little sense here, since the
implementation needed to produce an identical result while performing
a significant surgery on the algorithm description.  In effect, the
UBA implementation in Emacs treated UAX#9 as a set of requirements,
not as a high-level description of the implementation.






Re: about P1 part of BIDI alogrithm

2011-10-11 Thread Martin J. Dürst

Hello Kent,

I was also very much thinking that mirrored glyphs should be of the same 
width, but there might be subtle issues when you consider kerning. As a 
very basic example, think about kerning of the pair K), and then think 
about K(.


Regards,   Martin.

On 2011/10/11 19:39, Kent Karlsson wrote:


Den 2011-10-11 09:43, skrev Eli Zaretskiie...@gnu.org:


Let me give you just one example: if the character should be mirrored,
you cannot decide whether it fits the display line until _after_ you
know what its mirrored glyph looks like.  But mirroring is only
resolved at a very late stage of reordering, so if you want to reorder
_after_ breaking into display lines, you will have to back up and
reconsider that decision after reordering, which will slow you down.


Well, I think there is a silent (but reasonable, I would say) assumption
that mirroring does not change the width of a glyph... I would think that if
a font does not fulfill that, then you have a font problem (or mix of fonts
problem), not a bidi problem. Glyphs for characters that may mirror do not
normally form ligatures with other glyphs; and even if they do, the width of
the ligature should not change relative to the total width of the pre-ligature
glyphs involving glyphs for mirrorable characters (and if it does change
anyway, you again have a font problem that may result in a somewhat ugly
display that should be fixed by fixing the font, not a bidi problem). I'm
not thinking about Emacs here, but in general.

 IMHO
 /Kent K







Wrong UTF-8 encoders still around?

2011-10-20 Thread Martin J. Dürst
I'm hoping to get some advice from people with experience with various 
Unicode/transcoding libraries.


RFC 3987 (the current IRI spec) has the following text:

   Note: Some older software transcoding to UTF-8 may produce illegal
   output for some input, in particular for characters outside the
   BMP (Basic Multilingual Plane).  As an example, for the IRI with
   non-BMP characters (in XML Notation):
   "http://example.com/&#x10300;&#x10301;&#x10302;"
   which contains the first three letters of the Old Italic alphabet,
   the correct conversion to a URI is
   "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"
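
For reference, the correct conversion quoted above can be checked with a 
couple of lines of Ruby (a sketch of mine, not part of the RFC text): 
UTF-8-encode the non-BMP characters and percent-encode each byte.

# encoding: utf-8
old_italic = "\u{10300}\u{10301}\u{10302}"   # the first three Old Italic letters
puts old_italic.bytes.map { |b| format('%%%02X', b) }.join
# => %F0%90%8C%80%F0%90%8C%81%F0%90%8C%82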

We are thinking about removing this because we hope that software has 
improved in the meantime, but we would like to be sure about this.


If anybody knows about software out there that still presents this 
problem, please tell us.


Thanks,   Martin.



Forum Problems

2011-10-24 Thread Martin J. Dürst
How can one use the Forum to comment on URI/IRI issues when one gets a 
message:


Your message contains too many URLs. The maximum number of URLs allowed 
is 8.


I never liked this forum stuff too much, and this hasn't made things 
better :-(.


Regards,   Martin.



Default bidi ranges

2011-11-09 Thread Martin J. Dürst
I tried to find something like a normative description of the default 
bidi class of unassigned code points.


In UTR #9, it says 
(http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types):


Unassigned characters are given strong types in the algorithm. This is 
an explicit exception to the general Unicode conformance requirements 
with respect to unassigned characters. As characters become assigned in 
the future, these bidirectional types may change. For assignments to 
character types, see DerivedBidiClass.txt [DerivedBIDI] in the [UCD].


The DerivedBidiClass.txt file, as far as I understand, is mainly a 
condensation of bidi classes into character ranges (rather than giving 
them for each codepoint independently as in UnicodeData.txt). I.e. it 
can at any moment be derived automatically from UnicodeData.txt, and is 
as such not normative.


Why is it then that the default class assignments are only given in this 
file (unless I have overlooked something)? And why is it that they are 
only given in comments? I'm trying to create a program that takes all 
the bidi assignments (including default ones) and creates the data part 
of a bidi algorithm implementation, but I don't feel confident to code 
against stuff that's in comments. Any advice? Is it possible that this 
could be fixed (making it more normative, and putting it in a form 
that's easier to process automatically)?
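
In case it helps, here is the kind of thing I mean, as a Ruby sketch 
(my own; it assumes that the defaults appear in '# @missing:' comment 
lines of DerivedBidiClass.txt, and that the file is in the current 
directory):

# Pull the default bidi-class ranges out of the "# @missing:" comment
# lines of DerivedBidiClass.txt.
defaults = []
File.foreach('DerivedBidiClass.txt') do |line|
  if line =~ /^#\s*@missing:\s*([0-9A-F]+)\.\.([0-9A-F]+)\s*;\s*([\w ]+)/
    defaults << [$1.hex, $2.hex, $3.strip]
  end
end
defaults.each { |lo, hi, cls| printf("%04X..%04X  %s\n", lo, hi, cls) }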


Regards,   Martin.



Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-20 Thread Martin J. Dürst

On 2011/11/21 5:54, Asmus Freytag wrote:

On 11/20/2011 8:00 AM, Joó Ádám wrote:

Leaving aside that CSS is presentation and not content, and is
definitely not markup. HTML is a better candidate.

Á


The details of the appearance of the mark would be presentation.
The scoping, like for applying every other style feature, would have to
be supplied via HTML, XML you name it.
I can see where you'd want something other than a generic span to
provide that scoping.


I agree with Asmus here.

It's important to point out that having it in CSS doesn't mean that it 
couldn't also go into HTML. But these days, anything presentational goes 
into CSS, and if there's markup with a default presentation, then HTML 
just mentions the markup, and for presentation defers to CSS.


Putting it in CSS also means that it can be used from other kinds of 
markup (e.g. totally unrelated to HTML or even XML).


If you want to make serious progress, I propose to check what's in TEI 
(because that, and not HTML, is the markup of choice for these kinds of 
text). If what's currently in TEI isn't sufficient in terms of markup, 
please work with them to improve the situation.


Also, work with CSS to look into the presentation issues. In particular, 
look at what there's already around from the presentation side of MathML.


In HTML, it should always be possible to start generically, i.e. with a 
span and some class attributes.


Regards,   Martin.



Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst

On 2012/04/28 4:26, Mark Davis ☕ wrote:

Actually, if the goal is to get as many characters in as possible, Punycode
might be the best solution. That is the encoding used for internationalized
domains. In that form, it uses a smaller number of bytes per character, but
a parameterization allows use of all byte values.


Because punycode encodes differences between character numbers, not the 
character numbers themselves, it can indeed be quite efficient in 
particular if the characters used are tightly packed (e.g. Greek, 
Hebrew,...). For languages with Latin script and accented characters, 
the question is how close these accented characters are in Unicode.


However, punycode also codes character positions. Because of this, it 
gets less efficient for longer text.


[Because punycode uses (circular) position differences rather than 
simple positions, this contribution is limited by the (rounded-up binary 
logarithm of the) weighted average distance between two occurrences of the same character 
in the text/language.]


My guess is therefore that punycode won't necessarily be super-efficient 
for texts in the 100+ character range. It's difficult to test quickly 
because the punycode converters on the Web limit the output to 63 
characters, the maximum length of a label in a domain name.
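
A local test would avoid that 63-character limit; for example, with the 
third-party 'simpleidn' Ruby gem (an assumption on my side, any Punycode 
encoder would do), one can measure the output length directly:

# encoding: utf-8
require 'simpleidn'   # third-party gem providing SimpleIDN.to_ascii

[5, 20, 57].each do |n|
  label = 'α' * n
  puts "#{n} chars -> #{SimpleIDN.to_ascii(label).length} ASCII chars"
end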


Regards,   Martin.



Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst

On 2012/04/28 7:29, Cristian Secară wrote:

În data de Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ a scris:


Actually, if the goal is to get as many characters in as possible,
Punycode might be the best solution. That is the encoding used for
internationalized domains. In that form, it uses a smaller number of
bytes per character, but a parameterization allows use of all byte
values.


I suspect the punycode goal is to take a wide character set into a
restricted character set, without caring much on resulting string
length; if the original string happens to be in other character set
than the target restricted character set, then the string length
increases too much to be of interest in the SMS discussion.


Not exactly. Compression was very much a goal when designing punycode. 
It won against a number of other algorithms as the choice for IDNs and 
is clearly very good for that purpose.




Just do a test: write something in a non-Latin alphabetic script into
this page here http://demo.icu-project.org/icu-bin/idnbrowser


Well, as a silly example, what about a label consisting of 57 α characters?
The result (starting with xn--mxa...) is 63 characters long.

Regards,   Martin.



Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst

On 2012/04/27 17:06, Cristian Secară wrote:


It turned out that they (ETSI & its groups) created a way to solve the
70-character limitation, namely “National Language Single Shift” and
“National Language Locking Shift” mechanism. This is described in 3GPP
TS 23.038 standard and it was introduced since release 8. In short, it
is about a character substitution table, per character or per message,
per-language defined.

Personally I find this to be a stone-age-like approach,


Fully agreed.


which in my
opinion does not work at all if I enter the message from my PC keyboard
via the phone's PC application (because the language cannot always be
predicted, mainly if I am using dead keys). It is true that the actual
SMS stream limit is not much generous, but I wonder if the SCSU would
have been a better approach in terms of i18n. I also don't know if the
SCSU requires a language to be declared beforehand, or if it simply guesses
by itself the required window for each character.


The right approach in this case isn't to discuss clever compression 
techniques (I've indulged in this in my other mails, too, sorry), but to 
realize that the underlying mobile/wireless technology has advanced a lot.


SMSes are simply a relic of outdated technology, sold at a horrendous 
price. For more information, see e.g. 
http://mobile.slashdot.org/comments.pl?sid=433536&cid=22219254 or 
http://gthing.net/the-true-price-of-sms-messages. That's even for the 
case of pure ASCII messages.


The solution is simply to stop using SMSes, and upgrade to a better 
technology.


Regards,   Martin.



Re: Unicode, SMS and year 2012

2012-04-29 Thread Martin J. Dürst

On 2012/04/29 18:58, Szelp, A. Sz. wrote:

While there are good reasons the authors of HTML5 brought to ignore SCSU or
BOCU-1, having excluded UTF-32 which is the most direct, one-to-one mapping
of Unicode codepoints to byte values seems shortsighted.


Well, except that it's hopelessly inefficient and therefore essentially 
nobody is using it.



We are talking about the whole of Unicode, not just BMP.


Yes. For transmission, use UTF-8 (or maybe UTF-16).

Regards,Martin.



Flag tags (was: Re: Unicode 6.2 to Support the Turkish Lira Sign)

2012-05-29 Thread Martin J. Dürst

On 2012/05/29 17:43, Asmus Freytag wrote:

On 5/27/2012 5:52 PM, Michael Everson wrote:

Get over it. Please just get over it. It doesn't matter. It's a blort.


Time to agree with Michael.

Get over it, is good advice here.

Sovereign countries are free to decree currency symbols, whatever their
motivation or the putative artistic or typographic merits of the symbol
in question. Not for Unicode to judge.


I'd have to agree here.

On a slightly (although maybe only slightly) related matter, what if 
Unicode also didn't judge how difficult it should be to display national 
flags? Creating a way to display flags from two-tag combinations, and 
then later realizing that a sequence of such tags doesn't parse locally 
and the whole thing has to be redone, doesn't seem like a very good 
alternative to just encoding these things (not that I think that just 
encoding them is a very good alternative either, though).


Regards,   Martin.



Re: Unicode 6.2 to Support the Turkish Lira Sign

2012-05-30 Thread Martin J. Dürst

On 2012/05/30 4:42, Roozbeh Pournader wrote:


Just look what happened when the Japanese did their own font/character set
hack. The backslash/yen problem is still with us, to this day...


To be fair, the Japanese Yen at 0x5C was there long before Unicode, in 
the Japanese version of ISO 646. That it has remained as a font hack is 
very unfortunate, but for that, not only the Japanese, but also major 
international vendors are to blame.


Regards,   Martin.



Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-10 Thread Martin J. Dürst

On 2012/07/11 4:37, Asmus Freytag wrote:


I recall, with certainty, having seen the : in the context of
elementary instruction in arithmetic,
as in 4 : 2 = ?, but am no longer positive about seeing ÷ in the same
context.


I remember this very well. In grade school, we had to learn two ways to 
divide, which were distinguished by using two symbols, ':' and '÷', and 
different verbs, the German equivalents of 'divide' and 'measure'.


I'll explain the difference with two examples:

a) There are 12 apples, and four kids. How many apples does each kid 
get? [answer: 3 apples]


b) There are 12 apples, and each kid gets 4 of them. For how many kids 
will that be enough? [answer: for 3 kids]


I think a) was called 'divide' and b) was called 'measure', but I can't 
remember which symbol was used for which.


When we were learning this, I thought it was a bit silly, because the 
numbers were the same anyway. It seems to have been based on the 
observation that at a certain stage in the development of arithmetic 
skills, children may be able to do division (in the general, numeric 
sense) one way but not the other, or that they get confused about the 
units in the answer. But while such an observation may be true, I don't 
think such a stage lasts very long, definitely not as long as we had to 
keep the distinction (at least through second and third grade).


Also, I think this may have been a local phenomenon, both in place and 
time. But if one searches for 'geteilt gemessen', one gets links such as 
this:

http://www.niska198.de.tl/Gemessen-oder-Geteilt-f-.htm
So maybe some of this is still in use.

Regards,   Martin.



Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-10 Thread Martin J. Dürst

On 2012/07/11 10:35, Stephan Stiller wrote:

About Martin Dürst's content re geteilt-gemessen:

When I attended the German school system in approx the 1990s this
distinction wasn't mentioned or taught. (I prefer to not give details
about specific time and place for privacy reasons.)


Sorry, but I forgot to mention that my experience was in Switzerland, in 
the late 1960s. Actually, given that the education system in 
Switzerland is handled by the Cantons, I should say that it was in the 
Canton of Zurich.


Regards,   Martin.



From looking into
textbooks and formula collections at that time I recall not having found
any mention of or explanation for such a differentiation. Given that I
also haven't seen many people use that symbol I would suspect that, for
some time, this was an elementary school thing in Germany. For me, the
symbol ÷ also only ever appeared on calculators. I don't think it
appeared ever in primary or secondary school textbooks I've worked with
and wasn't used for handwritten arithmetic at my schools either.

Stephan

PS: Thank you! You've just solved a mystery for me - something I've been
told about a long time ago by an older person but couldn't find
references for at the time.






Re: Sinhala naming conventions

2012-07-10 Thread Martin J. Dürst

On 2012/07/11 11:04, Mark E. Shoulson wrote:


Ever start to feel that we would have been better off not to give
official descriptive names at all? Or else really vague ones like
LETTERLIKE THINGY NUMBER 5412? So much blood-pressure raised over the
names...


I have been feeling that way since about the mid-1990s, when I discovered 
that for CJK ideographs, there is a cop-out of
CJK UNIFIED IDEOGRAPH-4E00 and so on. It's also the only place where 
numerals are allowed in character names.


Regards,   Martin.



Re: pre-HTML5 and the BOM

2012-07-13 Thread Martin J. Dürst

On 2012/07/13 0:12, Leif Halvard Silli wrote:

Doug Ewell, Wed, 11 Jul 2012 09:12:46 -0600:


and people who want to create or modify UTF-8 files which will
be consumed by a process that is intolerant of the signature
should not use Notepad. That goes for HTML (pre-5) pages [snip]


HTML5-parsers MUST support UTF-8. They do not need to support any other
encoding. Pre-HTML5-parsers are not required to support the UTF-8
encoding - or any other particular encoding.


Up to here, that's indeed what the spec says, except for XHTML, which is 
XML and therefore includes UTF-8 (and UTF-16) support, but my guess is 
that you didn't include this.



But when they do support
the UTF-8 encoding, they are, however, not permitted to be 'intolerant'
of the BOM.


Where does it say so?

Regards,   Martin.



Thus there is nothing special with regard to the UTF-8 BOM and
pre-HTML5 HTML.




Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst

On 2012/07/13 22:31, Jukka K. Korpela wrote:

2012-07-13 16:12, Leif Halvard Silli wrote:


The kind of BOM intolerance I know about in user agents is that some
text browsers and IE5 for Mac (abandoned) convert the BOM into a
(typically empty) line at the start of the body element.


I wonder if there is any evidence of browsers currently in use that have
problems with BOM.


I'd assume that so-called modern browsers don't have such problems.


I suppose such browsers existed, though I can't be sure.


They indeed did exist.


In any cases, for several years I haven't seen any descriptions of
real-life observations, but there are rumors and warnings, and people
get disturbed. Even reputable sites have instructions against using BOM:

When the BOM is used in web pages or editors for UTF-8 encoded content
it can sometimes introduce blank spaces or short sequences of
strange-looking characters (such as ï»¿). For this reason, it is usually
best for interoperability to omit the BOM, when given a choice, for
UTF-8 content.
http://www.w3.org/International/questions/qa-byte-order-mark


This could be toned down a bit, but I still agree (and the Unicode 
consortium says the same): There may be good reasons to use a BOM, but 
if these reasons don't apply, then don't use it.


Regards,Martin.



Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst

On 2012/07/14 1:33, Philippe Verdy wrote:

Fra: Jukka K. Korpelajkorp...@cs.tut.fi

When the BOM is used in web pages or editors for UTF-8 encoded content it
can sometimes introduce blank spaces or short sequences of strange-looking
characters (such as ï»¿). For this reason, it is usually best for
interoperability to omit the BOM, when given a choice, for UTF-8 content.

http://www.w3.org/International/questions/qa-byte-order-mark



This statement for maximum interoperability may have been true in the
past, when Unicode support was not so universal and still not adopted
formally for all newer developments in RFCs published by the IETF. But
now the situation is reversed: maximum interoperability is offered
when BOMs are present, not really to indicate the byte order itself,
but to confirm that the content is Unicode encoded and extremely
likely to be text content and not arbitrary binary contents (that
today almost always use a distinctive leading signature).


As you mention the IETF, what people in the IETF like most about UTF-8 
is that it's upward-compatible with ASCII. Because the 
protocol/syntax-relevant part is usually ASCII only, that means that a 
lot of stuff can work just by making things 8-bit clean (which in this 
day and age may mean essentially no work in some cases).


A BOM anywhere in a protocol therefore just removes the biggest 
advantage of UTF-8. While it's usually okay to use a BOM at the start of 
a whole file (or the file equivalent in transmission, which is a MIME 
entity), anywhere else (e.g. in small protocol fields), a BOM is a big 
no-no.


Regards,   Martin.



Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst

On 2012/07/17 17:22, Leif Halvard Silli wrote:


And an argument was put forward in the WHATWG mailinglist
earlier tis year/end of previous year, that a page with strict ASCII
characters inside could still contain character entities/references for
characters outside ASCII.


Of course they can. That's the whole point of using numeric character 
references. I'm rather surprised that this was even discussed in the 
context of HTML5.



For instance, early on in 'the Web', some
appeared to think that all non-ASCII had to be represented as entities.


Yes indeed. There's still some such stuff around. It's mostly 
unnecessary, but it doesn't hurt.


Regards,Martin.



Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst

Hello Leif,

Sorry to be late with my answer.

On 2012/07/13 20:44, Leif Halvard Silli wrote:

Martin J. Dürst, Fri, 13 Jul 2012 18:17:05 +0900:

On 2012/07/13 0:12, Leif Halvard Silli wrote:

Doug Ewell, Wed, 11 Jul 2012 09:12:46 -0600:


and people who want to create or modify UTF-8 files which will
be consumed by a process that is intolerant of the signature
should not use Notepad. That goes for HTML (pre-5) pages [snip]


HTML5-parsers MUST support UTF-8. They do not need to support any other
encoding. Pre-HTML5-parsers are not required to support the UTF-8
encoding - or any other particular encoding.


Up to here, that's indeed what the spec says, except for XHTML, which
is XML and therefore includes UTF-8 (and UTF-16) support, but my
guess is that you didn't include this.


Right. I meant pre-HTML5 HTML as text/html. Not pre-HTML5 HTML as XML.


But when they do support
the UTF-8 encoding, they are, however, not permitted to be 'intolerant'
of the BOM.


Where does it say so?


What is 'it'?


That pre-HTML5 (as text/html) browsers are not permitted to be 
'intolerant' of the BOM.



HTML5 tells how UAs should use BOM to decide the encoding. By
pre-HTML5, I meant the 'text/html' MIME space, though I gave much
weight to HTML4 ...

I see that HTML4 for UTF-8 points to RFC2279,[1] which was silent about
the UTF-8 BOM. Only with RFC3629 from 2003, is the UTF-8 BOM
described.[3]


Yes exactly. In the RFC 2070 and HTML4 time-frame, nobody that I knew of 
was thinking about a BOM for UTF-8. Only later did BOMs at the start of 
HTML4 files start to turn up, and browser makers were surprised. Roughly the 
same happened for XML. Early XML parsers didn't handle the BOM.


When Windows Notepad started to use the BOM to distinguish between UTF-8 
and ANSI (the local system legacy encoding), this BOM leaked into 
HTML, and was difficult to stop. So XML got updated, and parsers started 
to get updated, too.




As for XML 1.0, then revision 2 from year 2000 appears to
be the first time the XML spec describes the UTF-8 BOM.[4] The Appendix
C 'profile' of XHTML 1.0 - which was issued year 2000 and revised 2002
- is also part of the text/html MIME registration of June 2000.[5] The
MIME registration contains a general mention of UTF-8 as preferred, but does not talk
about the UTF-8 BOM. XHTML 1.0 itself, strangely enough, does not reflect
much on whether XML's default encoding(s) apply when serving XHTML
as text/html.[6] Though, it does actually say, in appendix C: [7]
"Remember, however, that when the XML declaration is not included in a
document, the document can only use the default character encodings
UTF-8 or UTF-16." Here it does sound as if XHTML, even when served
according to appendix C, should subject itself to XML's encoding rules.

So, given the age of the documents, neither HTML4 from 1999 nor the
'text/html' MIME registration permits anyone to be
'intolerant' of the UTF-8 BOM, but neither do they permit anyone to be
'tolerant' of it. They are silent on the issue.


You read silence as not taking sides, which makes sense from your 
viewpoint. Knowing what implementations did (in a pre-1999 time-frame), 
the idea of UTF-8 BOM just didn't really exist, so nobody thought about 
mentioning it.


Regards,   Martin.


RFC3629 says that protocols may restrict usage of the BOM as a
signature.[3] However, text/html does not offer any such
restrictions. If one sees HTML4 as being as tied to RFC2279 as XML up until
and including the 4th revision was tied to specific versions of Unicode,
then this has not changed. But would it not be natural to consider that
text/html user agents currently have to consider RFC3629 as more
normative than RFC2279? I do at least not think that user agents that
want to be conforming pre-HTML5 user agents have any justification for
ignoring the BOM.

[1] http://www.w3.org/TR/html401/appendix/notes#h-B.2.1
[2] http://tools.ietf.org/html/rfc2279
[3] http://tools.ietf.org/html/rfc3629#section-6
[4] http://www.w3.org/TR/2000/WD-xml-2e-2814
[5] http://tools.ietf.org/html/rfc2854
[6] http://www.w3.org/TR/xhtml1/#C_9
[7] http://www.w3.org/TR/xhtml1/#C_1


Thus there is nothing special with regard to the UTF-8 BOM and
pre-HTML5 HTML.




Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst

Hello Leif,

On 2012/07/18 4:35, Leif Halvard Silli wrote:


But is the Windows Notepad really to blame?


Pretty much so. There may have been other products from Microsoft that 
also did it, but with respect to forcing browsers and XML parsers to 
accept an UTF-8 BOM as a signature, Notepad was definitely the main 
cause, by far.



OK, it was leading the way.
But can we think of something that could have worked better, in
praxis? And, no, I don't mean 'better' as in 'not leaking the BOM into
HTML'. I mean 'better' as in 'spreading the UTF-8 to the masses'.


UTF-8 is easy and cheap to detect heuristically. It takes a bit more 
work to scan the whole file than to just look at the first few bytes, 
but then I don't think anybody is/was editing 1MB files in Notepad. So 
the BOM/signature is definitely not the reason that UTF-8 spread on the 
Web and elsewhere.
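For illustration, a minimal sketch of such a heuristic in Python 3 (the 
fallback to windows-1252 is just an assumption for the example; any local 
legacy encoding could take its place):

  def sniff(data: bytes) -> str:
      # Data that decodes as UTF-8 is almost certainly UTF-8 (or plain
      # ASCII, which is the same thing for our purposes); anything else
      # is assumed to be in the legacy encoding.
      try:
          data.decode("utf-8")
      except UnicodeDecodeError:
          return "windows-1252"
      return "utf-8"

  print(sniff("grüß dich".encode("utf-8")))          # utf-8
  print(sniff("grüß dich".encode("windows-1252")))   # windows-1252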


The spread of UTF-8 is due to its strict US-ASCII compatibility. Every 
US-ASCII character/byte represents the same character, and only that 
character, in UTF-8. A plain ASCII file is an UTF-8 file. If 
syntax-significant characters are ASCII, then (close to) nothing may 
need to change when moving from a legacy encoding to UTF-8. On top of 
that, character synchronization is very easy because leading bytes and 
trailing bytes have strictly separate values. From that viewpoint, the 
BOM is a problem rather than a solution.



… snip …

So, given the age of the documents, neither HTML4 from 1999 nor the
'text/html' MIME registration permits anyone to be
'intolerant' of the UTF-8 BOM, but neither do they permit anyone to be
'tolerant' of it. They are silent on the issue.


You read silence as not taking sides, which makes sense from your
viewpoint. Knowing what implementations did (in a pre-1999
time-frame), the idea of UTF-8 BOM just didn't really exist, so
nobody thought about mentioning it.


It is interesting to think about this history. And the fact that it was
unrealized. Maybe _that_ is due to the fact that, back then, one
saw XML as the way forward - which meant that there was not the same
need for the UTF-8 BOM due to XML's default to UTF-8.

However, I think there are two ways to interpret Pre-HTML5: Historic,
about 1998. Or current, about choices today: 'this browser is fully
dedicated to HTML4 but does not intend to implement HTML5'. Pointing to
HTML4 for lack of BOM implementation, would be a very thin excuse.


I think that a browser fully dedicated to HTML4 but not intending to 
implement HTML5 will eventually die out. If it exists today, it would 
indeed be reasonable to accept the BOM. But that's not because reading 
the spec(s) leads to that as the only conclusion, it's because there's 
content out there that starts with a BOM.


Regards,   Martin.



Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Martin J. Dürst

Hello Philippe,

On 2012/07/18 3:37, Philippe Verdy wrote:

2012/7/17 Julian Bradfieldjcb+unic...@inf.ed.ac.uk:

On 2012-07-16, Philippe Verdyverd...@wanadoo.fr  wrote:

I am also convinced that even Shell interpreters on Linux/Unix should
recognize and accept the leading BOM before the hash/bang starting
line (which is commonly used for filetype identification and runtime

The kernel doesn't know or care about character sets. It has a little
knowledge of ASCII (or possibly EBCDIC) hardwired, but otherwise it deals
with 8-bit bytes. It has no concept of text file.


Yes I know. But most tools and script should know on which type of
file they are operating on. Unfortunately the tools are as well
agnostic and just rely on things that do not pass the transport
protocols. Such as filename conventions.


Just writing that you are convinced about something a shell should do 
doesn't change anything. Maybe you can create a patch (or a few patches, 
because there are quite a few tools out there in the Linux/Unix world) and 
see if you can convince the respective maintainers that it's indeed a 
good idea.


[As others with some amount of Linux/Unix background, I strongly doubt 
that for Linux/Unix, the BOM is a good idea.]


Regards,   Martin.



Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst

Hello Jukka,

On 2012/07/17 23:31, Jukka K. Korpela wrote:

2012-07-17 17:11, Leif Halvard Silli wrote:


For instance, early on in 'the Web', some
appeared to think that all non-ASCII had to be represented as entities.


Yes indeed. There's still some such stuff around. It's mostly
unnecessary, but it doesn't hurt.


Actually, above I described an example where it did hurt ...


The situation is comparable to the BOM issue.


In a very general sense, probably yes.


In the old days, it was
considered (with good reasons presumably) safer to omit the BOM than to
use it in UTF-8,


Yes indeed.


and it was considered safer to use entity references
rather than direct non-ASCII data.


Well, the 'considered' in the BOM case applies to everybody (including 
the W3C), but in the character references case, it applies only to 
people who didn't understand how things were working. In fact, although 
RFC 2070 and HTML4 clearly nailed down the interpretation of numeric 
character references to Unicode, there were implementations that got 
this wrong (the ones I knew of were in the mobile space) even past 2000.




To take a more modern example, the native e-mail client on my Android
seems to systematically display character and entity references
literally when displaying message headers with small excerpts of
content, even though it correctly interprets them when displaying the
message itself.


The reason for this may simply be that email bodies can be in HTML, but 
that there is no way at all to use HTML in email header fields.


Regards,   Martin.



Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Martin J. Dürst

Hello Doug,

On 2012/07/18 0:35, Doug Ewell wrote:

For those who haven't had enough of this debate yet, here's a link
to an informative blog (with some informative comments) from Michael
Kaplan:

Every character has a story #4: U+feff (alternate title: UTF-8 is the
BOM, dude!)
http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx

What should be interesting is that this blog dates to January 2005,
seven and a half years ago, and yet includes the following:

But every 4-6 months another huge thread on the Unicode List gets
started


Well, less or more than 4-6 months, but yes.


about how bad the BOM is for UTF-8 and how it breaks UNIX tools
that have been around and able to support UTF-8 without change for
decades


Yes indeed. The BOM and Unix/Linux tools don't work well together.


and about how Microsoft is evil for shipping Notepad that causes
all of these problems


That's a bit overblown, but I guess for a Microsoft employee, it looks 
like this.



and how neither the W3C nor Unicode would have
ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it,


That's true, too. It was indeed Notepad that brought the UTF-8 
BOM/signature to the attention of the W3C and the browser makers.


The problem with the BOM in UTF-8 is that it can be quite helpful (for 
quickly distinguishing between UTF-8 and legacy-encoded files) and quite 
damaging (for programs that use the Unix/Linux model of text 
processing), and that's why it creates so much controversy.
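Python 3's codecs show both sides of this quite nicely; a small sketch 
(the sample text is made up for the example):

  data = "käse".encode("utf-8-sig")        # 'utf-8-sig' writes the signature
  print(data[:3])                          # b'\xef\xbb\xbf', the UTF-8 signature
  print(repr(data.decode("utf-8")))        # '\ufeffkäse': plain UTF-8 keeps U+FEFF,
                                           # which is what trips up line-oriented tools
  print(repr(data.decode("utf-8-sig")))    # 'käse': the signature is stripped on input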


Regards,   Martin.



Re: pre-HTML5 and the BOM

2012-07-18 Thread Martin J. Dürst

On 2012/07/18 16:35, Leif Halvard Silli wrote:

Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900:



The best reason is simply that nobody should be using
crutches as long as they can walk with their own legs.


Crutches, in that sense, is only about authoring convenience. And, of
course, it is a difference between using named and numeric character
references for a single non-ASCII letter as opposed to using it for all
of them. Nevertheless: I, as Web author, would perhaps skip that
convenience if I knew that doing so could improve e.g. HTML5 browser's
ability to sniff the encoding correctly when all other encoding info is
lost. If such sniffing can be an alternative to the BOM, and the BOM is
questionable, then why not mention it as a reason to avoid the crutches?


I'm not sure there are many people for whom using named character 
entities or numeric character references is a convenience. But for those 
for whom it is a convenience, let them use it.


Regards,   Martin.



Re: pre-HTML5 and the BOM

2012-07-18 Thread Martin J. Dürst

Hello Leif,

I think that more and more, we are on the wrong mailing list.

Regards,   Martin.

On 2012/07/18 18:47, Leif Halvard Silli wrote:

Martin J. Dürst, Wed, 18 Jul 2012 17:20:31 +0900:

On 2012/07/18 16:35, Leif Halvard Silli wrote:

Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900:



The best reason is simply that nobody should be using
crutches as long as they can walk with their own legs.


Crutches, in that sense, is only about authoring convenience.
[…] Nevertheless: I, as Web author, would perhaps skip that
convenience if I knew that doing so could improve e.g. HTML5
browser's ability to sniff the encoding correctly […]


I'm not sure there are many people for whom using named character
entities or numeric character references is a convenience. But for
those for whom it is a convenience, let them use it.


By all means: Let them.

But the W3C's I18N working group still gives out advice about when to
(not) use escapes.[1] Advice which the homepage of W3.org breaks -
since every non-ASCII character of http://www.w3.org is escaped.

What the I18N group says in that document, is a bit moralistic (along
the lines 'please think about how difficult it is for non-English
authors to read escapes for all their characters'). It seems to me that
a mention of real effects on browser behavior could be a better form of
advice. Especially when coupled with advice about avoiding the BOM.[2]

[1] http://www.w3.org/International/techniques/authoring-html#escapes
[2] http://www.w3.org/International/questions/qa-byte-order-mark#bomhow




Re: Unicode String Models

2012-07-20 Thread Martin J. Dürst

On 2012/07/21 7:01, David Starner wrote:


I'm concerned about the statement/implication that one can optimize
for ASCII and Latin-1. It's too easy for a lot of developers to test
speed with the English/European documents they have around and test
correctness only with Chinese. I see the argument in theory and
practice, but it's a tough line to walk, especially if you're not
familiar with i18n.

I can see for i in range (1, 1000) do a := ""; a +:= 龜; done being
way slower than necessary (especially for non-trivially optimized away
cases), for example.


The main problem with the above loop isn't ASCII vs. Chinese or some 
such. It's that depending on the way the programming language handles 
Strings, it will result in a painter's algorithm phenomenon (see 
http://www.joelonsoftware.com/articles/fog000319.html).
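A small sketch of the effect, in Python (the repeat count is arbitrary): 
if strings are immutable, appending one character to a string of length k 
copies about k+1 characters, so building the result one append at a time 
costs on the order of n^2/2 copies, independently of whether the appended 
character is ASCII or Chinese.

  def copies_when_appending(n):
      # total characters copied when building an n-character string
      # by appending one character at a time to an immutable string
      total = 0
      for k in range(n):
          total += k + 1
      return total

  print(copies_when_appending(1000))   # 500500 copies for a 1000-character result

  # The usual fix: collect the pieces and join them once, which is
  # linear in the length of the final string.
  result = "".join("龜" for _ in range(1000))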


Regards,   Martin.



Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-21 Thread Martin J. Dürst

Hello Karl,

On 2012/07/21 0:41, Karl Pentzlin wrote:

Looking for an example of plain text which is obvious to anybody,
it seems to me that the Subject field of e-mails is a good example.
Common e-mail software lets you enter any text but never gives you
access to any higher-level protocol. Possibly you can select the font
in which the subject line is shown, but this is completely independent
of the font in which your subject line is shown at the recipient's end.
Thus, you transfer here plain text, and you can use exactly the
characters which either Unicode provides to you, or which are PUA
characters which you have agreed upon with the recipient before.

In fact, the de-facto-standard regulating the e-mail content (RFC 2822,
dated April 2001 http://www.ietf.org/rfc/rfc2822.txt , afaik)


No. If you go to http://tools.ietf.org/html/rfc2822, you'll see
Obsoleted by: 5322, Updated by: 5335, 5336.
RFC 5322 is the new version, dated October 2008, but it doesn't change much.
RFC 5335 and 5336 are experimental for encoding the Subject (and a lot 
of other fields) as raw UTF-8 if the email infrastructure supports it. 
There are Standards Track updates for these two, RFC 6531 and 6532.


But what's more important for your question, at least in theory, is 
http://tools.ietf.org/html/rfc2231, which defines a way to add language 
information to header fields such as Subject:. With such information, it 
would stop being plain text.


In practice, RFC 2231 is not well known, and even less used, so except 
for detailed technical discussion, your example should be good enough.


Regards,   Martin.



defines the content of the Subject line as unstructured (p.25),
which means that it has to consist of US-ASCII characters, which in
turn can denote other (e.g. Unicode) characters by the application of
MIME protocols. Thus, the result is an unstructured character
sequence.

There is e.g. no possibility to include superscripted/subscripted
characters in a Subject of an e-mail, unless these characters are
in fact included as superscript/subscript characters in Unicode
directly.

Thus, proving the necessity to include a character in the text of a
Subject line of an e-mail, is proving that the character has to be
available as a plain text character. If, additionally, the character
is used outside a closed group (which can be advised to use PUA
characters), then there is a valid argument to include such a
character in Unicode.

Is my assumption correct?

(I think of the SUBSCRIPT SOLIDUS proposed in WG2 N3980.
  It is in fact annoying that you cannot address DIN EN 13501
  requirements in an e-mail subject line written correctly,
  as Unicode, although being an industry standard, until now
  did not listen to an industry request at this special topic.)

- Karl







Re: Character set cluelessness

2012-10-02 Thread Martin J. Dürst
Richard - Complex script usually refers to scripts where rendering isn't 
just simply putting glyphs side by side. That includes stuff with 
combining marks, ligatures, reordering, stacking, and the like.


Regards,   Martin.

On 2012/10/03 7:09, Richard Wordingham wrote:

On Tue, 02 Oct 2012 09:14:08 -0700
Doug Ewelld...@ewellic.org  wrote:


It's 2012. How does one get through to folks like this?


Even people who should know better can get confused about character
sets.  Does anyone know what 'a complex script Unicode range' is?  It's
a term that occurs in the Office Open XML specification, but I
can't find a definition for it.

It's just possible that it means a range where hypothetically unassigned
characters would not be left-to-right, but I've a feeling it ought to
include Vietnamese characters for all that they're Latin script.

Possibly the definitions have not been provided because the concept
ought to involve the tricky task of breaking text runs into script
runs.  (Lots of people feel one should be able to add script-specific
combining marks to U+25CC DOTTED CIRCLE, U+2013 EN DASH and U+00D7
MULTIPLICATION SIGN or perhaps even U+0078 LATIN SMALL LETTER X.
U+0964 DEVANAGARI DANDA is used with the Latin, Devanagari and Tamil
scripts, to name but a few.)

Richard.







Re: Character set cluelessness

2012-10-02 Thread Martin J. Dürst
So in order to get something going here, why doesn't Doug draft a letter 
to these guys (possibly based on the one from a few years ago) and then 
Mark sends it off in his position at Unicode, which hopefully will 
impress them more than just a personal contribution.


Being upset on this list (which I am, too, of course) doesn't change anything.

Regards,   Martin.

On 2012/10/03 6:15, Doug Ewell wrote:

Mark Davis mark at macchiato dot com  wrote:


I tend to agree. What would be useful is to have one column for the
city in the local language (or more columns for multilingual cities),
but it is extremely useful to have an ASCII version as well.


They have two name fields, one (Name) for the name transliterated into
Latin, and a second (NameWoDiacritics) which is an ASCII-smashed
version of the first. Again, that's fine as long as I am free to ignore
the ASCII version. They don't attempt to represent names in non-Latin
scripts, which is not my beef here.

There are many names in the Name (i.e. beyond ASCII) field that
include characters beyond 8859-1, such as œ and ž, and certainly many
beyond CP437. This is a good thing (although there are some errors, not
as many as in past years), but they need to fix their documentation to
reflect what they actually do, and not make these irrelevant,
misleading, and/or inaccurate references to 437 and to a 19-year-old
version of 10646.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell







Re: Missing geometric shapes

2012-11-08 Thread Martin J. Dürst

On 2012/11/08 19:15, Michael Everson wrote:

On 8 Nov 2012, at 09:59, Simon Montagusmont...@smontagu.org  wrote:


Please take into account that the half-stars should be symmetric-swapped in RTL 
text. I attach an example from an advertisement for a movie published in Haaretz 
2 November 2012


I don't think Geometric Shapes have the mirror property.

2605;BLACK STAR;So;0;ON;N;
2606;WHITE STAR;So;0;ON;N;


Well, those are usually symmetric, so adding a mirror property wouldn't 
change much.



In a Hebrew context you'd just choose the star you wanted (black-white vs 
white-black) and use it.


That works well if the text is written by hand. If it is produced as 
part of a script that had better work the same for many languages, symmetric 
swapping would really be very helpful.


Regards,   Martin.


Michael Everson * http://www.evertype.com/








Re: Caret

2012-11-14 Thread Martin J. Dürst

On 2012/11/13 21:49, Eli Zaretskii wrote:


I'd welcome that.  Although the reality flies in the face of user
requirements in this case: most bidi-aware editors, including my own
work in Emacs, don't have 2 carets, for some reason.  Maybe the
developers didn't consider that important enough, or maybe it's just
too darn hard...


What's the specific reason in the case of your Emacs work (which I very 
much appreciate!)?


Regards,   Martin.



Re: latin1 decoder implementation

2012-11-17 Thread Martin J. Dürst

Just in case it helps, Ruby (since version 1.9) also uses 3).
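For what it's worth, option 3) is also easy to check in Python itself, 
since its 'latin-1' codec maps every byte 0x00-0xFF to the code point of 
the same value:

  raw = bytes(range(256))
  text = raw.decode("latin-1")
  print(text[0x81] == "\u0081")           # True: bytes undefined in ISO 8859-1 pass through
  print(text.encode("latin-1") == raw)    # True: the round trip is lossless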

Regards,   Martin.

On 2012/11/17 6:48, Buck Golemon wrote:

When decoding bytes to unicode using the latin1 scheme, there are three
options for bytes not defined in the ISO-8859-1 standard.

1) Throw an error.
2) Insert the replacement glyph (fffd), indicating an unknown character.
3) Insert the unicode character with equal value. This means that
completely random bytes will always decode successfully.

The Python language currently implements option three. Is this correct?
There is an option to produce errors or replacements for encodings which
have undefined characters, but as implemented, latin1 currently defines
characters for all 256 bytes, so the option does nothing.

Restated, are the first 256 characters of unicode intended to be exactly
compatible with a latin1 codec?
This would imply that unicode has inserted character definitions into the
ISO-8859-1 standard.





Re: latin1 decoder implementation

2012-11-17 Thread Martin J. Dürst

On 2012/11/17 9:45, Doug Ewell wrote:

If he is targeting HTML5, then none of this matters, because HTML5 says
that ISO 8859-1 is really Windows-1252.


Yes. But unless Python wants to limit its use to HTML5, this should be 
handled on a separate level (mapping an iso-8859-1 label to the 
Windows-1252 decoder logic), not by trying to change ISO-8859-1 itself.
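A sketch of what handling it on that separate level could look like in 
Python (the alias table here is a made-up fragment for illustration, not 
the full WHATWG label table):

  WEB_LABEL_ALIASES = {
      "iso-8859-1": "windows-1252",
      "latin1": "windows-1252",
      "us-ascii": "windows-1252",
  }

  def decode_for_web(data: bytes, label: str) -> str:
      # Map the label the way browsers do, but leave the real
      # ISO-8859-1 codec itself untouched for everybody else.
      codec = WEB_LABEL_ALIASES.get(label.strip().lower(), label)
      return data.decode(codec)

  print(decode_for_web(b"\x93quoted\x94", "iso-8859-1"))   # “quoted”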


Regards,   Martin.


For example, there is no C1 control called NL in Windows-1252. There is
only 0x85, which maps to U+2026 HORIZONTAL ELLIPSIS.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell


From: Philippe Verdy
Sent: Friday, November 16, 2012 17:35
To: Whistler, Ken
Cc: Buck Golemon ; unicode@unicode.org
Subject: Re: latin1 decoder implementation


In fact not really, because Unicode DOES assign more precise semantics
to a few of these controls, notably for those given whitespace and
newline properties (notably TAB, LF, CR in C0 controls and NL in C1
controls, with a few additional constraints for the CR+LF sequence) as
they are part of almost all plain text protocols ; NUL also has a
specific behavior which is so common that it cannot be mapped to
anything else than a terminator or separator of plain text sequences.

So even if the ISO/IEC 8859 standard does not specify a character
mapping in C0 and C1 controls, the registered MIME types are doing so
(but nothing is well defined for the C0 and C1 controls except NUL, TAB,
CR, LF, NL, for MIME usages purpose).

And then yes, the ISO/IEC 8859 standard is different (more restrictive)
from the MIME charsets defined by the IETF in some RFC's (and registered
in the IANA registry), simply because the ISO/IEC standard (encoded
charset) was developed to be compatible with various encoding schemes,
some of them defined by ISO, some others defined by other standard
European or East-Asian bodies (including 7-bit schemes, using escape
sequences, or shift in/out controls).

By itself, the ISO/IEC 8859 is not a complete encoding scheme, it is
just defining several encoded character sets, independently of the
encoding scheme used to store or transport it (it is not even sufficient
to represent any plain-text content).

On the opposite, The MIME charsets named ISO_8859-* registered by
the IETF in the IANA registry are concrete encoding schemes, based on
the ISO/IEC 8859 standard, and suitable for representing a plain-text
content, because the MIME charsets are also adding a text presentation
protocol.

In practice, almost nobody today uses the ISO/IEC 8859 standard alone :
there's always an additional concrete protocol added on top of it (which
generally makes use of the C0 and C1 controls, but not necessarily, and
not always the same way). So plain-text documents never use the ISO/IEC
8859 standard, but the MIME charsets (plus a few specific or proprietary
charsets that have not been registered in the IANA registry as they are
bound to a non-open protocol).







Re: latin1 decoder implementation

2012-11-19 Thread Martin J. Dürst

On 2012/11/17 9:56, Philippe Verdy wrote:

True. HTML5 makes its own reinterpretation of the IETF's MIME standard,
defining its own protocol (which means that it is no longer fully
compatible with MIME and its IANA datatabase, because the mapping of the
value of a charset= pseudo-attribute is not directly to the IETF MIME
standard, but to a newer range of W3C standards).

There was a clear desire from the W3C to deprecate the use of the MIME
standard and its IANA database in HTML, to simplify the implementations


There is no need to deprecate the use of MIME in order to simplify 
implementations. No MIME-compatible implementation is required to accept 
and understand all charsets defined in the IANA registry. There are 
numerous MIME types that restrict the number of possible character 
encodings to a small set, or only require implementation of very few of 
them (XML would be a typical example).



(also to avoid the many incompatibilities that have occured in the past
with MIME charsets between the implementations).


That's the main motivation. One browser started to accept data in a form 
that it shouldn't have accepted. Sloppy content producers started to 
rely on this. Because the browser in question was the dominant browser, 
other browsers had to try and re-engineer and follow that browser, or 
just be ignored. The Encoding Spec is an attempt, hopefully successful, 
to limit these incompatibilities to those that exist today, and not let 
them increase further.



Note also that the W3C
does not automatically endorse the Unicode and ISO/IEC 10646 standards as
well (there's a delay before accepting newer releases of TUS and ISO/IEC
10646, and the W3C now frequently adds several restrictions).


Can you give examples? As far as I'm aware, the W3C has always tried to 
make sure that e.g. new characters encoded in Unicode can be used as 
soon as possible. There are some cases where this has been missed in the 
past (e.g. XML naming rules), but where corrective action has been taken.


Regards,   Martin.



2012/11/17 Doug Ewelld...@ewellic.org


If he is targeting HTML5, then none of this matters, because HTML5 says
that ISO 8859-1 is really Windows-1252.

For example, there is no C1 control called NL in Windows-1252. There is
only 0x85, which maps to U+2026 HORIZONTAL ELLIPSIS.


--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell


From: Philippe Verdy
Sent: Friday, November 16, 2012 17:35
To: Whistler, Ken
Cc: Buck Golemon ; unicode@unicode.org

Subject: Re: latin1 decoder implementation


In fact not really, because Unicode DOES assign more precise semantics to
a few of these controls, notably for those given whitespace and newline
properties (notably TAB, LF, CR in C0 controls and NL in C1 controls, with
a few additional constraints for the CR+LF sequence) as they are part of
almost all plain text protocols ; NUL also has a specific behavior which is
so common that it cannot be mapped to anything else than a terminator or
separator of plain text sequences.

So even if the ISO/IEC 8859 standard does not specify a character mapping
in C0 and C1 controls, the registered MIME types are doing so (but nothing
is well defined for the C0 and C1 controls except NUL, TAB, CR, LF, NL, for
MIME usages purpose).

And then yes, the ISO/IEC 8859 standard is different (more restrictive)
from the MIME charsets defined by the IETF in some RFC's (and registered in
the IANA registry), simply because the ISO/IEC standard (encoded charset)
was developed to be compatible with various encoding schemes, some of them
defined by ISO, some others defined by other standard European or
East-Asian bodies (including 7-bit schemes, using escape sequences, or
shift in/out controls).

By itself, the ISO/IEC 8859 is not a complete encoding scheme, it is just
defining several encoded character sets, independently of the encoding
scheme used to store or transport it (it is not even sufficient to represent
any plain-text content).

On the opposite, The MIME charsets named ISO_8859-* registered by the
IETF in the IANA registry are concrete encoding schemes, based on the
ISO/IEC 8859 standard, and suitable for representing a plain-text content,
because the MIME charsets are also adding a text presentation protocol.

In practice, almost nobody today uses the ISO/IEC 8859 standard alone :
there's always an additional concrete protocol added on top of it (which
generally makes use of the C0 and C1 controls, but not necessarily, and not
always the same way). So plain-text documents never use the ISO/IEC 8859
standard, but the MIME charsets (plus a few specific or proprietary
charsets that have not been registered in the IANA registry as they are
bound to a non-open protocol).








Re: cp1252 decoder implementation

2012-11-21 Thread Martin J. Dürst

On 2012/11/21 16:23, Peter Krefting wrote:

Doug Ewell d...@ewellic.org:


Somewhat off-topic, I find it amusing that tolerance of poorly
encoded input is considered justification for changing the underlying
standards,


The encoding work at W3C, at least as far as I see it, is not an attempt 
to redefine e.g. iso-8859-1 itself. To be blunt, it's just to make clear 
that lots of Web pages out there are lying, and help browsers detect 
this in a uniform way.


This does not mean that all other software has to do the same. Real 
ISO-8859-1 will still be treated correctly by browsers. When you create 
a Web page, if it's really iso-8859-1, then label it as such, but when 
it's actually windows-1252, then label it as such. And make sure it 
doesn't contain any undefined (or C1) codepoints. That way, it will 
interoperate not only with browsers, but also with other software.


Also, if you write any kind of tool, feel free to use the narrower 
(real) definition, and to throw errors. There are very few tools that 
have to accept as wide a range of data without throwing errors as browsers do.



when Internet Explorer has been flamed for years and years
for tolerating bad input.


It's called adapting to reality, unfortunately. There are *a lot* of
documents on the web labelled as being iso-8859-1 and/or not labelled
at all, which are using characters from the 1252 codepage. And since
using the 1252 codepage to decode proper iso-8859-1 HTML documents
does not hurt anyone (as HTML up to version 4 explicitly forbids the use
of the control codes in the 0x80-0x9F range), that is what everyone does.


One browser started to accept data in a form that it shouldn't have
accepted. Sloppy content producers started to rely on this. Because
the browser in question was the dominant browser, other browsers had
to try and re-engineer and follow that browser, or just be ignored.

Evidently it's OK if W3C or Python does it, but not if Microsoft does it.


Don't blame Microsoft here, it was Netscape (on Windows) that started
it, by just mapping the iso-8859-1 input data to a windows-1252 encoded
font output. The same pages that would work fine on Windows would show
garbage on Unix, until it was patched to also display it as codepage
1252. Internet Explorer wasn't even published when this happened, and I
can't remember now whether the first versions of it actually did this,
or if it was bolted on later.


Thanks for this correction. Because it was windows-1252, I had assumed 
it was Microsoft.


Regards,   Martin.



Why 17 planes? (was: Re: Why 11 planes?)

2012-11-27 Thread Martin J. Dürst
Well, first, it is 17 planes (or have we switched to using hexadecimal 
numbers on the Unicode list already?).


Second, of course this is in connection with UTF-16. I wasn't involved 
when UTF-16 was created, but it must have become clear that 2^16 (^ 
denotes exponentiation (to the power of)) codepoints (UCS-2) wasn't 
going to be sufficient. Assuming a surrogate-like extension mechanism, 
with high surrogates and low surrogates separated for easier 
synchronization, one needs


   2 * 2^n
surrogate-like codepoints to create

   2^(2*n)
new codepoints.

For doubling the number of codepoints (i.e. a total of 2 planes), one 
would use n=8, and so one needs 512 surrogate-like codepoints. With n=9, 
one gets 4 more planes for a total of 5 planes, and needs 1024 
surrogate-like codepoints. With n=10, one gets 16 more planes (for the 
current total of 17), but needs 2048 surrogate codepoints. With n=11, 
one would get 64 more planes for a total of 65 planes, but would need 
4096 surrogate-like codepoints. And so on.
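The arithmetic, spelled out as a few lines of Python:

  for n in (8, 9, 10, 11):
      surrogates = 2 * 2**n              # high + low surrogate-like codepoints
      extra_planes = 2**(2 * n) // 2**16
      print(n, surrogates, 1 + extra_planes)
  # prints 8 512 2, 9 1024 5, 10 2048 17, 11 4096 65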


My guess is that when this was considered, 1,048,576 codepoints was 
thought to be more than enough, and giving up 4096 codepoints in the BMP 
was no longer possible. As an additional benefit, the 17 planes fit 
nicely into 4 bytes in UTF-8.


Regards,   Martin.

On 2012/11/26 19:47, Shriramana Sharma wrote:

I'm sorry if this info is already in the Unicode website or book, but
I searched and couldn't find it in a hurry.

When extending beyond the BMP and the maximum range of 16-bit
codepoints, why was it chosen to go up to 10FFFF and not any more or
less? Wouldn't FFFFF have been the next logical stop beyond FFFF, even
if FFFFFF (or FFFFFFFF) is considered too big? (I mean, I'm not sure
how that extra 64Ki chars [10FFFF minus FFFFF] could be important...)

Thanks.





Re: cp1252 decoder implementation

2012-11-27 Thread Martin J. Dürst

On 2012/11/17 12:54, Buck Golemon wrote:

On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewelld...@ewellic.org  wrote:


Buck Golemon wrote:

  Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and

to map it to the equally-non-semantic U+81 ?


U+0081 (there are always at least four digits in this notation) just by 
chance doesn't have any definition. But if we take the next of the 
holes in windows-1258, 0x8D, we get REVERSE LINE FEED. This isn't 
exactly non-semantic (although of course browsers and quite a bit of 
other software ignores that meaning).




Why do you make this conditional on targeting html5?

To me, replacement and error is out because it means the system loses data
or completely fails where it used to succeed.


There are cases where one wants to avoid as many failures as possible, 
at the cost of GIGO (garbage in, garbage out). Browsers are definitely 
in that category.


There are other cases where one wants to catch garbage early, and not 
let it pollute the rest of the data.




Currently there's no reasonable way for me to implement the U+0081 option
other than inventing a new cp1252+latin1 codec, which seems undesirable.


Well, the above two cases cannot be met with one and the same codec 
(unless of course there are additional options that 
allow switching between one and the other).




I feel like you skipped a step. The byte is 0x81 full stop. I agree that it
doesn't matter how it's defined in latin1 (also it's not defined in latin1).
The section of the unicode standard that says control codes are equal to
their unicode characters doesn't mention latin1. Should it?
I was under the impression that it meant any single-byte encoding, since it
goes out of its way to talk about 8-bit control codes.


I'd say it intends to apply to any single-byte encoding with a full C1 
range, or in other words, any single-byte encoding conforming to the ISO 
C0/G0/C1/G1 model (that's used, if not defined, in ISO 2022). So that 
would include any encoding of the ISO-8859-X family, but not windows-* 
or Macintosh encodings.


In other words, the C1 range isn't just a dumping ground for cases where 
the conversion would fail otherwise.



Regards,   Martin.



Re: Why 17 planes?

2012-11-27 Thread Martin J. Dürst
To this, my mother would say: "Why keep it simple when we can make it 
complicated?"


Regards,Martin.

On 2012/11/27 21:01, Philippe Verdy wrote:

That's a valid computation if the extension was limited to use only
2-surrogate encodings for supplementary planes.

If we could use 3-surrogate encodings, you'd need
   3*2^n surrogates
to encode
   2^(3*n)
new codepoints.

With n=10 (like today), this requires a total of 3072 surrogates, and you
encode 2^30 new codepoints. This is still possible today, even if the BMP
is almost full and won't allow a new range of 1024 surrogates: you can
still use 2 existing surrogates to encode 2048 hyper-surrogates in the
special plane 16 (or for private use in the private planes 14 and 15),
which will combine with the existing low surrogates in the BMP.




Tool to convert characters to character names

2012-12-19 Thread Martin J. Dürst
I'm looking for a (preferably online) tool that converts Unicode 
characters to Unicode character names. Richard Ishida's tools 
(http://rishida.net/tools/conversion/) do a lot of conversions, but not 
names.
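In case it helps others with the same need: Python's unicodedata module 
can at least do the conversion locally (a small sketch):

  import unicodedata

  def names(text):
      for ch in text:
          print("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))

  names("é€")
  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
  # U+20AC EURO SIGN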


Regards,   Martin.



Re: Character name translations

2012-12-20 Thread Martin J. Dürst

On 2012/12/21 0:59, Asmus Freytag wrote:


There have been efforts at a Japanese translation of the text of the
standard, I have no idea whether that contains translated names for
characters.


JIS X 0221-1995, which is a translation of ISO 10646, contains some 
Japanese character names, but this is mostly limited to Japanese (i.e. 
those that appear in the original Japanese JIS X0208) symbols and 
punctuation, and sometimes there are two names for a single character.


I don't know about newer translations.

Regards,   Martin.



Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-05 Thread Martin J. Dürst

On 2013/01/06 7:21, Costello, Roger L. wrote:


Does this mean that when exchanging Unicode data across the Internet the 
endianness is not relevant?

Are these stated correctly:

 When Unicode data is in a file we would say, for example, "The file contains 
UTF-32BE data."

 When Unicode data is in memory we would say, "There is UTF-32 data in 
memory."

 When Unicode data is sent across the Internet we would say, "The UTF-32 data 
was sent across the Internet."


The first is correct. The second is correct. The third is wrong. The 
Internet deals with data as a series of bytes, and by its nature has to 
pass data between big-endian and little-endian machines. Therefore, 
endianness is very important on the Internet. So you would say:


"The UTF-32BE data was sent across the Internet."

Actually, as far as I'm aware, the labels UTF-16BE and UTF-16LE were 
first defined in the IETF, see 
http://tools.ietf.org/html/rfc2781#appendix-A.1.


Because of this, Internet protocols mostly prefer UTF-8 over UTF-16 (or 
UTF-32), and actual data is also heavily UTF-8. So it would be better to 
say:


When Unicode data is sent across the Internet we would say, "The UTF-8 
data was sent across the Internet."
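A small sketch of what the different labels mean in terms of bytes 
(Python 3; the sample string is arbitrary):

  s = "A€"                            # U+0041, U+20AC
  print(s.encode("utf-32-be").hex())  # 00000041000020ac, byte order fixed by the label
  print(s.encode("utf-32-le").hex())  # 41000000ac200000
  print(s.encode("utf-8").hex())      # 41e282ac, no byte order issue at all
  # the unsuffixed "utf-32" codec writes a BOM first so that the byte
  # order can be detected on the receiving side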


Regards,   Martin.



Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Martin J. Dürst

On 2013/01/08 3:27, Markus Scherer wrote:


Also, we commonly read code points from 16-bit Unicode strings, and
unpaired surrogates are returned as themselves and treated as such (e.g.,
in collation). That would not be well-formed UTF-16, but it's generally
harmless in text processing.


Things like this are called garbage in, garbage-out (GIGO). It may be 
harmless, or it may hurt you later.


Regards,   Martin.



Re: What does it mean to not be a valid string in Unicode?

2013-01-08 Thread Martin J. Dürst

On 2013/01/08 14:43, Stephan Stiller wrote:


Wouldn't the clean way be to ensure valid strings (only) when they're
built


Of course, the earlier erroneous data gets caught, the better. The 
problem is that error checking is expensive, both in lines of code and 
in execution time (I think there is data showing that in many real-life 
programs, more than 50% or 80% or so of the code is error checking, but I forgot the 
details).


So indeed as Ken has explained with a very good example, it doesn't make 
sense to check at every corner.



and then make sure that string algorithms (only) preserve
well-formedness of input?

Perhaps this is how the system grew, but it seems to me that it's
yet another legacy of C pointer arithmetic and
about convenience of implementation rather than a
safety or performance issue.


Convenience of implementation is an important aspect in programming.

 Things like this are called garbage in, garbage-out (GIGO). It may be
 harmless, or it may hurt you later.
 So in this kind of a case, what we are actually dealing with is:
 garbage in, principled, correct results out. ;-)

Sorry, but I have to disagree here. If a list of strings contains items 
with lone surrogates (garbage), then sorting them doesn't make the 
garbage go away, even if the items may be sorted in correct order 
according to some criterion.


Regards,   Martin.



Re: Normalization rate on the Web

2013-01-21 Thread Martin J. Dürst

On 2013/01/22 1:12, Denis Jacquerye wrote:

Does anybody have any idea of how much of the Web is normalized in NFC
or NFD? Or how much not normalized?


I have never measured this. But at one time, there was only NFD (and 
NFKD). The Unicode Consortium, with input from W3C, then defined NFC 
(and NFKC) to be much closer to the actual encodings used on the Web.


So in some sense, Web Content is (mostly) NFC *by design*.
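To see what the two forms look like for a typical piece of Web text 
(a small sketch with Python's unicodedata module):

  import unicodedata

  s = "café"                                   # precomposed é, as usually typed and stored
  nfc = unicodedata.normalize("NFC", s)
  nfd = unicodedata.normalize("NFD", s)
  print(nfc == s, len(nfc), len(nfd))          # True 4 5
  print([hex(ord(c)) for c in nfd])            # ['0x63', '0x61', '0x66', '0x65', '0x301']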

Regards,Martin.



How would one find out or try to make a smart guess?

I know a lot of library catalogue data is in NFD or somewhat
decomposed. Is there any other field that heavily uses decomposition?

--
Denis Moyogo Jacquerye
African Network for Localisation http://www.africanlocalisation.net/
Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
DejaVu fonts --- http://www.dejavu-fonts.org/







Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-04 Thread Martin J. Dürst

Hello Roger,

The conclusion to your question below is a very clear NO. The reason is 
that most text is already in NFC. In fact, as I wrote a few days or 
weeks ago, NFC was defined to capture what's usually around on the Web 
(and in other places, too). Trying to recommend that everything be in 
NFD when more than 99% is already in NFC, and that won't change any time 
soon, just doesn't make sense.


Also, most of the statements you have below need more qualifiers. For 
example, only a very, very small minority of people ever needs to input 
all possible composed characters (and on top of that, some clever 
software can do the normalization to NFC while the input in happening).


Regards,Martin.

On 2013/02/03 22:27, Costello, Roger L. wrote:

Hi Folks,

Thank you for your excellent responses.

Based on your responses, I now wonder why the W3C recommends NFC be used for 
text exchanges over the Internet. Aside from the size advantage of NFC, there 
seems to be tremendous advantages to using NFD:

- It’s easier to do searches and other text processing on NFD-encoded text.

- NFD makes the regular expressions used to qualify its contents much, *much* 
simpler.

- Things like fuzzy text matching are probably easier in NFD.

- It’s easier to remember a handful of useful composing accents than the much 
larger number of combined forms.

- It is easier to use a few keystrokes for combining accents than to set up 
compose key sequences for all the possible composed characters.

- Some Unicode-defined processes, such as capitalization, are not guaranteed to 
preserve normalization forms.

- Some operating systems store filenames in NFD encoding.

The W3C is currently updating their recommendations [1]:

 This version of this document was published to
 indicate the Internationalization Core Working
 Group's intention to substantially alter or replace
 the recommendations found here with very different
 recommendations in the near future.

Would you recommend that the W3C change their recommendation from:

 Use NFC when exchanging text over the Internet.

to:

 Use NFD when exchanging text over the Internet.

Would that be your recommendation to the W3C?

/Roger

[1] http://www.w3.org/TR/charmod-norm/








Re: Why wasn't it possible to encode a coeng-like joiner for Tibetan?

2013-04-12 Thread Martin J. Dürst

On 2013/04/11 16:30, Michael Everson wrote:

On 11 Apr 2013, at 00:09, Shriramana Sharmasamj...@gmail.com  wrote:


Or was the Khmer model of an invisible joiner a *later* bright idea?


Yes.


Later, yes. Bright? Most Cambodian experts disagree.

Regards,   Martin.



Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-23 Thread Martin J. Dürst



On 2013/04/23 18:01, William_J_G Overington wrote:

On Monday 22 April 2013, Asmus Freytagasm...@ix.netcom.com  wrote:


I'm always suspicious if someone wants to discuss scope of the standard before 
demonstrating a compelling case on the merits of wide-spread actual use.


The reason I want to discuss the scope is that there is uncertainty. If people 
are going to spend a lot of time and effort on the research and development of 
a system, they need to know whether the effort would all be wasted if the 
system, no matter how good and no matter how useful, were to come to nothing 
because it would be said that encoding such a system in Unicode would be out 
of scope.


[I'm just hoping this discussion will go away soon.]

You can develop such a system without using the private use area. Just 
make little pictures out of your characters, and everybody can include 
them in a Web page or an office document, print them, and so on. The 
fact that computers now handle text doesn't mean that text is the only 
thing computers can handle.


Once you have shown that your little pictures are widely used as if they 
were characters, then you have a good case for encoding. This is how 
many symbols got encoded; you can check all the documentation that is 
now public.



A ruling that such a system, if developed and shown to be useful, would be 
within scope for encoding in Unicode would allow people to research and develop 
the system with the knowledge that there will be a clear pathway of opportunity 
ahead if the research and development leads to good results.


As far as I know, the Unicode consortium doesn't rule on eventualities.


So, I feel that wanting to discuss the scope of Unicode so as to clear away 
uncertainty that may be blocking progress in research and development is a 
straightforward and reasonable thing to do.


The main blocking factor is the (limited) usefulness of your ideas. In 
case that's ever solved, the rest will be comparatively easy.


Regards,   Martin.



Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-03 Thread Martin J. Dürst

On 2013/06/22 0:32, Michael Everson wrote:

On 21 Jun 2013, at 16:20, Khaled Hosnykhaledho...@eglug.org  wrote:


Yeah, I don't believe that you can language-tag individual file names for such 
display as that is markup.


Why do you need to? You only need one language; it is not like file names are 
multilingual high-quality text books where every fine typographic detail for 
each language has to be respected.


I expect my Latvian filenames to appear as Latvian, and my Marshallese filenames to 
appear as Marshallese. The fact that the encoding was screwed up in the 1990s should not 
oblige compromise on that -- and that is not fine typographic detail.


Quite a few people might expect their Japanese filenames to appear with 
a Japanese font/with Japanese glyph variants, and their Chinese 
filenames to appear with a Chinese font/Chinese glyph variants. But 
that's never how this was planned, and that's not how it works today.


And it's a pretty easy guess that there are quite a few more users with 
Japanese and Chinese filenames in the same file system than users with 
Latvian and Marshallese filenames in the same file system, both because 
both Chinese and Japanese are used by many more people than Latvian or 
Marshallese and because China and Japan are much closer than Latvia and 
the Marshall Islands.


Regards,   Martin.


Only the language that the user cares about matters, and this can be easily 
inferred from the system locale, and passed down to the text rendering stack.


For the monolingual user.

Michael Everson * http://www.evertype.com/








Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-05 Thread Martin J. Dürst

On 2013/07/05 16:04, Denis Jacquerye wrote:

On Thu, Jul 4, 2013 at 12:07 PM, Michael Eversonever...@evertype.com  wrote:



The problem is in pretending that a cedilla and a comma below are equivalent 
because some script fonts in France or Turkey routinely write some sort of 
undifferentiated tick for ç. :-)


Can we make sure we have covered this from the other side? Are there any 
languages where there is a letter where both the form with a cedilla and 
the form with a comma below are used, and are distinguished? In other 
words, are there any languages where a user seeing a wrong form would be 
confused as to what the letter is, rather than being potentially 
surprised or annoyed at the details of the shape?


Regards,   Martin.




Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Martin J. Dürst

On 2013/07/05 17:25, Stephan Stiller wrote:


What I had in mind was more specific: Germans are supposed to convert
[ä,ö,ü,ß] to [ae,oe,ue,ss], though I don't know what's considered
best/legal wrt documents required for entering the US, for example.


I have always used Duerst on plane tickets and the like. On the customs 
form that you have to fill in when entering the US (the green one), I 
always just write Dürst; paper is patient. I have added Durst as an 
additional alternate spelling on a long-term visa application form once, 
just in case.
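
For what it's worth, the [ae, oe, ue, ss] fallback mentioned above is
simple enough to sketch in a few lines of Python (an illustration only,
not any official transliteration rule):

  # ä->ae, ö->oe, ü->ue, ß->ss, plus the capital letters; "Dürst" becomes "Duerst".
  FALLBACK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                            "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"})

  print("Dürst".translate(FALLBACK))   # Duerst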


My impression is that US customs officials are either quite 
knowledgeable or quite tolerant on such issues (or a mixture of both). 
The same applies to customs officials in other countries I have traveled 
to, and other people at airports and such. I guess they get used to 
these cases quite quickly, seeing so many passports each day.


Regards,   Martin.




Re: COMBINING OVER MARK?

2013-10-02 Thread Martin J. Dürst

On 2013/10/02 9:52, Leo Broukhis wrote:

Thanks! That comes out exactly right, although using math markup for
linguistic purposes is, IMO, a stretch.


Why? Surely, as in other fields (Math to start with), there is a boundary 
somewhere between plain text and rich text. Of course it's not always easy to 
agree on the exact place of the boundary, but in general, most people would 
agree it's there.


Regards,Martin.


Leo


On Tue, Oct 1, 2013 at 5:24 PM, Mark E. Shoulsonm...@kli.org  wrote:


With MathML, you could use:

anathemati<math><mmultiscripts><none/><mi mathvariant="roman">s</mi><mi 
mathvariant="roman">z</mi></mmultiscripts></math> (drop
that in an HTML document and take a look).

This doesn't look like plain text to me.  I don't think it argues in favor
of any sort of combining Z or general combinator mark. This is just what
markup is for.

~mark


On 10/01/2013 08:05 PM, Leo Broukhis wrote:


If my understanding of interlinear annotations is correct, to achieve
similarity with the attached sample some markup will be required as well:

anathemati<sup>U+FFF9zU+FFFAsU+FFFB</sup>e.

Leo


On Tue, Oct 1, 2013 at 3:51 PM, Jean-François Colson j...@colson.eu wrote:

  On 01/10/13 15:39, Philippe Verdy wrote:


  In plain text, we would just use the [s|z] notation without
  caring about the presentation and font sizes used in the rendered rich
  text page. It correctly represents the intended alternation
  without giving more importance to one base letter.
  But if you wanted to allow plain text search with collators, you
  would need to choose one as the base letter and the other
  one as a combining diacritic with ignored higher-level
  differences, using either US English or British/International
  English to fix the base letter (the other letter would be an
  interlinear annotation for the second orthography, either above
  or below the base letter).



  Interlinear annotation… Yes, of course, you could write
  anathematiU+FFF9zU+FFFAsU+FFFBe. Alas, the characters
 U+FFF9INTERLINEAR ANNOTATION ANCHOR
 U+FFFAINTERLINEAR ANNOTATION SEPARATOR
 U+FFFBINTERLINEAR ANNOTATION TERMINATOR
 are not supported by any software I know.







  2013/10/1 Steffen Daode sdao...@gmail.com

  Khaled Hosny khaledho...@eglug.org wrote:
  |Using TeX:
  |
  |  \def\s{${}^{\rm s}_{\rm z}$}

 Using groff:

   #!/bin/sh -

    cat <<\! > t.tr

   .de zs
   . nr #1 \\w'z'
   \\Z'\
   \\v'-.25v's\
   \\h'-\\n(#1u'\
   \\v'.5v'z\
   '\
   \\h'\\n(#1u'
   . rr #1
   ..
   Fraterni
   .zs
   e.
   !

    groff t.tr > t.ps
    ps2pdf t.ps
    rm t.tr t.ps

   exit 0

 (Can surely be tweaked.)

  |Regards,
  |Khaled

 Ciao,

 --steffen


 -- Forwarded message --
 From: Khaled Hosny khaledho...@eglug.org
 To: Leo Broukhis l...@mailcom.com
 Cc: unicode Unicode Discussion unicode@unicode.org

 Date: Tue, 1 Oct 2013 11:09:31 +0200
 Subject: Re: COMBINING OVER MARK?
 On Mon, Sep 30, 2013 at 05:51:09PM -0700, Leo Broukhis wrote:
   Hi All,
 
   Attached is a part of page 36 of  Henry Alford's *The
 Queen's English: a
   manual of idiom and usage (1888)* [
   
http://archive.org/details/queensenglishman00alfo ]
 
   Is the way to indicate alternative s/z spellings used there
 plain text
   (arguably, if it can be done with a typewriter, it is plain
 text)

I see a typeset book, not the output of a typewriter.

   or rich text (ignoring the font size of letters s and z)?
 
   If it's the latter, what's the markup to achieve it?

 Using TeX:

   \def\s{${}^{\rm s}_{\rm z}$}

   49. How are we to decide between {\it s} and {\it z} in
 such words as
   anathemati\s{}e, cauteri\s{}e, criti\-ci\s{}e,
 deodori\s{}e, dogmati\s{}e,
   fraterni\s{}e, and the rest? Many of these are derived from
 Greek
   \bye

 Regards,
 Khaled
















Re: ¥ instead of \

2013-10-27 Thread Martin J. Dürst

On 2013/10/23 4:22, Asmus Freytag wrote:

On 10/22/2013 11:38 AM, Jean-François Colson wrote:

Hello.

I know that in some Japanese encodings (JIS, EUC), \ was replaced by a ¥.

On my computer, there are some Japanese fonts where the characters
seem to be coded following Unicode, except for the \ which remained a ¥.


Yes. I'm using a Japanese Windows 7, and I can't distinguish the two 
glyphs in your message (and won't use any of them).



Is that acceptable from a Unicode point of view?

Are such fonts considered Unicode compliant?


It's one of those things where there isn't a clean solution that's also
backwards compatible.


One idea that I have been floating for years already is that Microsoft 
with each new release of Windows (and other vendors too, of course) 
tweak the Yen glyph in the respective fonts to lose more and more of 
its horizontal bars and the upper right part of the Y, and slant the 
lower part of the Y more and more.


That would put pressure on applications (mostly financial) that still 
use U+005C with Yen semantics, and help Japanese programmers to move 
away from seeing a Yen symbol where they should see a backslash. There 
are enough replacements for the Yen symbol. The usual (i.e. 'half-width') 
one is at U+00A5, which came into Unicode from ISO-8859-1 (interesting to 
note that the Yen appears in a rather constrained Western-European 
encoding). There's also a full-width variant (U+FFE5).
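
To make the encoding side of this concrete: the byte 0x5C in Shift_JIS
data decodes to U+005C, and only the font decides whether that code
point shows up as a backslash or as a Yen sign. A minimal sketch in
Python, with a made-up path:

  data = b"C:\x5cUsers"        # a path taken from Shift_JIS-encoded data
  text = data.decode("shift_jis")
  print(text)                  # C:\Users (or C:¥Users, depending on the font)
  print(hex(ord(text[2])))     # 0x5c; the code point is U+005C either way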


One thing that I have never checked personally, but which I heard from a 
former colleague who knew a lot of character encoding trivia and 
oddities, is that (at least at some point a few years ago) Japanese MS 
Word would change U+00A5 to U+005C without asking the user. Possibly the 
idea was that this way, the data could be more easily converted back 
from Unicode to Shift_JIS. But in terms of moving away from using U+005C 
with a Yen glyph, it was definitely counterproductive.


Regards,   Martin.






Re: Request for review: 3023bis (XML media types) makes significant changes

2013-12-18 Thread Martin J. Dürst

Hello Henry,

Some comments on your specific questions, which may trigger some 
additional discussion.


On 2013/12/12 1:43, Henry S. Thompson wrote:

I'm one of the editors of a proposed replacement for RFC3023 [1], the
media type registration for application/xml, text/xml and 3 others.

The draft replacement [2] includes several significant changes in the
handling of information about character encoding:

  * In cases where conflicting information is supplied (from charset
param, BOM and/or XML encoding declaration) it gives a BOM, if
present, authoritative status;


I'm a bit uneasy about the fact that we now have BOM (internal) > 
charset (external) > encoding (internal), i.e. 
internal-external-internal, but I guess there is lots of experience in 
HTML 5 for giving the BOM precedence. Also, it will be extremely rare to 
have something that looks like a BOM but isn't, and this combined with 
the fact that XML balks on encoding errors should make things quite robust.
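
As a rough illustration of that precedence (a sketch in Python of the
order discussed here, not the normative algorithm from the draft), a
consumer might do something like:

  import codecs

  def sniff_xml_encoding(data, charset_param=None):
      # assumed order: BOM, then charset parameter, then XML declaration,
      # then the UTF-8 default
      if data.startswith(codecs.BOM_UTF8):
          return "utf-8"
      if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
          return "utf-16"
      if charset_param:
          return charset_param
      head = data[:200].decode("ascii", "replace")
      if 'encoding="' in head:
          return head.split('encoding="', 1)[1].split('"', 1)[0]
      return "utf-8"

  print(sniff_xml_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><a/>'))
  # iso-8859-1 (no BOM and no charset parameter, so the declaration wins)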



  * It recommends against the use of UTF-32.


UTF-32 has some (limited) appeal for internal representation, but none 
really on the network, and media types are for network interchange, so 
this should be fine, too.


Regards,   Martin.


The interoperability situation in this space is currently poor, with
some tools treating a charset parameter as authoritative, but the HTML
5 spec and most browsers preferring the BOM.  The goal of the draft is
to specify an approach which will promote convergence, while
minimising the risk of damage from backward incompatibilities.

Since these changes overlap with a wide range of technologies, I'm
seeking review outside the relevant IETF mailing list
(apps-disc...@ietf.org) -- please take a look if you can, particularly
at Section 3 [3] and Appendix C [4].

Thanks,

ht

[1] http://tools.ietf.org/html/rfc3023
[2] http://tools.ietf.org/html/draft-ietf-appsawg-xml-mediatypes-06
[3] http://tools.ietf.org/html/draft-ietf-appsawg-xml-mediatypes-06#section-3
[4] http://tools.ietf.org/html/draft-ietf-appsawg-xml-mediatypes-06#appendix-C




Fwd: Updated Japanese Legacy Standard? (was: Re: Romanized Singhala got great reception in Sri Lanka)

2014-03-28 Thread Martin J. Dürst
I got informed today by your IT Dept. that the mail below never went 
out. Resent herewith.    Martin.


 Original Message 
Subject: Updated Japanese Legacy Standard? (was: Re: Romanized Singhala 
got great reception in Sri Lanka)

Date: Mon, 17 Mar 2014 12:32:15 +0900
From: Martin J. Dürst due...@it.aoyama.ac.jp

On 2014/03/16 14:36, Philippe Verdy wrote:


You may still want to promote it at some government or education
institution, in order to promote it as a national standard, except that
there's little chance it will ever happen when all countries in ISO have
stopped working on standardization of new 8-bit encodings (only a few ones
are maintained; but these are the most complex ones used in China and Japan).

Well in fact only Japan now seems to be actively updating its legacy JIS
standard; but only with the focus of converging it to use the UCS and solve
ambiguities or solve some technical problems (e.g. with emojis used by
mobile phone operators). Even China stopped updating its national standard
by publishing a final mapping table to/from the full UCS (including for
characters still not encoded in the UCS): this simplified the work because
only one standard needs to be maintained instead of 2.


I'm not aware of any activity in Japan regarding the update of legacy
character encodings. Can you tell me what you mean by actively updating?

Regards,   Martin.




___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Fwd: Re: Romanized Singhala got great reception in Sri Lanka

2014-03-28 Thread Martin J. Dürst
I got informed today by your IT Dept. that the mail below never went 
out. Resent herewith.    Martin.



 Original Message 
Subject: Re: Romanized Singhala got great reception in Sri Lanka
Date: Mon, 17 Mar 2014 14:37:00 +0900
From: Martin J. Dürst due...@it.aoyama.ac.jp

On 2014/03/17 13:16, Jean-François Colson wrote:



As for Japanese (and also for Indic) I have read the warnings in RFC
1815:
http://tools.ietf.org/rfc/rfc1815.txt




RFC 1815   Character Sets ISO-10646 and ISO-10646-J-1  July 1995

July 1995… Is that document up-to-date?


No, it's not. Not at all. It was outdated when it was published, and
expresses only the opinions of the author (who was well known for not
liking, and not very well understanding, Unicode).

It's labeled as Informational, which means it is not in any way part
of an IETF Standard/specification. Even April 1st RFCs are classified as
Informational.

The charset label ISO-10646-J-1 it defines is listed at
http://www.iana.org/assignments/character-sets/character-sets.xhtml, but
I don't think that there is any major conversion library that supports
this. Similar for what RFC 1815 labels as ISO-10646, which appears as
ISO-10646-Unicode-Latin1 in the IANA registry (because simply using
ISO-10646 for this would be strongly misleading).

Regards,   Martin.



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: FYI: More emoji from Chrome

2014-04-01 Thread Martin J. Dürst
Now that it's no longer April 1st (at least not here in Japan), I can 
add a (moderately) serious comment.


On 2014/04/02 01:43, Ilya Zakharevich wrote:

On Tue, Apr 01, 2014 at 09:01:39AM +0200, Mark Davis ☕️ wrote:

More emoji from Chrome:

http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html

with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y


I do not know… The demos leave me completely unimpressed: emoji — by
their nature — require higher resolution than text, so an emoji for
“pie” does not save any space compared to the word itself.  So the
impact of this on everyday English-language communication would not be
in any way beneficial.


This is somewhat different for Japanese (and languages with similar 
writing systems) because they have higher line height.


Regards,   Martin.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: FYI: More emoji from Chrome

2014-04-02 Thread Martin J. Dürst

On 2014/04/02 20:08, Christopher Fynn wrote:

On 02/04/2014, Asmus Freytag asm...@ix.netcom.com wrote:

On 4/2/2014 1:42 AM, Christopher Fynn wrote:

Rather than Emoji it might be better if people learnt Han ideographs
which are also compact (and  a far more developed system of
communication than emoji). One  CJK character can also easily replace
dozens of Latin characters - which is what is being claimed for emoji.


One wonders why the Japanese, who already know Han ideographs, took to
emoji as they did


Perhaps because emoji are a sort of playful version of  a means of
communication they are already used to


Yes. Already used to the concept that a character can represent (more or 
less) a concept. Already used to the concept that there are lots of 
characters, and a few more won't make such a difference. Already used to 
the concept that character entry means keying a word or phrase and then 
selecting what you actually want.


But I think the main reason for their spread was that the mobile phone 
companies introduced them and young people found them cute.


As a followup: Line (http://line.me/en/), the most popular Japanese 
mobile message app (similar to WhatsApp), got popular mostly because of 
its gorgeous collection of 'stickers' (over 10,000), fortunately after 
realizing that the technically correct way to deal with them was not 
squeezing them into the PUA, but treating them as inline images, 
avoiding headaches down the line for the Unicode Consortium :-).


Regards,   Martin.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Emoji

2014-04-02 Thread Martin J. Dürst

On 2014/04/03 02:00, James Lin wrote:

Emoji or 顔文字, literally means Face word or Face Characters, essentially,


Emoji is 絵文字 (picture character), 顔文字 is kaomoji (face character).

Regards,   Martin.


provides an emotional state in the context of words.  Emoji is very
popular in APJ, and especially in Japan, where most of your text will
contain at least half a dozen Emoji characters.  Remember, people in Japan
spend more than half of their commute on the train, and there is no talking
on the cellphone in the train, so most people text instead.

Everyone can guess what are the following emoji that used frequently in
Japan:

ヽ( ̄д ̄;)ノ - worried



ヾ(@゜▽゜@)ノ - happy

ヽ(#`Д´)ノ - angry

【・_・?】- confused


there is a lot more...


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Corrigendum #9

2014-06-03 Thread Martin J. Dürst

On 2014/06/03 07:08, Asmus Freytag wrote:

On 6/2/2014 2:53 PM, Markus Scherer wrote:

On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote:

I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious bugs in some lamebrained utility.


I don't expect handling these in web browsers and lamebrained
utilities. I expect them to be treated like unassigned code points.


Expecting them to be treated like unassigned code points shows that 
their use is a bad idea: Since when does the Unicode Consortium use 
unassigned code points (and the like) in plain sight?



I can't shake the suspicion that Corrigendum #9 is not actually solving
a general problem, but is a special favor to CLDR as being run by
insiders, and in the process muddying the waters for everyone else.


I have to fully agree with Asmus, Richard, Shawn and others that the use 
of non-characters in CLDR is a very bad and dangerous example.


However convenient the misuse of some of these codepoints in CLDR may 
be, it sets a very bad example for everybody else. Unicode itself should 
not just be twice as careful with the use of its own codepoints, but 10 
times as careful.
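
For concreteness, here is what these codepoints are; a minimal sketch in
Python (an illustration, not anything taken from CLDR or from a UCD API)
that could be used to filter them out of interchanged data:

  def is_noncharacter(cp: int) -> bool:
      # the 66 noncharacters: U+FDD0..U+FDEF, plus the last two code points
      # of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF)
      return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

  cleaned = "".join(c for c in "a\uFDD0b\uFFFEc" if not is_noncharacter(ord(c)))
  print(cleaned)   # abc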


I'd strongly suggest that completely independent of when and how 
Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets worked 
out for how to get rid of these codepoints in CLDR data. The sooner, the 
better.


Regards,   Martin.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Request for Information

2014-07-24 Thread Martin J. Dürst

On 2014/07/24 15:37, Richard Wordingham wrote:


No.  The text samples I could find quickly show scripta continua, but I
suspect the line breaks are occurring at word or syllable boundaries.
If I am right about the constraint on line break position, then this
can be recovered by marking the optional line breaks with ZWSP.  In
addition, the consonants should be reclassified from AL to SA.
However, such a change would be incompatible with a modern writing
system in which words are separated by spaces (if such exists). I don't
know what happens in Indonesian schools, so I can't report an error.
Scripta continua and non-scripta continua in the same script are
incompatible in plain text.


Shouldn't that be scripta non-continua ?

Regards,  Martin.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Code charts and code points (was: Re: fonts for U7.0 scripts)

2014-10-24 Thread Martin J. Dürst

On 2014/10/24 10:21, Asmus Freytag wrote:

Peter is correct.

The only fonts that should be released to the public are those that are
Unicode encoded and have the correct shaping tables.

Unlike the public, the code chart editors for Unicode have tools that
can correctly handle not only ASCII-hacked and PUA-assigned fonts, but
also fonts that use the wrong Unicode encoding (because
they were designed for an earlier draft with different code point
assignments). These tools ignore all shaping tables, so the lack of such
tables isn't an issue.

The documents created by the code charts editors are not editable in the
normal sense, so they can be published without causing problems, like
establishing a de-facto encoding. They don't contain running text in
these fonts, so there isn't an issue with search - the searchable
contents are all character names, annotations etc in Latin letters and
digits.

Releasing such fonts to the public would establish a de-facto
non-sanctioned encoding, because people could create (and interchange)
running text using them.


Hello Asmus,

The code charts are published as PDFs. In general, text in PDFs can be 
copypasted elsewhere. Is there something in place that makes sure that 
wrong Unicode encodings for glyphs published in code charts don't leak 
elsewhere?


Regards,   Martin.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: emoji are clearly the current meme fad

2014-12-17 Thread Martin J. Dürst

On 2014/12/18 06:49, Michael Everson wrote:

Clearly the plural of emoji is emojis.


Not in Japanese, where there are no plural forms. The question of what 
it is/will be in English will be decided by usage, not by grammar. I'd 
use 'emoji', but then I'm too biased towards Japanese to be relevant to 
make any predictions.


Regards,   Martin.


On 16 Dec 2014, at 12:36, Asmus Freytag asm...@ix.netcom.com wrote:


Everybody wants in on the act:

http://mashable.com/2014/12/12/bill-nye-evolution-emoji/

A./
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Michael Everson * http://www.evertype.com/

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Unicode encoding policy

2014-12-23 Thread Martin J. Dürst

On 2014/12/24 09:50, Tex Texin wrote:

True, however as William points out, apparently the rules have changed,


I hope the rules get clarified to clearly state that these are exceptions.


so it isn’t unreasonable to ask again whether the rules now allow it, or if 
people that dismissed the idea in the past would now consider it.



Personally, I think this is the wrong place for it, and as has been suggested 
numerous times, it makes sense to host the discussion elsewhere among 
interested parties.



Although I am not interested in the general case, there is a need for 
specialized cases. Just as some road sign symbols are near universal,


Actually not. I have been driving (and taking drivers' licence tests) 
in Switzerland, Japan, and the US. There are lots of similarities, but 
it'd be difficult for me to come up with an example where they are all 
identical (up to glyph/design differences).


Please see for yourself e.g. at:
https://en.wikipedia.org/wiki/Road_signs_in_Switzerland
http://www.japandriverslicense.com/japanese-road-signs.asp
https://en.wikipedia.org/wiki/Road_signs_in_the_United_States

In the US, there are also differences by state.


there is a need for symbols for quick and universal communications in 
emergencies. Identifying places of safety or danger on a map, or for the 
injured to describe symptoms, pains, and the nature of their injury (or first 
aid workers to discuss victims’ issues), or to describe the nature of a 
calamity (fire, landslide, bomb, attack, etc.), etc.


Such symbols mostly already exist. For a quick and easy introduction, 
see e.g. http://www.iso.org/iso/graphical-symbols_booklet.pdf.


If use of such symbols is found in running text, or if there is a strong 
need to use them in running text, some of these might be added to 
Unicode in the future. But they wouldn't be things invented out of the 
blue for marketing purposes, they would be well established already.



William, You might consider identifying where there are needs for such 
universal text, and working with groups that would benefit, to get support for 
universal text symbols.


So the first order of business for William (or others) should be to 
investigate what's already around.


Regards,   Martin.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Compatibility decomposition for Hebrew and Greek final letters

2015-02-19 Thread Martin J. Dürst

On 2015/02/20 05:17, Eli Zaretskii wrote:

From: Philippe Verdy verd...@wanadoo.fr
Date: Thu, 19 Feb 2015 20:31:07 +0100
Cc: Julian Bradfield jcb+unic...@inf.ed.ac.uk,
unicode Unicode Discussion unicode@unicode.org

The decompositions are not needed for plain text searches, which can use the
collation data (with the collation data, you can unify at the primary level
differences such as capitalisation and ignore diacritics, or transform some
base groups of letters into a single entry, or make some significant primary
difference when there are diacritics (for example in German, equating 'ae' and
'ä' at the primary level)).


Sorry, I disagree.  First, collation data is overkill for search,
since the order information is not required, so the weights are simply
wasting storage.  Second, people do want to find, e.g., ² when they
search for 2 etc.  I'm not saying that they _always_ want that, but
sometimes they do.  There's no reason a sophisticated text editor
shouldn't support such a feature, under user control.


Well, for cased scripts, search is usually case-insensitive, but case 
conversions aren't given by compatibility decompositions.


If the question isn't "Why are there equivalences useful for search that 
are not covered by compatibility decompositions?", but "Why doesn't 
Unicode provide some data for final/non-final Hebrew letter 
correspondence?", maybe the answer is that it hasn't been seen as a need 
up to now because it's so easy to figure out.
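
For illustration, a minimal sketch of that easy-to-figure-out
correspondence (my own table, not any Unicode data file), covering the
five Hebrew final letters and Greek final sigma:

  FINAL_TO_NONFINAL = {
      "\u05DA": "\u05DB",   # HEBREW LETTER FINAL KAF   -> KAF
      "\u05DD": "\u05DE",   # HEBREW LETTER FINAL MEM   -> MEM
      "\u05DF": "\u05E0",   # HEBREW LETTER FINAL NUN   -> NUN
      "\u05E3": "\u05E4",   # HEBREW LETTER FINAL PE    -> PE
      "\u05E5": "\u05E6",   # HEBREW LETTER FINAL TSADI -> TSADI
      "\u03C2": "\u03C3",   # GREEK SMALL LETTER FINAL SIGMA -> SIGMA
  }

  def fold_finals(s: str) -> str:
      return "".join(FINAL_TO_NONFINAL.get(c, c) for c in s)

(For Greek, str.casefold() already folds final sigma to sigma as part of
case folding; Hebrew has no case, so no built-in folding applies there.)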


Regards,   Martin.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Compatibility decomposition for Hebrew and Greek final letters

2015-02-19 Thread Martin J. Dürst

On 2015/02/19 20:47, Julian Bradfield wrote:

On 2015-02-19, Eli Zaretskii e...@gnu.org wrote:

Does anyone know why does the UCD define compatibility decompositions
for Arabic initial, medial, and final forms, but doesn't do the same
for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM?  Or for
that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA?


As far as I understand it:
In Arabic, the variant of a letter is determined entirely by its
position, so there is no compelling need to represent the forms separately
(as characters rather than glyphs) save for the existence of legacy
standards (and if there is, you can use the ZWJ/ZWNJ hacks). Thus the
forms would not have been encoded but for the legacy standards.
Whereas in Hebrew, non-final forms appear finally in certain contexts
in normal text; and in Greek, while Greek text may have a determinate
choice between σ and ς, there are many contexts where the two symbols
are distinguished (not least maths).


Digging a bit deeper, the phenomenon of a letter changing shape 
depending on position is pervasive in Arabic, and involves complicated 
interdependencies across multiple characters in good-quality typography. 
But in Hebrew, this phenomenon is minor, and marginal in Greek, and 
typographic interactions are also very limited.


That led to (after some initial tries with alternatives) different 
encoding models. In Arabic, shaping is the job of the rendering engine, 
whereas in Hebrew and Greek, it's part of the encoding.
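
A quick way to see the difference between the two encoding models is to
look at the decomposition data, e.g. with Python's unicodedata (using
U+FE8E ARABIC LETTER ALEF FINAL FORM as the example):

  import unicodedata

  print(unicodedata.decomposition("\uFE8E"))        # '<final> 0627'
  print(unicodedata.normalize("NFKC", "\uFE8E"))    # the plain ALEF, U+0627

  print(repr(unicodedata.decomposition("\u05DD")))  # '' (FINAL MEM: none)
  print(repr(unicodedata.decomposition("\u03C2")))  # '' (FINAL SIGMA: none)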


As for determinate choice between σ and ς, John Cowan once gave an 
example of a Greek word (composed of two original words) with a final 
sigma in the middle.


Regards,   Martin.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: The NEW Keyboard Layout—IEAOU

2015-01-25 Thread Martin J. Dürst

What's better on this keyboard when compared to the Dvorak layout?
At first sight, it looks heavily right-handed: all the letters that the 
Dvorak keyboard has on the home row are on the right hand.


Regards,   Martin.

P.S.: I'm a happy Dvorak user.

On 2015/01/26 06:54, Robert Wheelock wrote:

Hello!

I came up with a BRAND-NEW keyboard layout designed to make typing
easier——named the IEAOU (ee-eh-ah-oh-oo) System—based on letter frequencies.

The letters in the new IEAOU layout are arranged as follows:

(TOP):  Digits / Punctuation / Accents
(MEDIAL):  Q Y :|; W |' L N D T S H +|= \|!
(HOME):  X K G F ´|` P I E A O U
(BOTTOM):  C J Z V B M R |, |. ?|/

Please respond to air what you’d think of it.  Thank You!



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-05 Thread Martin J. Dürst

On 2015/06/04 17:03, Chris wrote:


I wish Steve Jobs was here to give this lecture.


Well, if Steve Jobs were still around, he could think about whether (and 
how many) users really want their private characters, and whether it was 
worth the time to have his engineers working on the solution. I'm not 
sure he would come to the same conclusion as you.



This whole discussion is about the fact that it would be technically possible 
to have private character sets and private agreements that your OS downloads 
without the user being aware of it.

Now if the unicode consortium were to decide on standardising a technological 
process whereby rendering engines could seamlessly download representations of 
custom characters without user intervention, no doubt all the vendors would 
support it, and all the technical mumbo jumbo of installing privately agreed 
character sets would be something users could leave for the technology to sort 
out.


You are right that it would be strictly technically possible. Not only 
that, it has been so for 10 or 20 years.


As an example, in 1996 at the WWW Conference in Paris I was 
participating in a workshop on internationalization for the Web, and by 
chance I was sitting between the participant from Adobe and the 
participant from Microsoft. These were the main companies working on 
font technology at that time, and I asked them how small it would be 
possible to make a font for a single character using their technologies 
(the purpose of such a font, as people on this thread should be able to 
guess, would be as part of a solution to exchange single, user-defined 
characters).


I don't even remember their answers. The important thing here is that the 
idea, and the technology, have been around for a long time. So why 
didn't it catch on? Maybe the demand is just not as big as some 
contributors on this list claim.


Also, maybe while the technology itself isn't rocket science, the 
responsible people at the relevant companies have enough experience with 
technology deployment to hold back. To give an example of why the 
deployment aspect is important, there were various Web-like hypertext 
technologies around when the Web took off in the 1990s. One of them was 
called HyperG. It was technologically 'better' than the Web, in that it 
avoided broken links. But it was much more difficult to deploy, and so 
it is forgotten, whereas the Web took off.


Regards,   Martin.


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Martin J. Dürst

On 2015/06/03 07:55, Chris wrote:


As you point out, “The UCS will not encode characters without a demonstrated 
usage.” But there are use cases for characters that don’t meet UCS’s criteria for a 
world wide standard, but are necessary for more specific use cases, like specialised 
regional, business, or domain specific situations.


Unicode contains *a lot* of characters for specialized regional, 
business, or domain specific situations.



My question is, given that unicode can’t realistically (and doesn’t aim to) 
encode every possible symbol in the world, why shouldn’t there be an EXTENSIBLE 
method for encoding, so that people don’t have to totally rearchitect their 
computing universe because they want ONE non-standard character in their 
documents?


As has been explained, there are technologies that allow you to do (more 
or less) that. Information technology, like many other technologies, 
works best when finding common cases used by many people. Let's look at 
some examples:


Character encodings work best when they are used widely and uniformly. I 
don't know anybody who actually uses all the characters in Unicode 
(except the guys that work on the standard itself). So for each 
individual, a smaller set would be okay. And there were (and are) 
smaller sets, not for individuals, but for countries, regions, scripts, 
and so on. Originally (when memory was very limited), these legacy 
encodings were more efficient overall, but that's no longer the case. So 
everything is moving towards Unicode.


Most Website creators don't use all the features in HTML5. So having 
different subsets for different use cases may seem to be convenient. But 
overall, it's much more efficient to have one Hypertext Markup Language, 
so that's were everybody is converging to.


From your viewpoint, it looks like having something in between 
character encodings and HTML is what you want. It would only contain the 
features you need, and nothing more, and would work in all the places 
you wanted it to work. Asmus's inline text may be something similar.


The problem is that such an intermediate technology only makes sense if 
it covers the needs of lots and lots of people. It would add a third 
technology level (between plain text and marked-up text), which would 
divert energy from the current two levels and make things more complicated.


Up to now, such a third level hasn't emerged, among other things because both 
existing technologies were good at absorbing the most important use 
cases from the middle. Unicode continues to encode whatever symbols that 
gain reasonable popularity, so every time somebody has a real good use 
case for the middle layer with a symbol that isn't yet in Unicode, that 
use case gets taken away. HTML (or Web technology in general) also 
worked to improve the situation, with technologies such as SVG and Web 
Fonts.


No technology is perfect, and so there are still some gaps between 
character encoding and markup, some of which may in due time eventually 
be filled up, but I don't think a third layer in the middle will emerge 
soon.


Regards,   Martin.


Re: International Register of Coded Character Sets

2015-06-21 Thread Martin J. Dürst


On 2015/06/22 05:37, Frédéric Grosshans wrote:

I don't know if it's what you're looking for but Google brought me to the
following URL.
https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf
I managed to download the pdf without problems. I also successfully
downloaded a standard  ( http://www.itscj.ipsj.or.jp/iso-ir/169.pdf ) to
check the URLs from the register.


I was able to access https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/, 
but that just says page not found in Japanese. Same for
https://www.itscj.ipsj.or.jp/ISO-IR/, 
http://www.itscj.ipsj.or.jp/ISO-IR/, and 
http://www.itscj.ipsj.or.jp/itscj_english/iso-ir/

(the http versions redirect to the https versions).

I left a note on their contact page 
(https://www.itscj.ipsj.or.jp/contact/index.html), in Japanese. I'll 
tell you when I hear back from them. If I don't, I'll call them; I 
remember having done that a few years ago.


Regards,Martin.


On Sun, 21 June 2015 at 19:41, Doug Ewell d...@ewellic.org wrote:


Does anyone know what happened to the International Register of Coded
Character Sets page at http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? This is
the repository for character sets registered for use with ISO 2022.

The page was redirected to a general we've reorganized our site page a
few weeks ago, and now the entire site seems to be down.

--
Doug Ewell | http://ewellic.org | Thornton, CO 






Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Martin J. Dürst

On 2015/05/29 11:37, John wrote:


If I had a large document that reused a particular character thousands of times,


Then it would be either a very boring document (containing almost only 
that same character) or it would be a very large document.



would this HTML markup require embedding that character thousands of times, or 
could I define the character once at the beginning of the sequence, and then 
refer back to it in a space efficient way?


If you want space efficiency, the best thing to do is to use generic 
compression. Many generic compression methods are available, many of 
them are widely supported, and all of them will be dealing with your 
case in a very efficient way.
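
A quick illustration of how well generic compression handles exactly
this case (a Python sketch with made-up content):

  import zlib

  doc = ("\U0001F600" * 10_000 + " some running text\n") * 10
  raw = doc.encode("utf-8")
  print(len(raw), len(zlib.compress(raw)))   # 400190 bytes down to a few hundred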



Given that it's been agreed that private use ranges are a good thing,


That's not agreed upon. I'd say that the general agreement is that the 
private ranges are of limited usefulness for some very limited use cases 
(such as designing encodings for new scripts).



and given that we can agree that exchanging data is a good thing,


Yes, but there are many other ways to do that besides Unicode. And for 
many purposes, these other ways are better suited.



maybe something should bring those two things together. Just a thought.


Just a 'non sequitur'.

Regards,   Martin.


Re: Emoji characters for food allergens

2015-07-29 Thread Martin J. Dürst



On 2015/07/29 23:27, Andrew West wrote:

On 29 July 2015 at 14:42, William_J_G Overington



My diet can include soya


There already is: you can write "My diet can include soya".

If you are likely to swell up and die if you eat a peanut (for
example), you will not want to trust your life to an emoji picture of
a peanut which could be mistaken for something else


Yes, in the worst case for something like "I like peanuts".


or rendered as a
square box for the recipient.  There may be a case to be made for
encoding symbols for food allergens for labelling purposes, but there
is no case for encoding such symbols as a form of symbolic language
for communication of dietary requirements.

Andrew
.



Re: Mark-up to Indicate Words

2015-07-15 Thread Martin J. Dürst

Hello Richard,

On 2015/07/15 16:49, Richard Wordingham wrote:

What mark-up schemes exist to show that a sequence of letters and
combining marks constitutes a single word?

Such mark-up would be useful when using spell checkers. At present, I
use U+2060 WORD JOINER (WJ) to indicate the absence of a word boundary.
(Systematic marking of boundaries using ZWSP is not popular with
users, and is normally not used in Thai - it's not supported in
their national or Windows 8-bit encodings.) However, it seems likely
that when Unicode 8.00 is defined in August, WJ will suppress line
breaks but not word breaks.  There would still be the limitation that
mark-up is not available in plain text.

It appears that, for example, Open Document Format has no mark-up to
indicate word boundaries, relying instead on the overrides of
the word boundary detection algorithms being stored at character level.


I'd suggest looking at higher-end formats such as DITA or TEI (Text 
Encoding Initiative).


Regards,   Martin.


Richard.
.



Re: A Bulldog moves on

2015-10-24 Thread Martin J. Dürst

Hello Doug,

Thanks for making us aware of this very sad event. Michael did a lot for 
Unicode, and fought bravely with his illness. I hope we can all remember 
him this week at the Unicode Conference, where he gave so many amazing 
talks.


I also hope that somebody somehow will be able to preserve all his 
tremendously instructive and funny blog posts.


Regards,   Martin.

On 2015/10/25 07:57, Doug Ewell wrote:

I wish this day had never come.

http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan=4246=176192738=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a


--
Doug Ewell | http://ewellic.org | Thornton, CO 

.


