Emoji sequences are not _encoded_, per se, in either Unicode or ISO/IEC 10646. 
In both of these coding standards, to "encode" an entity is to assign it an 
encoded representation within the standard's encoding method. In this case, 
that means assigning a code point. 

Specifying ZWJ sequences for representation of text elements is not encoding in 
the standard; it is simply defining an encoded representation for those text 
elements. Unicode gives some attention to this kind of thing, but ISO/IEC 
10646, not so much. For instance, you won't find anything in ISO/IEC 10646 
specifying that the encoded representation for a rakaar is < VIRAMA, RA >.

So, your helpful person was, indeed, helpful, giving you correct information: 
ZWJ sequences are not _characters_ and have no implications for ISO/IEC 10646.


Peter

-----Original Message-----
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of William_J_G Overington via Unicode
Sent: Monday, May 15, 2017 7:57 AM
To: unicode@unicode.org
Subject: Are Emoji ZWJ sequences characters?

I am concerned about emoji ZWJ sequences being encoded without going through 
the ISO process and whether Unicode will therefore lose synchronization with 
ISO/IEC 10646.

I have raised this by email and a very helpful person has advised me that 
encoding emoji sequences does not mean that Unicode and ISO/IEC 10646 fall out 
of synchronization, because ZWJ sequences are not *characters* and they have no 
implications for ISO/IEC 10646, noting that ISO/IEC 10646 does not define ZWJ 
sequences. 

Now I have great respect for the person who advised me. However I am a 
researcher and I opine that I need evidence.

Thus I am writing to the mailing list in the hope that there will be a 
discussion please.

http://www.unicode.org/reports/tr51/tr51-11.html
 (A proposed update document)

http://www.unicode.org/Public/emoji/5.0/emoji-zwj-sequences.txt

http://www.unicode.org/charts/PDF/U1F300.pdf

http://www.unicode.org/charts/PDF/U1F680.pdf

In tr51-11.html at 2.3 Emoji ZWJ Sequences

quote

To the user of such a system, these behave like single emoji characters, even 
though internally they are sequences.

end quote

In emoji-zwj-sequences.txt there is the following line.

1F468 200D 1F680                            ; Emoji_ZWJ_Sequence  ; man astronaut

From U1F300.pdf, 1F468 is MAN

200D is ZWJ

From U1F680.pdf, 1F680 is ROCKET
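
As a minimal sketch of the breakdown above, the following Python lists the code points of the sequence together with their character names. It assumes only the standard unicodedata module; the names MAN_ASTRONAUT and describe are made up for this illustration and do not come from any Unicode file.

```python
import unicodedata

# The "man astronaut" entry from emoji-zwj-sequences.txt, written as code points.
MAN_ASTRONAUT = "\U0001F468\u200D\U0001F680"


def describe(text):
    """Return a (hex code point, character name) pair for each code point."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for code_point, name in describe(MAN_ASTRONAUT):
    print(code_point, name)
```

Each of the three code points has a character name of its own; the label "man astronaut" is attached to the sequence in the data file, not to any single code point.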

The reasoning upon which I base my concern is as follows.

0063 is c

0070 is p

0074 is t

If 0063 200D 0074 is used to specifically request a ct ligature in a display of 
some text, then the meaning of 0063 200D 0074 is the same as the meaning of 
0063 0074 and indeed a font with an OpenType table could cause a ct ligature to 
be displayed even if the sequence is 0063 0074 rather than the sequence 0063 
200D 0074 that is used where the ligature glyph is specifically requested. Thus 
the meaning of ct is not changed by using the ZWJ character.

Now the use of the ct ligature is well-known and frequent.

Suppose now that a fontmaker is making a font of his or her own and decides to 
include a glyph for a pp ligature, with a swash flourish joining and going 
beyond the lower ends of the descenders both to the left and to the right.

The fontmaker could note that the ligature might be good in a word like copper 
but might look wrong in a word like happy due to the tail on the letter y 
clashing with the rightward side of the swash flourish. So the fontmaker 
encodes 0070 200D 0070 as a pp ligature but does not encode 0070 0070 as a pp 
ligature, so that the ligature glyph is only used when specifically requested 
using a ZWJ character.

However, when the ZWJ character is used, the meaning of the pp sequence is not 
changed from the meaning when the ZWJ character is not used.

Yet when 1F468 200D 1F680 is used, the meaning of the sequence is different 
from the meaning of the sequence 1F468 1F680 such that the meaning of 1F468 
200D 1F680 is listed in a file available from the Unicode website.
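
The comparison can be sketched at the code point level in Python. This is only an illustration: the helper names strip_zwj and code_points are made up here, and the code shows which code points remain when ZWJ is removed, not what a font or an emoji-aware renderer would display.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER


def strip_zwj(text):
    """Return text with every ZERO WIDTH JOINER removed."""
    return text.replace(ZWJ, "")


def code_points(text):
    """Format text as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# The ct and pp ligature requests, and the man astronaut sequence.
for sequence in ("c\u200Dt", "p\u200Dp", "\U0001F468\u200D\U0001F680"):
    print(code_points(sequence), "->", code_points(strip_zwj(sequence)))
```

In the first two cases the remaining code points are the same letters as before. In the third case what remains is 1F468 1F680, the sequence whose meaning is being contrasted with that of 1F468 200D 1F680.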

From where does the astronaut's spacesuit and helmet come?

I am reminded that in chemistry if one mixes two chemicals, sometimes one just 
gets a mixture of two chemicals and sometimes one gets a chemical reaction such 
that another chemical is produced.

Repeating the quote from earlier in this post.

In tr51-11.html at 2.3 Emoji ZWJ Sequences

quote

To the user of such a system, these behave like single emoji characters, even 
though internally they are sequences.

end quote

I am concerned that in the future a user of ISO/IEC 10646 will not be able to 
find from ISO/IEC 10646 the meaning of an emoji that he or she observes being 
displayed, even if he or she is able to discover what is the sequence of 
characters being used.

So I ask that this matter be discussed please.

William Overington

Monday 15 May 2017

