Re: [Jprogramming] Writing help needed: surrogate pairs

'robert therriault' via Programming Fri, 13 Sep 2019 17:42:16 -0700

Thanks Bill,

I think that I can still safely describe the use of surrogate pairs in unicode 
and unicode4 though, because it seems that those encodings do follow the utf-16 
approach, right?


Also, it seems to me that the byte encoding does follow the utf-8 approach, 
where the lower 128 ASCII characters (0 to 127) are just a special case of 
utf-8 encoding, or is that the wrong way to look at it?

    8 u: 65
A
   3 u: 8 u: 65
65
   65 { a.
A
    8 u: 162
¢
   3 u: 8 u: 162
194 162
   162 { a.
�
   194 162 { a.
¢
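
One way to check this view without relying on u: is to do the UTF-8 bit arithmetic by hand. The lines below are only a sketch of the two-byte case, not a description of how J implements 8 u: internally, and utf8two is a hypothetical name that does not exist in J or elsewhere in this thread.

```j
NB. utf8two is a hypothetical helper, not a J primitive or library verb.
NB. A code point in 16b80..16b7ff is split into 5 high bits and 6 low bits,
NB. which are added to the marker bytes 16bc0 (110xxxxx) and 16b80 (10xxxxxx).
utf8two =: 3 : '16bc0 16b80 + 32 64 #: y'

utf8two 162    NB. 194 162, the same bytes as 3 u: 8 u: 162
NB. Code points 0..127 need no such split: the single byte is the code point
NB. itself, which is why 65 { a. is already a complete UTF-8 encoding of A.
```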

Cheers, bob

> On Sep 13, 2019, at 3:42 PM, bill lam wrote:
> 
> Bob,
> 
> You are right about the range of valid Unicode code points for characters.
> 
> The J document perhaps didn't place enough emphasis on the difference
> between unicode2/unicode4 J datatype and utf16/utf32 encoding.
> 
> Putting it more explicitly, unicode2/unicode4 J datatype has nothing to do
> with utf16/utf32 unicode encoding. unicode2 is a 16-bit wide literal and all
> bit patterns are always valid; similarly for unicode4. However, this may leave
> users more confused  😂
> 
> Automatic type promotion from byte to unicode2 to unicode4 is always done atom
> by atom.
> 
> 
> On Sat, Sep 14, 2019, 1:08 AM 'robert therriault' via Programming wrote:
> 
>> Sorry to take so long to come up with information on the surrogate pairs,
>> Henry, but it is both more complicated and simpler than it all appears.
>> 
>> First, the simple part. Unicode has a codespace of 0 to 16b10ffff with a
>> gap from 16bd800 to 16bdfff (which is reserved for the surrogate pairs).
>> This means that the only valid codepoints that can represent characters are
>> 0 to 16bd7ff and 16be000 to 16b10ffff. The three different encoding schemes
>> for Unicode are UTF-8, UTF-16 and UTF-32 (the number indicating the number
>> of bits in the code unit).
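
The two valid ranges above can be tested with the same interval-index idea that the draft code later in this message uses. This is only a sketch, and isscalar is a hypothetical name introduced for illustration.

```j
NB. isscalar is a hypothetical name used only for this illustration.
NB. I. returns the index of the interval each integer falls in:
NB.   0 for 0..16bd7ff, 1 for the surrogate gap, 2 for 16be000..16b10ffff, 3 above.
NB. Negative integers also land in interval 0 and are not screened out here.
isscalar =: 3 : '(16bd7ff 16bdfff 16b10ffff I. y) e. 0 2'

isscalar 65 55357 128512 16b110000   NB. 1 0 1 0
16b110000 - 16b800                   NB. 1112064 code points remain outside the gap
```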
>> 
>> UTF-32 has enough bits to represent all of the Unicode code points as
>> single integers, but as mentioned above there are integers that do not
>> represent valid codepoints (greater than 16b10ffff or in the surrogate pair
>> gap). J's unicode4 seems to violate this by allowing surrogate pairs as
>> valid unicode4.
>> 
>>   9 u:  55357 56832 NB. A surrogate pair
>> 😀
>>   $ 9 u: 55357 56832  NB. UTF-32 is always one integer
>> 2
>>   3 u: 9 u: 55357 56832
>> 55357 56832  NB. Keeps result as a surrogate pair
>>   3!:0 [ 9 u:  55357 56832
>> 262144  NB. Unicode4
>> 
>>    9 u:  128512  NB. Proper UTF-32 for 😀
>> 😀
>>   $ 9 u: 128512 NB. result is an atom with empty shape
>> 
>>   3!:0 [ 9 u:  128512  NB. Unicode4 type
>> 262144
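
For comparison, the single integer that UTF-32 expects can be derived from a surrogate pair with ordinary arithmetic. The sketch below assumes its argument really is a lead surrogate followed by a trail surrogate; frompair is a hypothetical name, not an existing verb.

```j
NB. frompair is a hypothetical helper; it does not check that its argument
NB. really is a lead surrogate followed by a trail surrogate.
NB. Each surrogate carries 10 bits: subtract the bases 16bd800 and 16bdc00,
NB. read the two remainders as base-1024 digits, then add the offset 16b10000.
frompair =: 3 : '16b10000 + 1024 #. y - 16bd800 16bdc00'

frompair 55357 56832   NB. 128512
```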
>> 
>> So, why have surrogate pairs? That is where UTF-16 comes in. In order to
>> cover the entire codespace up to 16b10ffff by using at most two code units,
>> UTF-16 uses surrogate pairs, integers from 16bd800 to 16bdfff. The first
>> integer of the pair is in the range from 16bd800 to 16bdbff and the second
>> integer is in the range  from 16bdc00 to 16bdfff. This encoding scheme
>> provides confirmation that the surrogate pair is valid and allows a mapping
>> to the code points from 16b10000 to 16b10ffff that would not normally be
>> within reach of a single 16 bit code unit, but can be reached by two 16 bit
>> code units.
>> 
>>   3 u: 7 u: 16bffff   NB. Top of range for one 16 bit code unit
>> 65535
>>   3 u: 7 u: 16b10000  NB. Maps to surrogate pairs
>> 55296 56320
>>      7 u: 128512
>> 😀
>>   3 u: 7 u: 128512
>> 55357 56832
>>   3!:0 [ 7 u:  128512  NB. Unicode type
>> 131072
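
The mapping in the other direction, from a code point above 16bffff to its surrogate pair, is the same arithmetic reversed. Again this is a sketch under stated assumptions: topair is a hypothetical name and it does no range checking.

```j
NB. topair is a hypothetical helper; it assumes y is in 16b10000..16b10ffff.
NB. Subtract the offset 16b10000, split the remaining 20 bits into two
NB. base-1024 digits, and add the surrogate bases 16bd800 and 16bdc00.
topair =: 3 : '16bd800 16bdc00 + 1024 1024 #: y - 16b10000'

topair 128512      NB. 55357 56832
topair 16b10000    NB. 55296 56320
topair 16b10ffff   NB. 56319 57343, i.e. 16bdbff 16bdfff
```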
>> 
>> To complete the encoding options, UTF-8 maps the code points by using one
>> to four 8-bit code units in a pretty clever way. If the first code unit is
>> between 0 and 16b7f then the encoding uses only one code unit, and this
>> establishes the use of 7-bit ASCII in UTF-8. If the first code unit is
>> between 16bc2 and 16bdf then it is always a two code unit encoding and the
>> second code unit must be within the range of 16b80 and 16bbf. Three code unit
>> encodings are signalled by a first code unit from 16be0 to 16bef and four
>> code unit encodings always begin with a code unit between 16bf0 and 16bf4.
>> 
>>    8 u: 65
>> A
>>   3 u: 8 u: 65  NB. ASCII equivalent
>> 65
>>   3!:0 [ 8 u: 65  NB. literal
>> 2
>>   8 u: 295
>> ħ
>>   3 u: 8 u: 295
>> 196 167
>>   3!:0 [ 8 u: 295
>> 2
>>   8 u: 3101
>> ఝ
>>   3 u: 8 u: 3101
>> 224 176 157
>>   3!:0 [ 8 u: 3101
>> 2
>>   8 u: 128512
>> 😀
>>  3 u: 8 u: 128512
>> 240 159 152 128
>>   3!:0 [ 8 u: 128512  NB. Literal type
>> 2
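
The first-code-unit ranges described above can be turned into a small lookup, and the byte values in the session can be reproduced with base-64 arithmetic. This is a sketch rather than J's own implementation, and seqlen is a hypothetical name.

```j
NB. seqlen is a hypothetical helper that classifies a first code unit.
NB. I. finds the interval; the lookup table maps it to a sequence length,
NB. with 0 marking values that cannot start a sequence (continuation units
NB. 16b80..16bbf, the unused 16bc0 16bc1, and anything above 16bf4).
seqlen =: 3 : '(16b7f 16bc1 16bdf 16bef 16bf4 I. y) { 1 0 2 3 4 0'

seqlen 65 196 224 240 167   NB. 1 2 3 4 0

NB. The payload bits are spread over the units as base-64 digits:
16be0 16b80 16b80 + 16 64 64 #: 3101             NB. 224 176 157
16bf0 16b80 16b80 16b80 + 8 64 64 64 #: 128512   NB. 240 159 152 128
```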
>> 
>> Again problems arise when interpreting surrogate pairs, although in this
>> case the result is an ill-formed byte sequence (shown as replacement
>> characters) rather than the character that unicode4 displays.
>>   8 u: 55357 56832
>> ������
>>   3 u: 8 u: 55357 56832
>> 237 160 189 237 184 128
>>   3!:0 [ 8 u: 55357 56832
>> 2
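
The six bytes above are what the plain three-unit bit layout gives when it is applied to each surrogate separately, as the sketch below shows. Well-formed UTF-8 does not allow a first unit of 16bed to be followed by a unit in 16ba0..16bbf, which is why the result does not display as a character. The name three is hypothetical and the verb does no validity checking.

```j
NB. three is a hypothetical helper: the plain three-unit bit layout with no
NB. validity check, applied to one integer at a time (rank 0).
three =: 3 : '16be0 16b80 16b80 + 16 64 64 #: y' " 0

, three 55357 56832   NB. 237 160 189 237 184 128
```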
>> 
>> So, where does this leave us? Well, we are kind of... sort of... doing
>> unicode, but in making things convenient, we have drifted from the actual
>> unicode standard.
>> 
>> I'll wait to hear your response before I revise the wiki, as there are a
>> number of ways to go with explaining this, ranging from 'does not conform
>> to unicode spec' to 'it is what it is'
>> 
>> Cheers, bob
>> 
>> I wrote some code that will mirror unicode spec more closely when
>> converting unicode code points to the different encodings and from the
>> different encodings back to unicode code points. It is not really that
>> complicated, aside from the checking for valid ranges of encoded results.
>> The references for my process can be found here on pages 125-127
>> https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf
>> 
>> 
>>   utf32 128512
>> 128512
>>   displayutf32 128512
>> 😀
>>   utf32 55357 56832  NB. invalid codepoints
>> |domain error: utf32
>> |       utf32 55357 56832
>> 
>>   utf32ucp 55357 56832 NB. invalid utf-32 encoding
>> |domain error: utf32
>> |       utf32ucp 55357 56832
>>   utf32ucp 128512 NB. valid utf-32 encoding returns unicode code point
>> 128512
>> 
>>   utf16 128512
>> 55357 56832
>>   displayutf16 128512
>> 😀
>>   utf16 55357 56832  NB. still invalid codepoints
>> |rank error: utf16
>> |       utf16 55357 56832
>> 
>>   utf16ucp 128512  NB. invalid utf-16 encoding
>> |domain error: utf161
>> |       utf16ucp 128512
>>   utf16ucp 55357 56832  NB. valid utf-16 encoding returns unicode code point
>> 128512
>> 
>>   utf8 128512
>> 240 159 152 128
>>   displayutf8 128512
>> 😀
>>   utf8 55357 56832  NB. still invalid codepoints
>> |rank error: utf8
>> |       utf8 55357 56832
>>   utf8ucp 128512  NB. invalid utf-8 encoding
>> |domain error: utf81
>> |       utf8ucp 128512
>>   utf8ucp 240 159 152 128  NB. valid utf-8 encoding returns unicode code point
>> 128512
>> 
>> 
>> And here is the code (rough draft and could certainly be made more
>> readable) watch the word wrap on the longer lines.
>> 
>> NB. accepts code points within code space, returns utf-32
>> utf32 =: ]`[:`]`[:@.( 16bd7ff 16bdfff 16b10ffff & I.)"0
>> displayutf32 =: 9 u: utf32
>> NB. accepts utf-32 code units; valid utf-32 code units return a valid code point
>> utf32ucp =: utf32
>> 
>> NB. accepts code points within code space, returns utf-16
>> utf16 =: ]`[:`]`([: 3&u: 7&u:)`[: @.( 16bd7ff 16bdfff 16bffff 16b10ffff & I.)
>> displayutf16 =: 7 u: utf16
>> NB. one code unit: error if it is a surrogate or above 16bffff
>> utf161 =: [: ^: ((16bd7ff&< *. 16be000&>) +. 16bffff&<)
>> NB. two code units: lead in 16bd800..16bdbff, trail in 16bdc00..16bdfff
>> testutf162 =: ((16bd800&<: *. 16bdbff&>:)@{. *. (16bdc00&<: *. 16bdfff&>:)@{:)
>> utf162 =: [:`(#. @: ((_20&{. #: 16b10000) + _20&{.@,@:(6&}."1)@#:)) @. testutf162
>> NB. accepts utf-16 code units; valid utf-16 code units return a valid code point
>> utf16ucp =: (utf161)`(utf162) @.( <:@#)
>> 
>> utf8 =: (3&u:@":@(9&u:))`[:`(3&u:@":@(9&u:))`[: @.( 16bd7ff 16bdfff 16b10ffff & I.)
>> displayutf8 =: a. {~ utf8
>> NB. one code unit: error above 16b7f
>> utf81 =: [: ^: (16b7f&<)
>> NB. two code units: lead 16bc2..16bdf, trail 16b80..16bbf
>> testutf82 =: ((16bc2&<: *. 16bdf&>:)@{. *. (16b80&<: *. 16bbf&>:)@{:)
>> utf82 =: [:`(#. @: (_5&{.@#:@{. , _6&{.@#:@{:)) @. testutf82
>> NB. three code units: the four well-formed lead/second-unit combinations
>> t3a =: (16be0&=)@{. *. (16ba0&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:
>> t3b =: (16be1&<: *. 16bec&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:
>> t3c =: (16bed&=)@{. *. (16b80&<: *. 16b9f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:
>> t3d =: (16bee&<: *. 16bef&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:
>> testutf83 =: t3a +. t3b +. t3c +. t3d
>> utf83 =: [:`(#. @: (_4&{.@#:@{. , _6&{.@#:@(1&{) ,_6&{.@#:@{:)) @. testutf83
>> NB. four code units: leads 16bf0 and 16bf4 restrict only the second unit
>> t4a =: (16bf0&=)@{. *. (16b90&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:
>> t4b =: (16bf1&<: *. 16bf3&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:
>> t4c =: (16bf4&=)@{. *. (16b80&<: *. 16b8f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:
>> testutf84 =: t4a +. t4b +. t4c
>> utf84 =: [:`(#. @: (_3&{.@#:@{. , _6&{.@#:@(1&{) , _6&{.@#:@(2&{) ,_6&{.@#:@{:)) @. testutf84
>> NB. accepts utf-8 code units; valid utf-8 code units return a valid code point
>> utf8ucp =: (utf81)`(utf82)`(utf83)`(utf84) @.( <:@#)
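
The bit extraction in utf82, utf83 and utf84 can be cross-checked with a shorter calculation: subtract the marker bits from each code unit and read what is left as base-64 digits. This is only a sketch with no validation at all, so it assumes the input is already well-formed.

```j
NB. Unvalidated cross-check: the inputs are assumed to be well-formed UTF-8.
NB. Subtracting the markers leaves the payload of each unit; 64 #. joins them.
64 #. 194 162 - 16bc0 16b80                       NB. 162
64 #. 224 176 157 - 16be0 16b80 16b80             NB. 3101
64 #. 240 159 152 128 - 16bf0 16b80 16b80 16b80   NB. 128512
```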
>> 
>> 
>>> On Sep 4, 2019, at 8:23 AM, 'robert therriault' via Programming wrote:
>>> 
>>>>>> On Sep 3, 2019, at 7:04 PM, Henry Rich wrote:
>>>>>> 
>>>>>> The introductory page for Unicode
>>>>>> 
>>>>>> https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
>>>>>> 
>>>>>> does not discuss 4-byte characters, or the concept of surrogate pairs with 2-byte characters.
>>>>>> 
>>>>>> 4-byte precision is called unicode4 in NuVoc.  If someone would add discussion of these to the page, they would be a Hero.  I'm just saying.
>>>>>> 
>>>>>> Henry Rich
>> 
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
>> 