Re: [Jprogramming] Writing help needed: surrogate pairs

'robert therriault' via Programming Fri, 13 Sep 2019 10:09:29 -0700

Sorry to take so long to come up with information on the surrogate pairs, 
Henry, but it is both more complicated and simpler than it all appears.

First, the simple part. Unicode has a codespace of 0 to 16b10ffff with a gap 
from 16bd800 to 16bdfff (which is reserved for the surrogate pairs). This means 
that the only valid codepoints that can represent characters are 0 to 16bd7ff 
and 16be000 to 16b10ffff. The three different encoding schemes for Unicode are 
UTF-8, UTF-16 and UTF-32 (the number indicating the number of bits in the code 
unit). 

UTF-32 has enough bits to represent all of the Unicode code points as single 
integers, but as mentioned above there are integers that do not represent valid 
codepoints (greater than 16b10ffff or in the surrogate pair gap). J's unicode4 
seems to violate this by allowing surrogate pairs as valid unicode4:

   9 u:  55357 56832 NB. A surrogate pair
😀
   $ 9 u: 55357 56832  NB. UTF-32 is always one integer
2
   3 u: 9 u: 55357 56832
55357 56832  NB. Keeps result as a surrogate pair
   3!:0 [ 9 u:  55357 56832
262144  NB. Unicode4

   9 u:  128512  NB. Proper UTF-32 for 😀
😀
   $ 9 u: 128512 NB. result is an atom with empty shape

   3!:0 [ 9 u:  128512  NB. Unicode4 type
262144
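
For anyone who wants to cross-check the rule outside J, here is a minimal 
Python sketch of the check a strict UTF-32 encoder would make. The names 
is_scalar_value and utf32_units are made up for this illustration; they are 
not J primitives and not part of any library.

```python
# Hypothetical helper names, chosen only for this illustration.
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
CODESPACE_LAST = 0x10FFFF


def is_scalar_value(cp):
    """True if cp lies in the codespace and outside the surrogate gap."""
    in_codespace = 0 <= cp <= CODESPACE_LAST
    is_surrogate = SURROGATE_FIRST <= cp <= SURROGATE_LAST
    return in_codespace and not is_surrogate


def utf32_units(code_points):
    """Return the code points unchanged as UTF-32 code units.

    Raises ValueError if any value is a surrogate or lies beyond 0x10FFFF.
    """
    code_points = list(code_points)
    for cp in code_points:
        if not is_scalar_value(cp):
            raise ValueError("not a Unicode scalar value: %#x" % cp)
    return code_points
```

Under this rule 128512 passes through unchanged, while 55357 56832 is 
rejected instead of being displayed as a character.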

So, why have surrogate pairs? That is where UTF-16 comes in. In order to cover 
the entire codespace up to 16b10ffff by using at most two code units, UTF-16 
uses surrogate pairs, integers from 16bd800 to 16bdfff. The first integer of 
the pair is in the range from 16bd800 to 16bdbff and the second integer is in 
the range from 16bdc00 to 16bdfff. Because the two ranges do not overlap, a 
decoder can confirm that a surrogate pair is well formed, and each pair maps to 
one of the code points from 16b10000 to 16b10ffff, which are out of reach of a 
single 16-bit code unit but can be reached by two.

   3 u: 7 u: 16bffff   NB. Top of range for one 16 bit code unit
65535
   3 u: 7 u: 16b10000  NB. Maps to surrogate pairs
55296 56320
   7 u: 128512
😀
   3 u: 7 u: 128512
55357 56832
   3!:0 [ 7 u:  128512  NB. Unicode type
131072
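
The mapping between a supplementary code point and its surrogate pair is plain 
arithmetic on a 20-bit offset. A minimal Python sketch, with function names 
invented for this illustration, that reproduces the numbers J shows above:

```python
def to_surrogate_pair(cp):
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a well-formed (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(to_surrogate_pair(0x10000))        # (55296, 56320)
print(to_surrogate_pair(128512))         # (55357, 56832)
print(from_surrogate_pair(55357, 56832)) # 128512
```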

To complete the encoding options, UTF-8 maps the code points using one to 
four 8-bit code units in a pretty clever way. If the first code unit is between 
0 and 16b7f then the encoding uses only one code unit, and this preserves 7-bit 
ASCII in UTF-8. If the first code unit is between 16bc2 and 16bdf then it is 
always a two code unit encoding and the second code unit must be within the 
range of 16b80 to 16bbf. Three code unit encodings are signalled by a first 
code unit from 16be0 to 16bef, and four code unit encodings always begin with a 
code unit between 16bf0 and 16bf4.

   8 u: 65
A
   3 u: 8 u: 65  NB. ASCII equivalent 
65
   3!:0 [ 8 u: 65  NB. literal
2
   8 u: 295
ħ
   3 u: 8 u: 295
196 167
   3!:0 [ 8 u: 295
2
   8 u: 3101
ఝ
   3 u: 8 u: 3101
224 176 157
   3!:0 [ 8 u: 3101
2
   8 u: 128512
😀
   3 u: 8 u: 128512
240 159 152 128
   3!:0 [ 8 u: 128512  NB. Literal type
2
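
The byte values above follow directly from how UTF-8 distributes the bits of a 
code point over one to four code units. A minimal Python sketch (utf8_encode is 
a name made up for this illustration), checked here against the four examples 
from the J session:

```python
def utf8_encode(cp):
    """Encode one Unicode scalar value as a list of UTF-8 code units."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:                       # 1 unit: 0xxxxxxx
        return [cp]
    if cp <= 0x7FF:                      # 2 units: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    if cp <= 0xFFFF:                     # 3 units: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18),           # 4 units: 11110xxx and three trailers
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


print(utf8_encode(65))       # [65]
print(utf8_encode(295))      # [196, 167]
print(utf8_encode(3101))     # [224, 176, 157]
print(utf8_encode(128512))   # [240, 159, 152, 128]
```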

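Again problems arise when interpreting surrogate pairs, although in this case 
the pair is not combined the way unicode4 combines it. Each surrogate is 
encoded separately as a three code unit sequence, which is not well-formed 
UTF-8 and does not display as a character.
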
   8 u: 55357 56832
������
   3 u: 8 u: 55357 56832
237 160 189 237 184 128
   3!:0 [ 8 u: 55357 56832
2
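
The same six bytes can be reproduced outside J. In Python a lone surrogate can 
only be encoded to UTF-8 by asking for the non-default "surrogatepass" error 
handler, and a strict decoder then refuses the result. A small sketch (the 
function names are made up for this illustration):

```python
# Two separate surrogate code points, not one character.
lone_pair = "\ud83d\ude00"


def encode_each_surrogate(text):
    """Encode surrogates one by one, as J's 8 u: appears to do above."""
    return list(text.encode("utf-8", "surrogatepass"))


def is_well_formed_utf8(byte_values):
    """True if a strict UTF-8 decoder accepts the byte values."""
    try:
        bytes(byte_values).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(encode_each_surrogate(lone_pair))                      # [237, 160, 189, 237, 184, 128]
print(is_well_formed_utf8([237, 160, 189, 237, 184, 128]))   # False
print(is_well_formed_utf8([240, 159, 152, 128]))             # True
```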

So, where does this leave us? Well, we are kind of... sort of... doing unicode, 
but in making things convenient, we have drifted from the actual unicode 
standard. 

I'll wait to hear your response before I revise the wiki, as there are a number 
of ways to go with explaining this, ranging from 'does not conform to unicode 
spec' to 'it is what it is'.

Cheers, bob

I wrote some code that will mirror unicode spec more closely when converting 
unicode code points to the different encodings and from the different encodings 
back to unicode code points. It is not really that complicated, aside from the 
checking for valid ranges of encoded results. The references for my process can 
be found on pages 125-127 of
https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf

   utf32 128512
128512
   displayutf32 128512
😀
   utf32 55357 56832  NB. invalid codepoints
|domain error: utf32
|       utf32 55357 56832

   utf32ucp 55357 56832 NB. invalid utf-32 encoding
|domain error: utf32
|       utf32ucp 55357 56832
   utf32ucp 128512 NB. valid utf-32 encoding returns unicode code point
128512

   utf16 128512
55357 56832
   displayutf16 128512
😀
   utf16 55357 56832  NB. still invalid codepoints
|rank error: utf16
|       utf16 55357 56832

   utf16ucp 128512  NB. invalid utf-16 encoding
|domain error: utf161
|       utf16ucp 128512
   utf16ucp 55357 56832  NB. valid utf-16 encoding returns unicode code point
128512

   utf8 128512
240 159 152 128
   displayutf8 128512
😀
   utf8 55357 56832  NB. still invalid codepoints
|rank error: utf8
|       utf8 55357 56832
   utf8ucp 128512  NB. invalid utf-8 encoding
|domain error: utf81
|       utf8ucp 128512
   utf8ucp 240 159 152 128  NB. valid utf-8 encoding returns unicode code point
128512

And here is the code (a rough draft that could certainly be made more 
readable). Watch for word wrap on the longer lines.

NB. accepts code points within code space, returns utf-32
utf32 =: ]`[:`]`[:@.( 16bd7ff 16bdfff 16b10ffff & I.)"0
displayutf32 =: 9 u: utf32
NB. accepts utf-32 code units - valid utf-32 code units will return a valid code point
utf32ucp =: utf32

NB. accepts code points within code space, returns utf-16
utf16 =: ]`[:`]`([: 3&u: 7&u:)`[: @.( 16bd7ff 16bdfff 16bffff 16b10ffff & I.)
displayutf16 =: 7 u: utf16
NB. a single code unit is rejected if it is a surrogate or is greater than 16bffff
utf161 =: [: ^: (16bd7ff&< *. 16be000&> +. 16bffff&<)
testutf162 =:((16bd800&<: *. 16bdbff&>:)@{. *. (16bdc00&<: *. 16bdfff&>:)@{:)
utf162 =: [:`(#. @: ((_20&{. #: 16b10000) + _20&{.@,@:(6&}."1)@#:)) @. testutf162
NB. accepts utf-16 code units - valid utf-16 code units will return a valid code point
utf16ucp =: (utf161)`(utf162) @.( <:@#)

utf8 =: (3&u:@":@(9&u:))`[:`(3&u:@":@(9&u:))`[: @.( 16bd7ff 16bdfff 16b10ffff & I.)
displayutf8 =: a. {~ utf8
utf81 =: [: ^: (16b7f&<)
testutf82 =:  ((16bc2&<: *. 16bdf&>:)@{. *. (16b80&<: *. 16bbf&>:)@{:)
utf82 =: [:`(#. @: (_5&{.@#:@{. , _6&{.@#:@{:)) @. testutf82
testutf83 =: ((16be0&= )@{. *. (16ba0&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16be1&<: *. 16bec&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bed&= )@{. *. (16b80&<: *. 16b9f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bee&<: *. 16bef&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)
utf83 =: [:`(#. @: (_4&{.@#:@{. , _6&{.@#:@(1&{) ,_6&{.@#:@{:)) @. testutf83
NB. after a leading 16bf4 only the second code unit is limited to 16b80-16b8f
testutf84 =: ((16bf0&= )@{. *. (16b90&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bf1&<: *. 16bf3&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bf4&= )@{. *. (16b80&<: *. 16b8f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)
utf84 =: [:`(#. @: (_3&{.@#:@{. , _6&{.@#:@(1&{) , _6&{.@#:@(2&{) ,_6&{.@#:@{:)) @. testutf84
NB. accepts utf-8 code units - valid utf-8 code units will return a valid code point
utf8ucp =: (utf81)`(utf82)`(utf83)`(utf84) @.( <:@#)
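
As an independent cross-check of the decoding rules, here is a minimal Python 
sketch of a strict decoder driven by the well-formed byte sequence ranges in 
Table 3-7 of the chapter referenced above. It is not a translation of the J 
verbs; the table layout and the name utf8_to_code_point are assumptions made 
for this illustration.

```python
# Each row: (lead low, lead high, second low, second high, sequence length).
# Any third or fourth code unit must always lie in 0x80..0xBF.
WELL_FORMED = [
    (0xC2, 0xDF, 0x80, 0xBF, 2),
    (0xE0, 0xE0, 0xA0, 0xBF, 3),
    (0xE1, 0xEC, 0x80, 0xBF, 3),
    (0xED, 0xED, 0x80, 0x9F, 3),   # excludes the surrogates
    (0xEE, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),   # stops at 0x10FFFF
]


def utf8_to_code_point(units):
    """Decode exactly one well-formed UTF-8 sequence, else raise ValueError."""
    units = list(units)
    if not units:
        raise ValueError("empty sequence")
    if len(units) == 1:
        if 0 <= units[0] <= 0x7F:
            return units[0]
        raise ValueError("single code unit must be 0x00..0x7F")
    for lead_lo, lead_hi, second_lo, second_hi, length in WELL_FORMED:
        if lead_lo <= units[0] <= lead_hi:
            break
    else:
        raise ValueError("invalid leading code unit")
    if len(units) != length:
        raise ValueError("wrong number of code units")
    if not second_lo <= units[1] <= second_hi:
        raise ValueError("second code unit out of range")
    if any(not 0x80 <= u <= 0xBF for u in units[2:]):
        raise ValueError("trailing code unit out of range")
    # The lead unit carries 5, 4 or 3 payload bits; each trailer carries 6.
    cp = units[0] & (0xFF >> (length + 1))
    for u in units[1:]:
        cp = (cp << 6) | (u & 0x3F)
    return cp


print(utf8_to_code_point([240, 159, 152, 128]))   # 128512
print(utf8_to_code_point([196, 167]))             # 295
```

With this table 237 160 189 is rejected, because a leading 16bed allows only 
16b80 to 16b9f as the second code unit, and 244 143 191 191 decodes to 
16b10ffff, the top of the codespace.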

> On Sep 4, 2019, at 8:23 AM, 'robert therriault' via Programming 
> <[email protected]> wrote:
> 
>>>> On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]> wrote:
>>>> 
>>>> The introductory page for Unicode
>>>> 
>>>> https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
>>>> 
>>>> does not discuss 4-byte characters, or the concept of surrogate pairs with 
>>>> 2-byte characters.
>>>> 
>>>> 4-byte precision is called unicode4 in NuVoc.  If someone would add 
>>>> discussion of these to the page, they would be a Hero.  I'm just saying.
>>>> 
>>>> Henry Rich

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
