Sorry to take so long to come up with information on the surrogate pairs,
Henry, but it is both more complicated and simpler than it all appears.
First, the simple part. Unicode has a codespace of 0 to 16b10ffff with a gap
from 16bd800 to 16bdfff (which is reserved for the surrogate pairs). This means
that the only valid codepoints that can represent characters are 0 to 16bd7ff
and 16be000 to 16b10ffff. The three different encoding schemes for Unicode are
UTF-8, UTF-16 and UTF-32 (the number indicating the number of bits in the code
unit).
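The validity rule above can be sketched in a couple of lines of J. The names insurr and isscalar are my own for illustration; they are not J primitives and do not appear on the wiki.

```j
   NB. insurr and isscalar are hypothetical helper names, not part of J
   insurr   =: 16bd800&<: *. 16bdfff&>:     NB. 1 if y lies in the surrogate gap
   isscalar =: (0&<: *. 16b10ffff&>:) *. -.@insurr
   isscalar 0 16bd7ff 16bd800 16bdfff 16be000 16b10ffff 16b110000
1 1 0 0 1 1 0
```

Only the two ends of the codespace and the two edges of the gap are tested here, which is where off-by-one mistakes tend to hide.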
UTF-32 has enough bits to represent every Unicode code point as a single
integer, but as mentioned above there are integers that do not represent valid
code points (those greater than 16b10ffff or in the surrogate pair gap). J's
unicode4 seems to violate this by accepting surrogate pairs as valid unicode4:
   9 u: 55357 56832         NB. A surrogate pair
😀
   $ 9 u: 55357 56832       NB. UTF-32 is always one integer
2
   3 u: 9 u: 55357 56832    NB. Keeps result as a surrogate pair
55357 56832
   3!:0 [ 9 u: 55357 56832  NB. Unicode4
262144
   9 u: 128512              NB. Proper UTF-32 for 😀
😀
   $ 9 u: 128512            NB. result is an atom with empty shape

   3!:0 [ 9 u: 128512       NB. Unicode4 type
262144
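As a rough sketch, a surrogate hiding inside a unicode4 value can be detected by converting back to integers with 3 u: and testing the gap. The name hassurr is hypothetical, and the sketch assumes 9 u: keeps the surrogates as separate items, as the session above shows.

```j
   NB. hassurr is a hypothetical name; it relies on 3 u: returning the
   NB. integers held in a unicode4 value, as in the session above
   hassurr =: 3 : '+./ (16bd800&<: *. 16bdfff&>:) , 3 u: y'
   hassurr 9 u: 55357 56832
1
   hassurr 9 u: 128512
0
```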
So, why have surrogate pairs? That is where UTF-16 comes in. In order to cover
the entire codespace up to 16b10ffff using at most two code units, UTF-16
uses surrogate pairs, integers from 16bd800 to 16bdfff. The first integer of
the pair is in the range from 16bd800 to 16bdbff and the second integer is in
the range from 16bdc00 to 16bdfff. Because the two ranges do not overlap, a
decoder can confirm that a pair is valid, and each valid pair maps to one of
the code points from 16b10000 to 16b10ffff, which are out of reach of a single
16-bit code unit but can be reached by two 16-bit code units.
   3 u: 7 u: 16bffff      NB. Top of range for one 16 bit code unit
65535
   3 u: 7 u: 16b10000     NB. Maps to surrogate pairs
55296 56320
   7 u: 128512
😀
   3 u: 7 u: 128512
55357 56832
   3!:0 [ 7 u: 128512     NB. Unicode type
131072
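The mapping between a code point above 16bffff and its surrogate pair is plain base-1024 arithmetic on the offset from 16b10000. Here is a minimal sketch of it; topair and frompair are hypothetical names of mine, and neither does any range checking.

```j
   NB. topair and frompair are hypothetical names; no validation is done
   topair   =: 3 : '16bd800 16bdc00 + 0 16b400 #: y - 16b10000'
   frompair =: 3 : '16b10000 + 16b400 #. y - 16bd800 16bdc00'
   topair 128512
55357 56832
   topair 16b10000
55296 56320
   frompair 55357 56832
128512
```

The results agree with what 3 u: 7 u: reports in the session above.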
To complete the encoding options, UTF-8 maps the code points using one to
four 8-bit code units in a pretty clever way. If the first code unit is between
0 and 16b7f then the encoding uses only one code unit, and this preserves
7-bit ASCII in UTF-8. If the first code unit is between 16bc2 and 16bdf then it
is always a two code unit encoding and the second code unit must be within the
range of 16b80 to 16bbf. Three code unit encodings are signalled by a first
code unit from 16be0 to 16bef, and four code unit encodings always begin with a
code unit between 16bf0 and 16bf4.
   8 u: 65
A
   3 u: 8 u: 65           NB. ASCII equivalent
65
   3!:0 [ 8 u: 65         NB. literal
2
   8 u: 295
ħ
   3 u: 8 u: 295
196 167
   3!:0 [ 8 u: 295
2
   8 u: 3101
ఝ
   3 u: 8 u: 3101
224 176 157
   3!:0 [ 8 u: 3101
2
   8 u: 128512
😀
   3 u: 8 u: 128512
240 159 152 128
   3!:0 [ 8 u: 128512     NB. Literal type
2
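The first code unit alone tells you how long the sequence should be. A small sketch using I. (interval index) follows; utf8len is a hypothetical name, and it returns 0 for a code unit that cannot start a sequence. It looks only at the first code unit and says nothing about whether the trailing ones are valid.

```j
   NB. utf8len is a hypothetical name; 0 means "cannot begin a sequence"
   utf8len =: 3 : '(16b7f 16bc1 16bdf 16bef 16bf4 I. y) { 1 0 2 3 4 0'
   utf8len 65 196 224 240
1 2 3 4
   utf8len 167 16bc0 16bf5
0 0 0
```

The first line matches the leading code units of the four examples above (A, ħ, ఝ and 😀).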
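Again problems arise when interpreting surrogate pairs, although in this case
each surrogate is encoded separately as three code units, which is not valid
UTF-8 and does not display, instead of being interpreted as one character the
way unicode4 does.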
   8 u: 55357 56832
������
   3 u: 8 u: 55357 56832
237 160 189 237 184 128
   3!:0 [ 8 u: 55357 56832
2
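To see why those six code units are not valid UTF-8: both groups start with 237 (16bed), and the standard's table of well-formed sequences only allows 16b80 to 16b9f as the second code unit after 16bed. A quick check of the two second code units, 160 and 184, offered as a sketch rather than a full validator:

```j
   16bed = 237
1
   (16b80&<: *. 16b9f&>:) 160 184   NB. second code unit of each group
0 0
```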
So, where does this leave us? Well, we are kind of... sort of... doing unicode,
but in the process of making the process convenient, we have drifted from the
actual unicode standard.
I'll wait to hear your response before I revise the wiki, as there are a number
of ways to go with explaining this, ranging from 'does not conform to unicode
spec' to 'it is what it is'.
Cheers, bob
I wrote some code that mirrors the unicode spec more closely when converting
unicode code points to the different encodings and from the different encodings
back to unicode code points. It is not really that complicated, aside from
checking for valid ranges of encoded results. The references for my process can
be found here on pages 125-127:
https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf
   utf32 128512
128512
   displayutf32 128512
😀
   utf32 55357 56832      NB. invalid codepoints
|domain error: utf32
|       utf32 55357 56832
   utf32ucp 55357 56832   NB. invalid utf-32 encoding
|domain error: utf32
|       utf32ucp 55357 56832
   utf32ucp 128512        NB. valid utf-32 encoding returns unicode code point
128512
   utf16 128512
55357 56832
   displayutf16 128512
😀
   utf16 55357 56832      NB. still invalid codepoints
|rank error: utf16
|       utf16 55357 56832
   utf16ucp 128512        NB. invalid utf-16 encoding
|domain error: utf161
|       utf16ucp 128512
   utf16ucp 55357 56832   NB. valid utf-16 encoding returns unicode code point
128512
   utf8 128512
240 159 152 128
   displayutf8 128512
😀
   utf8 55357 56832       NB. still invalid codepoints
|rank error: utf8
|       utf8 55357 56832
   utf8ucp 128512         NB. invalid utf-8 encoding
|domain error: utf81
|       utf8ucp 128512
   utf8ucp 240 159 152 128   NB. valid utf-8 encoding returns unicode code point
128512
And here is the code (a rough draft that could certainly be made more
readable). Watch the word wrap on the longer lines.
NB. accepts code points within the code space, returns utf-32
utf32 =: ]`[:`]`[:@.( 16bd7ff 16bdfff 16b10ffff & I.)"0
displayutf32 =: 9 u: utf32
NB. accepts utf-32 code units; valid utf-32 code units return a valid code point
utf32ucp =: utf32
NB. accepts code points within the code space, returns utf-16
utf16 =: ]`[:`]`([: 3&u: 7&u:)`[: @.( 16bd7ff 16bdfff 16bffff 16b10ffff & I.)
displayutf16 =: 7 u: utf16
NB. one code unit: error if it is a surrogate or is above 16bffff
utf161 =: [: ^: (16bd7ff&< *. -.@(16be000&<: *. 16bffff&>:))
NB. two code units: a leading surrogate followed by a trailing surrogate
testutf162 =: ((16bd800&<: *. 16bdbff&>:)@{. *. (16bdc00&<: *. 16bdfff&>:)@{:)
utf162 =: [:`(#. @: ((_20&{. #: 16b10000) + _20&{.@,@:(6&}."1)@#:)) @. testutf162
NB. accepts utf-16 code units; valid utf-16 code units return a valid code point
utf16ucp =: (utf161)`(utf162) @.( <:@#)
NB. accepts code points within the code space, returns utf-8
utf8 =: (3&u:@":@(9&u:))`[:`(3&u:@":@(9&u:))`[: @.( 16bd7ff 16bdfff 16b10ffff & I.)
displayutf8 =: a. {~ utf8
NB. cont, t3a..t3d and t4a..t4c are helpers, one per row of the table of
NB. well-formed utf-8 sequences in the reference above
cont =: 16b80&<: *. 16bbf&>:   NB. trailing code unit in 16b80..16bbf
utf81 =: [: ^: (16b7f&<)
testutf82 =: (16bc2&<: *. 16bdf&>:)@{. *. cont@{:
utf82 =: [:`(#. @: (_5&{.@#:@{. , _6&{.@#:@{:)) @. testutf82
t3a =: (16be0&=)@{. *. (16ba0&<: *. 16bbf&>:)@(1&{) *. cont@{:
t3b =: (16be1&<: *. 16bec&>:)@{. *. cont@(1&{) *. cont@{:
t3c =: (16bed&=)@{. *. (16b80&<: *. 16b9f&>:)@(1&{) *. cont@{:
t3d =: (16bee&<: *. 16bef&>:)@{. *. cont@(1&{) *. cont@{:
testutf83 =: t3a +. t3b +. t3c +. t3d
utf83 =: [:`(#. @: (_4&{.@#:@{. , _6&{.@#:@(1&{) , _6&{.@#:@{:)) @. testutf83
t4a =: (16bf0&=)@{. *. (16b90&<: *. 16bbf&>:)@(1&{) *. cont@(2&{) *. cont@{:
t4b =: (16bf1&<: *. 16bf3&>:)@{. *. cont@(1&{) *. cont@(2&{) *. cont@{:
t4c =: (16bf4&=)@{. *. (16b80&<: *. 16b8f&>:)@(1&{) *. cont@(2&{) *. cont@{:
testutf84 =: t4a +. t4b +. t4c
utf84 =: [:`(#. @: (_3&{.@#:@{. , _6&{.@#:@(1&{) , _6&{.@#:@(2&{) , _6&{.@#:@{:)) @. testutf84
NB. accepts utf-8 code units; valid utf-8 code units return a valid code point
utf8ucp =: (utf81)`(utf82)`(utf83)`(utf84) @.( <:@#)
> On Sep 4, 2019, at 8:23 AM, 'robert therriault' via Programming
> <[email protected]> wrote:
>
>>>> On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]> wrote:
>>>>
>>>> The introductory page for Unicode
>>>>
>>>> https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
>>>>
>>>> does not discuss 4-byte characters, or the concept of surrogate pairs with
>>>> 2-byte characters.
>>>>
>>>> 4-byte precision is called unicode4 in NuVoc. If someone would add
>>>> discussion of these to the page, they would be a Hero. I'm just saying.
>>>>
>>>> Henry Rich