Bob, you are right about the range of valid Unicode code points for characters.
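For reference, the range in question (0 to 16bd7ff and 16be000 to 16b10ffff, as described in the quoted message below) could be checked with a small sketch like the following. The names inrange, surrogate and isscalar are hypothetical, chosen only for illustration; they are not part of J or of Bob's code, and the sketch tests the code point range only, not any encoding.

```j
NB. Hypothetical helpers (assumed names, not from the thread or the J library).
NB. inrange:   1 where y lies in the Unicode codespace 0 .. 16b10ffff
NB. surrogate: 1 where y lies in the surrogate gap 16bd800 .. 16bdfff
NB. isscalar:  1 where y is a code point that can represent a character
inrange   =: (0&<:) *. (16b10ffff&>:)
surrogate =: (16bd800&<:) *. (16bdfff&>:)
isscalar  =: inrange *. -.@surrogate

isscalar 65 55357 56832 128512 1114112   NB. 1 0 0 1 0
```

The two halves of the surrogate pair 55357 56832 are rejected individually, while 128512 (the code point they jointly encode in UTF-16) is accepted; 1114112 is 16b110000, one past the top of the codespace.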
The J documentation perhaps didn't place enough emphasis on the difference between the unicode2/unicode4 J datatypes and the UTF-16/UTF-32 encodings. To put it more explicitly, the unicode2/unicode4 J datatypes have nothing to do with the UTF-16/UTF-32 Unicode encodings. unicode2 is a 16-bit wide literal in which all bit patterns are always valid, and similarly for unicode4. However, this may confuse users even more 😂: automatic type promotion from byte to unicode2 to unicode4 is always done atom by atom.

On Sat, Sep 14, 2019, 1:08 AM 'robert therriault' via Programming <[email protected]> wrote:

> Sorry to take so long to come up with information on the surrogate pairs,
> Henry, but it is both more complicated and simpler than it all appears.
>
> First, the simple part. Unicode has a codespace of 0 to 16b10ffff with a
> gap from 16bd800 to 16bdfff (which is reserved for the surrogate pairs).
> This means that the only valid codepoints that can represent characters are
> 0 to 16bd7ff and 16be000 to 16b10ffff. The three different encoding schemes
> for Unicode are UTF-8, UTF-16 and UTF-32 (the number indicating the number
> of bits in the code unit).
>
> UTF-32 has enough bits to represent all of the Unicode code points as
> single integers, but as mentioned above there are integers that do not
> represent valid codepoints (greater than 16b10ffff or in the surrogate pair
> gap). J's unicode4 seems to violate this by allowing surrogate pairs as
> valid unicode4:
>
>    9 u: 55357 56832 NB. A surrogate pair
> 😀
>    $ 9 u: 55357 56832 NB. UTF-32 is always one integer
> 2
>    3 u: 9 u: 55357 56832
> 55357 56832 NB. Keeps result as a surrogate pair
>    3!:0 [ 9 u: 55357 56832
> 262144 NB. Unicode4
>
>    9 u: 128512 NB. Proper UTF-32 for 😀
> 😀
>    $ 9 u: 128512 NB. result is an atom with empty shape
>
>    3!:0 [ 9 u: 128512 NB. Unicode4 type
> 262144
>
> So, why have surrogate pairs? That is where UTF-16 comes in.
> In order to cover the entire codespace up to 16b10ffff by using at most two
> code units, UTF-16 uses surrogate pairs, integers from 16bd800 to 16bdfff.
> The first integer of the pair is in the range from 16bd800 to 16bdbff and
> the second integer is in the range from 16bdc00 to 16bdfff. This encoding
> scheme provides confirmation that the surrogate pair is valid and allows a
> mapping to the code points from 16b10000 to 16b10ffff that would not
> normally be within reach of a single 16 bit code unit, but can be reached
> by two 16 bit code units.
>
>    3 u: 7 u: 16bffff NB. Top of range for one 16 bit code unit
> 65535
>    3 u: 7 u: 16b10000 NB. Maps to surrogate pairs
> 55296 56320
>    7 u: 128512
> 😀
>    3 u: 7 u: 128512
> 55357 56832
>    3!:0 [ 7 u: 128512 NB. Unicode type
> 131072
>
> To complete the encoding options, UTF-8 maps the code points using one
> to four 8 bit code units in a pretty clever way. If the first code unit is
> between 0 and 16b7f then the encoding uses only one code unit and this
> establishes the use of 7-bit ASCII in UTF-8. If the first code unit is
> between 16bc2 and 16bdf then it is always a two code unit encoding and the
> second code unit must be within the range of 16b80 and 16bbf. Three code
> unit encodings are signalled by a first code unit from 16be0 to 16bef and
> four code unit encodings always begin with a code unit between 16bf0 and
> 16bf4.
>
>    8 u: 65
> A
>    3 u: 8 u: 65 NB. ASCII equivalent
> 65
>    3!:0 [ 8 u: 65 NB. literal
> 2
>    8 u: 295
> ħ
>    3 u: 8 u: 295
> 196 167
>    3!:0 [ 8 u: 295
> 2
>    8 u: 3101
> ఝ
>    3 u: 8 u: 3101
> 224 176 157
>    3!:0 [ 8 u: 3101
> 2
>    8 u: 128512
> 😀
>    3 u: 8 u: 128512
> 240 159 152 128
>    3!:0 [ 8 u: 128512 NB. Literal type
> 2
>
> Again problems arise when interpreting surrogate pairs, although in this
> case the result is an error and is not interpreted the way unicode4 does it.
>
>    8 u: 55357 56832
> ������
>    3 u: 8 u: 55357 56832
> 237 160 189 237 184 128
>    3!:0 [ 8 u: 55357 56832
> 2
>
> So, where does this leave us? Well, we are kind of... sort of... doing
> unicode, but in the process of making the process convenient, we have
> drifted from the actual unicode standard.
>
> I'll wait to hear your response before I revise the wiki, as there are a
> number of ways to go with explaining this, ranging from 'does not conform
> to unicode spec' to 'it is what it is'
>
> Cheers, bob
>
> I wrote some code that will mirror unicode spec more closely when
> converting unicode code points to the different encodings and from the
> different encodings back to unicode code points. It is not really that
> complicated, aside from the checking for valid ranges of encoded results.
> The references for my process can be found here on pages 125-127
> https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf
>
>
>    utf32 128512
> 128512
>    displayutf32 128512
> 😀
>    utf32 55357 56832 NB. invalid codepoints
> |domain error: utf32
> |   utf32 55357 56832
>
>    utf32ucp 55357 56832 NB. invalid utf-32 encoding
> |domain error: utf32
> |   utf32ucp 55357 56832
>    utf32ucp 128512 NB. valid utf-32 encoding returns unicode code point
> 128512
>
>    utf16 128512
> 55357 56832
>    displayutf16 128512
> 😀
>    utf16 55357 56832 NB. still invalid codepoints
> |rank error: utf16
> |   utf16 55357 56832
>
>    utf16ucp 128512 NB. invalid utf-16 encoding
> |domain error: utf161
> |   utf16ucp 128512
>    utf16ucp 55357 56832 NB. valid utf-16 encoding returns unicode code point
> 128512
>
>    utf8 128512
> 240 159 152 128
>    displayutf8 128512
> 😀
>    utf8 55357 56832 NB. still invalid codepoints
> |rank error: utf8
> |   utf8 55357 56832
>    utf8ucp 128512 NB. invalid utf-8 encoding
> |domain error: utf81
> |   utf8ucp 128512
>    utf8ucp 240 159 152 128 NB. valid utf-8 encoding returns unicode code point
> 128512
>
>
> And here is the code (rough draft and could certainly be made more
> readable) watch the word wrap on the longer lines.
>
> utf32 =: ]`[:`]`[:@.( 16bd7ff 16bdfff 16b10ffff & I.)"0 NB. accepts code points within code space returns utf-32
> displayutf32 =: 9 u: utf32
> utf32ucp =: utf32 NB. accepts utf-32 code units - valid utf-32 code units will return a valid code point
>
> utf16 =: ]`[:`]`([: 3&u: 7&u:)`[: @.( 16bd7ff 16bdfff 16bffff 16b10ffff & I.) NB. accepts code points within code space returns utf-16
> displayutf16 =: 7 u: utf16
> utf161 =: [: ^: (16bd7ff&< *. 16be000&> +. 16bffff&<)
> testutf162 =:((16bd800&<: *. 16bdbff&>:)@{. *. (16bdc00&<: *. 16bdfff&>:)@{:)
> utf162 =: [:`(#. @: ((_20&{. #: 16b10000) + _20&{.@,@:(6&}."1)@#:)) @. testutf162
> utf16ucp =: (utf161)`(utf162) @.( <:@#) NB. accepts utf-16 code units - valid utf-16 code units will return a valid code point
>
> utf8 =: (3&u:@":@(9&u:))`[:`(3&u:@":@(9&u:))`[: @.( 16bd7ff 16bdfff 16b10ffff & I.)
> displayutf8 =: a. {~ utf8
> utf81 =: [: ^: (16b7f&<)
> testutf82 =: ((16bc2&<: *. 16bdf&>:)@{. *. (16b80&<: *. 16bbf&>:)@{:)
> utf82 =: [:`(#. @: (_5&{.@#:@{. , _6&{.@#:@{:)) @. testutf82
> testutf83 =: ((16be0&= )@{. *. (16ba0&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16be1&<: *. 16bec&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bed&= )@{. *. (16b80&<: *. 16b9f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bee&<: *. 16bef&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)
> utf83 =: [:`(#. @: (_4&{.@#:@{. , _6&{.@#:@(1&{) ,_6&{.@#:@{:)) @. testutf83
> testutf84 =: ((16bf0&= )@{. *. (16b90&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bf1&<: *. 16bf3&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bf4&= )@{. *. (16b80&<: *. 16b8f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)
> utf84 =: [:`(#. @: (_3&{.@#:@{. , _6&{.@#:@(1&{) , _6&{.@#:@(2&{) ,_6&{.@#:@{:)) @. testutf84
> utf8ucp =: (utf81)`(utf82)`(utf83)`(utf84) @.( <:@#) NB. accepts utf-8 code units - valid utf-8 code units will return a valid code point
>
>
> > On Sep 4, 2019, at 8:23 AM, 'robert therriault' via Programming <[email protected]> wrote:
> >
> >>>> On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]> wrote:
> >>>>
> >>>> The introductory page for Unicode
> >>>>
> >>>> https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
> >>>>
> >>>> does not discuss 4-byte characters, or the concept of surrogate pairs with 2-byte characters.
> >>>>
> >>>> 4-byte precision is called unicode4 in NuVoc. If someone would add discussion of these to the page, they would be a Hero. I'm just saying.
> >>>>
> >>>> Henry Rich
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
