Sounds good, Henry. I will get to it.
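
For the wiki description, the surrogate-pair arithmetic can be sketched in a few lines of J. This is only an illustration under stated assumptions: the names tosurr and fromsurr are made up here and are not part of the draft code quoted below, and neither verb checks that its argument is in range.

```j
NB. Hypothetical helpers for illustration only; no range checking is done.

NB. tosurr: code point in 16b10000..16b10ffff -> UTF-16 surrogate pair
tosurr =: 3 : 0
v =. y - 16b10000                                NB. 20-bit offset above the BMP
(16bd800 + <. v % 16b400) , 16bdc00 + 16b400 | v NB. high 10 bits , low 10 bits
)

NB. fromsurr: surrogate pair (high , low) -> code point
fromsurr =: 3 : 0
hi =. ({. y) - 16bd800                           NB. top 10 bits of the offset
lo =. ({: y) - 16bdc00                           NB. bottom 10 bits of the offset
16b10000 + lo + 16b400 * hi
)
```

With these definitions, tosurr 128512 gives 55357 56832, the same pair that 3 u: 7 u: 128512 shows in the session quoted below, and fromsurr 55357 56832 gives back 128512.
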
Cheers, bob

> On Sep 13, 2019, at 10:59 AM, Henry Rich <[email protected]> wrote:
>
> Not being a user of this I would not be willing to make a suggestion to
> change anything. Bill Lam uses it every day; perhaps he has an opinion. I
> think actual practice may have more variation than the standard allows.
>
> For the wiki, I would be content with a description of what surrogate pairs
> are for and when they arise (especially when converting Unicode code points
> above U+FFFF to UTF-8 and UTF-16).
>
> Detail is great, but put it towards the end of the page if possible.
>
> Henry Rich
>
> On 9/13/2019 1:08 PM, 'robert therriault' via Programming wrote:
>> Sorry to take so long to come up with information on the surrogate pairs,
>> Henry, but it is both more complicated and simpler than it all appears.
>>
>> First, the simple part. Unicode has a codespace of 0 to 16b10ffff with a gap
>> from 16bd800 to 16bdfff (which is reserved for the surrogate pairs). This
>> means that the only valid codepoints that can represent characters are 0 to
>> 16bd7ff and 16be000 to 16b10ffff. The three different encoding schemes for
>> Unicode are UTF-8, UTF-16 and UTF-32 (the number indicating the number of
>> bits in the code unit).
>>
>> UTF-32 has enough bits to represent all of the Unicode code points as single
>> integers, but as mentioned above there are integers that do not represent
>> valid codepoints (greater than 16b10ffff or in the surrogate pair gap). J's
>> unicode4 seems to violate this by allowing surrogate pairs as valid unicode4:
>>
>>    9 u: 55357 56832 NB. A surrogate pair
>> 😀
>>    $ 9 u: 55357 56832 NB. UTF-32 is always one integer
>> 2
>>    3 u: 9 u: 55357 56832
>> 55357 56832 NB. Keeps result as a surrogate pair
>>    3!:0 [ 9 u: 55357 56832
>> 262144 NB. Unicode4
>>
>>    9 u: 128512 NB. Proper UTF-32 for 😀
>> 😀
>>    $ 9 u: 128512 NB. result is an atom with empty shape
>>
>>    3!:0 [ 9 u: 128512 NB. Unicode4 type
>> 262144
>>
>> So, why have surrogate pairs? That is where UTF-16 comes in. In order to
>> cover the entire codespace up to 16b10ffff by using at most two code units,
>> UTF-16 uses surrogate pairs, integers from 16bd800 to 16bdfff. The first
>> integer of the pair is in the range from 16bd800 to 16bdbff and the second
>> integer is in the range from 16bdc00 to 16bdfff. This encoding scheme
>> provides confirmation that the surrogate pair is valid and allows a mapping
>> to the code points from 16b10000 to 16b10ffff that would not normally be
>> within reach of a single 16 bit code unit, but can be reached by two 16 bit
>> code units.
>>
>>    3 u: 7 u: 16bffff NB. Top of range for one 16 bit code unit
>> 65535
>>    3 u: 7 u: 16b10000 NB. Maps to surrogate pairs
>> 55296 56320
>>    7 u: 128512
>> 😀
>>    3 u: 7 u: 128512
>> 55357 56832
>>    3!:0 [ 7 u: 128512 NB. Unicode type
>> 131072
>>
>> To complete the encoding options, UTF-8 maps the code points using one to
>> four 8 bit code units in a pretty clever way. If the first code unit is
>> between 0 and 16b7f then the encoding uses only one code unit, and this
>> establishes the use of 7-bit ASCII in UTF-8. If the first code unit is
>> between 16bc2 and 16bdf then it is always a two code unit encoding and the
>> second code unit must be within the range of 16b80 to 16bbf. Three code unit
>> encodings are signalled by a first code unit from 16be0 to 16bef and four
>> code unit encodings always begin with a code unit between 16bf0 and 16bf4.
>>
>>    8 u: 65
>> A
>>    3 u: 8 u: 65 NB. ASCII equivalent
>> 65
>>    3!:0 [ 8 u: 65 NB. literal
>> 2
>>    8 u: 295
>> ħ
>>    3 u: 8 u: 295
>> 196 167
>>    3!:0 [ 8 u: 295
>> 2
>>    8 u: 3101
>> ఝ
>>    3 u: 8 u: 3101
>> 224 176 157
>>    3!:0 [ 8 u: 3101
>> 2
>>    8 u: 128512
>> 😀
>>    3 u: 8 u: 128512
>> 240 159 152 128
>>    3!:0 [ 8 u: 128512 NB. Literal type
>> 2
>>
>> Again problems arise when interpreting surrogate pairs, although in this
>> case the result is an error and the pair is not interpreted the way
>> unicode4 does.
>>
>>    8 u: 55357 56832
>> ������
>>    3 u: 8 u: 55357 56832
>> 237 160 189 237 184 128
>>    3!:0 [ 8 u: 55357 56832
>> 2
>>
>> So, where does this leave us? Well, we are kind of... sort of... doing
>> unicode, but in making the process convenient, we have drifted from the
>> actual unicode standard.
>>
>> I'll wait to hear your response before I revise the wiki, as there are a
>> number of ways to go with explaining this, ranging from 'does not conform to
>> unicode spec' to 'it is what it is'.
>>
>> Cheers, bob
>>
>> I wrote some code that will mirror unicode spec more closely when converting
>> unicode code points to the different encodings and from the different
>> encodings back to unicode code points. It is not really that complicated,
>> aside from the checking for valid ranges of encoded results. The references
>> for my process can be found here on pages 125-127
>> https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf
>>
>>    utf32 128512
>> 128512
>>    displayutf32 128512
>> 😀
>>    utf32 55357 56832 NB. invalid codepoints
>> |domain error: utf32
>> |   utf32 55357 56832
>>
>>    utf32ucp 55357 56832 NB. invalid utf-32 encoding
>> |domain error: utf32
>> |   utf32ucp 55357 56832
>>    utf32ucp 128512 NB. valid utf-32 encoding returns unicode code point
>> 128512
>>
>>    utf16 128512
>> 55357 56832
>>    displayutf16 128512
>> 😀
>>    utf16 55357 56832 NB. still invalid codepoints
>> |rank error: utf16
>> |   utf16 55357 56832
>>
>>    utf16ucp 128512 NB. invalid utf-16 encoding
>> |domain error: utf161
>> |   utf16ucp 128512
>>    utf16ucp 55357 56832 NB. valid utf-16 encoding returns unicode code point
>> 128512
>>
>>    utf8 128512
>> 240 159 152 128
>>    displayutf8 128512
>> 😀
>>    utf8 55357 56832 NB. still invalid codepoints
>> |rank error: utf8
>> |   utf8 55357 56832
>>    utf8ucp 128512 NB. invalid utf-8 encoding
>> |domain error: utf81
>> |   utf8ucp 128512
>>    utf8ucp 240 159 152 128 NB. valid utf-8 encoding returns unicode code point
>> 128512
>>
>> And here is the code (rough draft and could certainly be made more readable).
>> Watch the word wrap on the longer lines.
>>
>> utf32 =: ]`[:`]`[:@.( 16bd7ff 16bdfff 16b10ffff & I.)"0 NB. accepts code points within code space returns utf-32
>> displayutf32 =: 9 u: utf32
>> utf32ucp =: utf32 NB. accepts utf-32 code units - valid utf-32 code units will return a valid code point
>>
>> utf16 =: ]`[:`]`([: 3&u: 7&u:)`[: @.( 16bd800 16bdfff 16bffff 16b10ffff & I.) NB. accepts code points within code space returns utf-16
>> displayutf16 =: 7 u: utf16
>> utf161 =: [: ^: (16bd77f&< +. 16be000&<: *. 16bfff&>:)
>> testutf162 =:((16bd800&<: *. 16bdbff&>:)@{. *. (16bdc00&<: *. 16bdfff&>:)@{:)
>> utf162 =: [:`(#. @: ((_20&{. #: 16b10000) + _20&{.@,@:(6&}."1)@#:)) @. testutf162
>> utf16ucp =: (utf161)`(utf162) @.( <:@#) NB. accepts utf-16 code units - valid utf-16 code units will return a valid code point
>>
>> utf8 =: (3&u:@":@(9&u:))`[:`(3&u:@":@(9&u:))`[: @.( 16bd7ff 16bdfff 16b10ffff & I.)
>> displayutf8 =: a. {~ utf8
>> utf81 =: [: ^: (16b7f&<)
>> testutf82 =: ((16bc2&<: *. 16bdf&>:)@{. *. (16b80&<: *. 16bbf&>:)@{:)
>> utf82 =: [:`(#. @: (_5&{.@#:@{. , _6&{.@#:@{:)) @. testutf82
>> testutf83 =: ((16be0&= )@{. *. (16ba0&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16be1&<: *. 16bec&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bed&= )@{. *. (16b80&<: *. 16b9f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bee&<: *. 16bef&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)
>> utf83 =: [:`(#. @: (_4&{.@#:@{. , _6&{.@#:@(1&{) ,_6&{.@#:@{:)) @. testutf83
>> testutf84 =: ((16bf0&= )@{. *. (16b90&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bf1&<: *. 16bf3&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bf4&= )@{. *. (16b80&<: *. 16b8f&>:)@(1&{) *. (16b80&<: *. 16b8f&>:)@(2&{) *. (16b80&<: *. 16b8f&>:)@{:)
>> utf84 =: [:`(#. @: (_3&{.@#:@{. , _6&{.@#:@(1&{) , _6&{.@#:@(2&{) ,_6&{.@#:@{:)) @. testutf84
>> utf8ucp =: (utf81)`(utf82)`(utf83)`(utf84) @.( <:@#) NB. accepts utf-8 code units - valid utf-8 code units will return a valid code point
>>
>>> On Sep 4, 2019, at 8:23 AM, 'robert therriault' via Programming
>>> <[email protected]> wrote:
>>>
>>>>>> On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]> wrote:
>>>>>>
>>>>>> The introductory page for Unicode
>>>>>>
>>>>>> https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
>>>>>>
>>>>>> does not discuss 4-byte characters, or the concept of surrogate pairs
>>>>>> with 2-byte characters.
>>>>>>
>>>>>> 4-byte precision is called unicode4 in NuVoc. If someone would add
>>>>>> discussion of these to the page, they would be a Hero. I'm just saying.
>>>>>>
>>>>>> Henry Rich
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
