Re: [Jprogramming] Writing help needed: surrogate pairs

'robert therriault' via Programming Fri, 13 Sep 2019 11:04:16 -0700

Sounds good, Henry

I will get to it.


Cheers, bob

> On Sep 13, 2019, at 10:59 AM, Henry Rich <[email protected]> wrote:
> 
> Not being a user of this I would not be willing to make a suggestion to 
> change anything.  Bill Lam uses it every day; perhaps he has an opinion.  I 
> think actual practice may have more variation than the standard allows.
> 
> For the wiki, I would be content with a description of what surrogate pairs 
> are for and when they arise (especially when converting Unicode code points 
> above U+FFFF to UTF-8 and UTF-16).
> 
> Detail is great, but put it towards the end of the page if possible.
> 
> Henry Rich
> 
> On 9/13/2019 1:08 PM, 'robert therriault' via Programming wrote:
>> Sorry to take so long to come up with information on the surrogate pairs, 
>> Henry, but it is both more complicated and simpler than it all appears.
>> 
>> First, the simple part. Unicode has a codespace of 0 to 16b10ffff with a gap 
>> from 16bd800 to 16bdfff (which is reserved for the surrogate pairs). This 
>> means that the only valid codepoints that can represent characters are 0 to 
>> 16bd7ff and 16be000 to 16b10ffff. The three different encoding schemes for 
>> Unicode are UTF-8, UTF-16 and UTF-32 (the number indicating the number of 
>> bits in the code unit).
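
As a rough illustration of the ranges just described, a validity check for code points might look like the sketch below. The name isscalar is made up for this example and is not part of the code posted in this thread.

```j
NB. Hypothetical helper: 1 where y lies in the codespace (0 to 16b10ffff)
NB. and outside the surrogate gap (16bd800 to 16bdfff), 0 otherwise.
isscalar =: 3 : '(y >: 0) *. (y <: 16b10ffff) *. -. (y >: 16bd800) *. y <: 16bdfff'

isscalar 65 55357 128512 1114112   NB. 1 0 1 0
```

Here 55357 falls in the surrogate gap and 1114112 (16b110000) is one past the top of the codespace.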
>> 
>> UTF-32 has enough bits to represent all of the Unicode code points as single 
>> integers, but as mentioned above there are integers that do not represent 
>> valid codepoints (greater than 16b10ffff or in the surrogate pair gap). J's 
>> unicode4 seems to violate this by allowing surrogate pairs as valid unicode4:
>>    9 u:  55357 56832 NB. A surrogate pair
>> 😀
>>    $ 9 u: 55357 56832  NB. UTF-32 is always one integer
>> 2
>>    3 u: 9 u: 55357 56832
>> 55357 56832  NB. Keeps result as a surrogate pair
>>    3!:0 [ 9 u:  55357 56832
>> 262144  NB. Unicode4
>> 
>>     9 u:  128512  NB. Proper UTF-32 for 😀
>> 😀
>>    $ 9 u: 128512 NB. result is an atom with empty shape
>> 
>>    3!:0 [ 9 u:  128512  NB. Unicode4 type
>> 262144
>> 
>> So, why have surrogate pairs? That is where UTF-16 comes in. In order to 
>> cover the entire codespace up to 16b10ffff by using at most two code units, 
>> UTF-16 uses surrogate pairs, integers from 16bd800 to 16bdfff. The first 
>> integer of the pair is in the range from 16bd800 to 16bdbff and the second 
>> integer is in the range  from 16bdc00 to 16bdfff. This encoding scheme 
>> provides confirmation that the surrogate pair is valid and allows a mapping 
>> to the code points from 16b10000 to 16b10ffff that would not normally be 
>> within reach of a single 16 bit code unit, but can be reached by two 16 bit 
>> code units.
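
The arithmetic behind the pair can be sketched in a couple of lines. This is only an illustration of the usual UTF-16 rule (subtract 16b10000, then split the remaining 20 bits into two 10-bit halves); tosurr and fromsurr are hypothetical names, not J primitives and not part of the code posted in this thread, and neither checks its argument for validity.

```j
NB. Hypothetical helper: split one code point in 16b10000 to 16b10ffff
NB. into a high surrogate and a low surrogate.
tosurr   =: 3 : '16bd800 16bdc00 + 0 1024 #: y - 16b10000'

NB. Hypothetical helper: combine a high/low surrogate pair into a code point.
fromsurr =: 3 : '16b10000 + 1024 #. y - 16bd800 16bdc00'

tosurr 128512          NB. 55357 56832
fromsurr 55357 56832   NB. 128512
tosurr 16b10000        NB. 55296 56320
```

The results agree with what 3 u: 7 u: reports for the same code points.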
>> 
>>    3 u: 7 u: 16bffff   NB. Top of range for one 16 bit code unit
>> 65535
>>    3 u: 7 u: 16b10000  NB. Maps to surrogate pairs
>> 55296 56320
>>       7 u: 128512
>> 😀
>>    3 u: 7 u: 128512
>> 55357 56832
>>    3!:0 [ 7 u:  128512  NB. Unicode type
>> 131072
>> 
>> To complete the encoding options, UTF-8 maps the code points using one to 
>> four 8 bit code units in a pretty clever way. If the first code unit is 
>> between 0 and 16b7f then the encoding uses only one code unit, and this 
>> preserves 7-bit ASCII within UTF-8. If the first code unit is between 16bc2 
>> and 16bdf then it is always a two code unit encoding and the second code 
>> unit must be within the range of 16b80 to 16bbf. Three code unit encodings 
>> are signalled by a first code unit from 16be0 to 16bef and four code unit 
>> encodings always begin with a code unit between 16bf0 and 16bf4.
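
The first-code-unit ranges above can be turned into a small lookup. This is a sketch only; u8len is a made-up name, and returning 0 for values that cannot start a sequence (trailing units 16b80 to 16bbf, plus 16bc0, 16bc1 and anything above 16bf4) is a convention chosen for this example.

```j
NB. Hypothetical helper: number of code units in the UTF-8 sequence that
NB. starts with code unit y; 0 means y cannot start a well-formed sequence.
NB. Dyadic I. treats each boundary as an inclusive upper limit.
u8len =: 3 : '(16b7f 16bc1 16bdf 16bef 16bf4 I. y) { 1 0 2 3 4 0'

u8len 65 196 224 240 167   NB. 1 2 3 4 0
```

The first four arguments are the leading code units of the A, ħ, ఝ and 😀 encodings; 167 is a trailing code unit.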
>> 
>>     8 u: 65
>> A
>>    3 u: 8 u: 65  NB. ASCII equivalent
>> 65
>>    3!:0 [ 8 u: 65  NB. literal
>> 2
>>    8 u: 295
>> ħ
>>    3 u: 8 u: 295
>> 196 167
>>    3!:0 [ 8 u: 295
>> 2
>>    8 u: 3101
>> ఝ
>>    3 u: 8 u: 3101
>> 224 176 157
>>    3!:0 [ 8 u: 3101
>> 2
>>    8 u: 128512
>> 😀
>>   3 u: 8 u: 128512
>> 240 159 152 128
>>    3!:0 [ 8 u: 128512  NB. Literal type
>> 2
>> 
>> Again problems arise when interpreting surrogate pairs, although in this 
>> case the result is erroneous output (each surrogate is encoded on its own 
>> as three code units) rather than being interpreted the way unicode4 does.
>>    8 u: 55357 56832
>> ������
>>    3 u: 8 u: 55357 56832
>> 237 160 189 237 184 128
>>    3!:0 [ 8 u: 55357 56832
>> 2
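
A quick check with base arithmetic suggests where those six code units come from. Assuming the standard three code unit layout (marker bits worth 224 on the first unit and 128 on each trailing unit, with the rest read as base-64 digits), each half of the output decodes to one of the two surrogates:

```j
NB. Strip the marker bits from each three code unit group, then read the
NB. remaining values as base-64 digits.
64 #. 237 160 189 - 224 128 128   NB. 55357
64 #. 237 184 128 - 224 128 128   NB. 56832
```

So 8 u: has encoded the two surrogates separately instead of combining them into the single code point 128512, whose encoding is 240 159 152 128.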
>> 
>> So, where does this leave us? Well, we are kind of... sort of... doing 
>> unicode, but in making things convenient, we have drifted from the actual 
>> unicode standard.
>> 
>> I'll wait to hear your response before I revise the wiki, as there are a 
>> number of ways to go with explaining this, ranging from 'does not conform to 
>> unicode spec' to 'it is what it is'
>> 
>> Cheers, bob
>> 
>> I wrote some code that will mirror unicode spec more closely when converting 
>> unicode code points to the different encodings and from the different 
>> encodings back to unicode code points. It is not really that complicated, 
>> aside from the checking for valid ranges of encoded results. The references 
>> for my process can be found here on pages 125-127 
>> https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf
>> 
>> 
>>    utf32 128512
>> 128512
>>    displayutf32 128512
>> 😀
>>    utf32 55357 56832  NB. invalid codepoints
>> |domain error: utf32
>> |       utf32 55357 56832
>> 
>>    utf32ucp 55357 56832 NB. invalid utf-32 encoding
>> |domain error: utf32
>> |       utf32ucp 55357 56832
>>    utf32ucp 128512 NB. valid utf-32 encoding returns unicode code point
>> 128512
>> 
>>    utf16 128512
>> 55357 56832
>>    displayutf16 128512
>> 😀
>>    utf16 55357 56832  NB. still invalid codepoints
>> |rank error: utf16
>> |       utf16 55357 56832
>> 
>>    utf16ucp 128512  NB. invalid utf-16 encoding
>> |domain error: utf161
>> |       utf16ucp 128512
>>    utf16ucp 55357 56832  NB. valid utf-16 encoding returns unicode code point
>> 128512
>> 
>>    utf8 128512
>> 240 159 152 128
>>    displayutf8 128512
>> 😀
>>    utf8 55357 56832  NB. still invalid codepoints
>> |rank error: utf8
>> |       utf8 55357 56832
>>    utf8ucp 128512  NB. invalid utf-8 encoding
>> |domain error: utf81
>> |       utf8ucp 128512
>>    utf8ucp 240 159 152 128  NB. valid utf-8 encoding returns unicode code point
>> 128512
>> 
>> 
>> And here is the code (rough draft and could certainly be made more readable) 
>> watch the word wrap on the longer lines.
>> 
>> NB. accepts code points within code space returns utf-32
>> utf32 =: ]`[:`]`[:@.( 16bd7ff 16bdfff 16b10ffff & I.)"0
>> displayutf32 =: 9 u: utf32
>> NB. accepts utf-32 code units - valid utf-32 code units will return a valid code point
>> utf32ucp =: utf32
>> 
>> NB. accepts code points within code space returns utf-16
>> utf16 =: ]`[:`]`([: 3&u: 7&u:)`[: @.( 16bd7ff 16bdfff 16bffff 16b10ffff & I.)
>> displayutf16 =: 7 u: utf16
>> NB. single code unit: error if it is a surrogate or is above 16bffff
>> utf161 =: [: ^: ((16bd7ff&< *. 16be000&>) +. 16bffff&<)
>> testutf162 =:((16bd800&<: *. 16bdbff&>:)@{. *. (16bdc00&<: *. 16bdfff&>:)@{:)
>> utf162 =: [:`(#. @: ((_20&{. #: 16b10000) + _20&{.@,@:(6&}."1)@#:)) @. testutf162
>> NB. accepts utf-16 code units - valid utf-16 code units will return a valid code point
>> utf16ucp =: (utf161)`(utf162) @.( <:@#)
>> 
>> utf8 =: (3&u:@":@(9&u:))`[:`(3&u:@":@(9&u:))`[: @.( 16bd7ff 16bdfff 16b10ffff & I.)
>> displayutf8 =: a. {~ utf8
>> utf81 =: [: ^: (16b7f&<)
>> testutf82 =:  ((16bc2&<: *. 16bdf&>:)@{. *. (16b80&<: *. 16bbf&>:)@{:)
>> utf82 =: [:`(#. @: (_5&{.@#:@{. , _6&{.@#:@{:)) @. testutf82
>> testutf83 =: ((16be0&= )@{. *. (16ba0&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 
>> 16bbf&>:)@{:)   +.   ((16be1&<: *. 16bec&>:)@{. *. (16b80&<: *. 
>> 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bed&= )@{. *. 
>> (16b80&<: *. 16b9f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bee&<: 
>> *. 16bef&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)
>> utf83 =: [:`(#. @: (_4&{.@#:@{. , _6&{.@#:@(1&{) ,_6&{.@#:@{:)) @. testutf83
>> testutf84 =: ((16bf0&= )@{. *. (16b90&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 
>> 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bf1&<: *. 
>> 16bf3&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) 
>> *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bf4&= )@{. *. (16b80&<: *. 
>> 16b8f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)
>> utf84 =: [:`(#. @: (_3&{.@#:@{. , _6&{.@#:@(1&{) , _6&{.@#:@(2&{) ,_6&{.@#:@{:)) @. testutf84
>> NB. accepts utf-8 code units - valid utf-8 code units will return a valid code point
>> utf8ucp =: (utf81)`(utf82)`(utf83)`(utf84) @.( <:@#)
>> 
>> 
>>> On Sep 4, 2019, at 8:23 AM, 'robert therriault' via Programming 
>>> <[email protected]> wrote:
>>> 
>>>>>> On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]> wrote:
>>>>>> 
>>>>>> The introductory page for Unicode
>>>>>> 
>>>>>> https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
>>>>>> 
>>>>>> does not discuss 4-byte characters, or the concept of surrogate pairs 
>>>>>> with 2-byte characters.
>>>>>> 
>>>>>> 4-byte precision is called unicode4 in NuVoc.  If someone would add 
>>>>>> discussion of these to the page, they would be a Hero.  I'm just saying.
>>>>>> 
>>>>>> Henry Rich
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm