Thanks Bill,
I think that I can still safely describe the use of surrogate pairs in unicode
and unicode4 though, because it seems that those encodings do follow the utf-16
approach, right?
Also, it seems to me that the byte encoding does follow the utf-8 approach,
where the 128 ASCII characters (0 to 127) are just a special case of utf-8
encoding, or is that the wrong way to look at it?
8 u: 65
A
3 u: 8 u: 65
65
65 { a.
A
8 u: 162
¢
3 u: 8 u: 162
194 162
162 { a.
�
194 162 { a.
¢
Cheers, bob
> On Sep 13, 2019, at 3:42 PM, bill lam <[email protected]> wrote:
>
> Bob,
>
> You are right about the range of valid unicode codepoint for characters.
>
> The J document perhaps didn't place enough emphasis on the difference
> between unicode2/unicode4 J datatype and utf16/utf32 encoding.
>
> Putting it more explicitly, unicode2/unicode4 J datatype has nothing to do
> with utf16/utf32 unicode encoding. unicode2 is 16-bit wide literal and all
> bit patterns are always valid, and similarly for unicode4. However, this may
> leave users more confused 😂
>
> Automatic type promotion from byte to unicode2 to unicode4 is always done
> atom by atom.
>
>
> On Sat, Sep 14, 2019, 1:08 AM 'robert therriault' via Programming <
> [email protected]> wrote:
>
>> Sorry to take so long to come up with information on the surrogate pairs,
>> Henry, but it is both more complicated and simpler than it all appears.
>>
>> First, the simple part. Unicode has a codespace of 0 to 16b10ffff with a
>> gap from 16bd800 to 16bdfff (which is reserved for the surrogate pairs).
>> This means that the only valid codepoints that can represent characters are
>> 0 to 16bd7ff and 16be000 to 16b10ffff. The three different encoding schemes
>> for Unicode are UTF-8, UTF-16 and UTF-32 (the number indicating the number
>> of bits in the code unit).
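>>
>> As a rough sketch of that validity rule (isscalar is only an illustrative
>> name, not part of J or of the code later in this message):
>>
>> isscalar =: (0&<: *. <:&16b10ffff) *. -.@(16bd800&<: *. <:&16bdfff)
>> isscalar 65 55357 128512 1114112 NB. 1 0 1 0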
>>
>> UTF-32 has enough bits to represent all of the Unicode code points as
>> single integers, but as mentioned above there are integers that do not
>> represent valid codepoints (greater than 16b10ffff or in the surrogate pair
>> gap). J's unicode4 seems to violate this by allowing surrogate pairs as
>> valid unicode4:
>>
>> 9 u: 55357 56832 NB. A surrogate pair
>> 😀
>> $ 9 u: 55357 56832 NB. UTF-32 is always one integer
>> 2
>> 3 u: 9 u: 55357 56832
>> 55357 56832 NB. Keeps result as a surrogate pair
>> 3!:0 [ 9 u: 55357 56832
>> 262144 NB. Unicode4
>>
>> 9 u: 128512 NB. Proper UTF-32 for 😀
>> 😀
>> $ 9 u: 128512 NB. result is an atom with empty shape
>>
>> 3!:0 [ 9 u: 128512 NB. Unicode4 type
>> 262144
>>
>> So, why have surrogate pairs? That is where UTF-16 comes in. In order to
>> cover the entire codespace up to 16b10ffff by using at most two code units,
>> UTF-16 uses surrogate pairs, integers from 16bd800 to 16bdfff. The first
>> integer of the pair is in the range from 16bd800 to 16bdbff and the second
>> integer is in the range from 16bdc00 to 16bdfff. This encoding scheme
>> provides confirmation that the surrogate pair is valid and allows a mapping
>> to the code points from 16b10000 to 16b10ffff that would not normally be
>> within reach of a single 16 bit code unit, but can be reached by two 16 bit
>> code units.
>>
>> 3 u: 7 u: 16bffff NB. Top of range for one 16 bit code unit
>> 65535
>> 3 u: 7 u: 16b10000 NB. Maps to surrogate pairs
>> 55296 56320
>> 7 u: 128512
>> 😀
>> 3 u: 7 u: 128512
>> 55357 56832
>> 3!:0 [ 7 u: 128512 NB. Unicode type
>> 131072
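>>
>> The arithmetic behind that mapping can be sketched by hand (a worked
>> example only, not necessarily how 7 u: does it internally): subtract
>> 16b10000, split what is left into two 10-bit digits, and add them to the
>> bottom of each surrogate range. The second line reverses the process.
>>
>> 16bd800 16bdc00 + 1024 1024 #: 128512 - 16b10000 NB. 55357 56832
>> 16b10000 + 1024 1024 #. 55357 56832 - 16bd800 16bdc00 NB. 128512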
>>
>> To complete the encoding options, UTF-8 maps the code points using one
>> to four 8 bit code units in a pretty clever way. If the first code unit is
>> between 0 and 16b7f then the encoding uses only one code unit and this
>> establishes the use of 7-bit ASCII in UTF-8. If the first code unit is
>> between 16bc2 and 16bdf then it is always a two code unit encoding and the
>> second code unit must be within the range of 16b80 and 16bbf. Three code
>> unit encodings are signalled by a first code unit from 16be0 to 16bef and
>> four code unit encodings always begin with a code unit between 16bf0 and
>> 16bf4.
>>
>> 8 u: 65
>> A
>> 3 u: 8 u: 65 NB. ASCII equivalent
>> 65
>> 3!:0 [ 8 u: 65 NB. literal
>> 2
>> 8 u: 295
>> ħ
>> 3 u: 8 u: 295
>> 196 167
>> 3!:0 [ 8 u: 295
>> 2
>> 8 u: 3101
>> ఝ
>> 3 u: 8 u: 3101
>> 224 176 157
>> 3!:0 [ 8 u: 3101
>> 2
>> 8 u: 128512
>> 😀
>> 3 u: 8 u: 128512
>> 240 159 152 128
>> 3!:0 [ 8 u: 128512 NB. Literal type
>> 2
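>>
>> The byte values above can also be worked out by hand from the bit layout
>> (a sketch only, not how J does it internally): split the code point into
>> 6-bit digits and add them to the marker bits of each code unit.
>>
>> 16bc0 16b80 + 32 64 #: 295 NB. 196 167
>> 16be0 16b80 16b80 + 16 64 64 #: 3101 NB. 224 176 157
>> 16bf0 16b80 16b80 16b80 + 8 64 64 64 #: 128512 NB. 240 159 152 128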
>>
>> Again problems arise when interpreting surrogate pairs, although in this
>> case the result is invalid UTF-8 that will not display, rather than the
>> character that unicode4 shows.
>> 8 u: 55357 56832
>> ������
>> 3 u: 8 u: 55357 56832
>> 237 160 189 237 184 128
>> 3!:0 [ 8 u: 55357 56832
>> 2
>>
>> So, where does this leave us? Well, we are kind of... sort of... doing
>> unicode, but in making things convenient, we have drifted from the actual
>> unicode standard.
>>
>> I'll wait to hear your response before I revise the wiki, as there are a
>> number of ways to go with explaining this, ranging from 'does not conform
>> to unicode spec' to 'it is what it is'
>>
>> Cheers, bob
>>
>> I wrote some code that will mirror unicode spec more closely when
>> converting unicode code points to the different encodings and from the
>> different encodings back to unicode code points. It is not really that
>> complicated, aside from the checking for valid ranges of encoded results.
>> The references for my process can be found here on pages 125-127
>> https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf
>>
>>
>> utf32 128512
>> 128512
>> displayutf32 128512
>> 😀
>> utf32 55357 56832 NB. invalid codepoints
>> |domain error: utf32
>> | utf32 55357 56832
>>
>> utf32ucp 55357 56832 NB. invalid utf-32 encoding
>> |domain error: utf32
>> | utf32ucp 55357 56832
>> utf32ucp 128512 NB. valid utf-32 encoding returns unicode code point
>> 128512
>>
>> utf16 128512
>> 55357 56832
>> displayutf16 128512
>> 😀
>> utf16 55357 56832 NB. still invalid codepoints
>> |rank error: utf16
>> | utf16 55357 56832
>>
>> utf16ucp 128512 NB. invalid utf-16 encoding
>> |domain error: utf161
>> | utf16ucp 128512
>> utf16ucp 55357 56832 NB. valid utf-16 encoding returns unicode code point
>> 128512
>>
>> utf8 128512
>> 240 159 152 128
>> displayutf8 128512
>> 😀
>> utf8 55357 56832 NB. still invalid codepoints
>> |rank error: utf8
>> | utf8 55357 56832
>> utf8ucp 128512 NB. invalid utf-8 encoding
>> |domain error: utf81
>> | utf8ucp 128512
>> utf8ucp 240 159 152 128 NB. valid utf-8 encoding returns unicode code point
>> 128512
>>
>>
>> And here is the code (rough draft and could certainly be made more
>> readable). Watch the word wrap on the longer lines.
>>
>> NB. accepts code points within code space, returns utf-32
>> utf32 =: ]`[:`]`[:@.( 16bd7ff 16bdfff 16b10ffff & I.)"0
>> displayutf32 =: 9 u: utf32
>> NB. accepts utf-32 code units; valid units return a valid code point
>> utf32ucp =: utf32
>>
>> NB. accepts code points within code space, returns utf-16
>> utf16 =: ]`[:`]`([: 3&u: 7&u:)`[: @.( 16bd7ff 16bdfff 16bffff 16b10ffff & I.)
>> displayutf16 =: 7 u: utf16
>> utf161 =: [: ^: ((16bd7ff&< *. 16be000&>) +. 16bffff&<)
>> testutf162 =:((16bd800&<: *. 16bdbff&>:)@{. *. (16bdc00&<: *.
>> 16bdfff&>:)@{:)
>> utf162 =: [:`(#. @: ((_20&{. #: 16b10000) + _20&{.@,@:(6&}."1)@#:)) @.
>> testutf162
>> NB. accepts utf-16 code units; valid units return a valid code point
>> utf16ucp =: (utf161)`(utf162) @.( <:@#)
>>
>> utf8 =: (3&u:@":@(9&u:))`[:`(3&u:@":@(9&u:))`[: @.( 16bd7ff 16bdfff
>> 16b10ffff & I.)
>> displayutf8 =: a. {~ utf8
>> utf81 =: [: ^: (16b7f&<)
>> testutf82 =: ((16bc2&<: *. 16bdf&>:)@{. *. (16b80&<: *. 16bbf&>:)@{:)
>> utf82 =: [:`(#. @: (_5&{.@#:@{. , _6&{.@#:@{:)) @. testutf82
>> testutf83 =: ((16be0&= )@{. *. (16ba0&<: *. 16bbf&>:)@(1&{) *. (16b80&<:
>> *. 16bbf&>:)@{:) +. ((16be1&<: *. 16bec&>:)@{. *. (16b80&<: *.
>> 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bed&= )@{. *.
>> (16b80&<: *. 16b9f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:) +.
>> ((16bee&<: *. 16bef&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *.
>> 16bbf&>:)@{:)
>> utf83 =: [:`(#. @: (_4&{.@#:@{. , _6&{.@#:@(1&{) ,_6&{.@#:@{:)) @.
>> testutf83
>> testutf84 =: ((16bf0&= )@{. *. (16b90&<: *. 16bbf&>:)@(1&{) *. (16b80&<:
>> *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bf1&<: *.
>> 16bf3&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *.
>> 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:) +. ((16bf4&= )@{. *.
>> (16b80&<: *. 16b8f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<:
>> *. 16bbf&>:)@{:)
>> utf84 =: [:`(#. @: (_3&{.@#:@{. , _6&{.@#:@(1&{) , _6&{.@#:@(2&{)
>> ,_6&{.@#:@{:))
>> @. testutf84
>> NB. accepts utf-8 code units; valid units return a valid code point
>> utf8ucp =: (utf81)`(utf82)`(utf83)`(utf84) @.( <:@#)
>>
>>
>>> On Sep 4, 2019, at 8:23 AM, 'robert therriault' via Programming <
>> [email protected]> wrote:
>>>
>>>>>> On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]> wrote:
>>>>>>
>>>>>> The introductory page for Unicode
>>>>>>
>>>>>> https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
>>>>>>
>>>>>> does not discuss 4-byte characters, or the concept of surrogate pairs
>> with 2-byte characters.
>>>>>>
>>>>>> 4-byte precision is called unicode4 in NuVoc. If someone would add
>> discussion of these to the page, they would be a Hero. I'm just saying.
>>>>>>
>>>>>> Henry Rich
>>
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm