Re: [Jprogramming] Writing help needed: surrogate pairs

bill lam Fri, 13 Sep 2019 15:43:18 -0700

Bob,

You are right about the range of valid Unicode codepoints for characters.


The J documentation perhaps doesn't place enough emphasis on the difference
between the unicode2/unicode4 J datatypes and the utf16/utf32 encodings.

Putting it more explicitly, the unicode2/unicode4 J datatypes have nothing to do
with the utf16/utf32 Unicode encodings. unicode2 is a 16-bit wide literal and all
bit patterns are always valid; similarly for unicode4. However, this may leave
users more confused 😂

Automatic type promotion from byte to unicode2 to unicode4 is always done atom
by atom.


On Sat, Sep 14, 2019, 1:08 AM 'robert therriault' via Programming <
[email protected]> wrote:

> Sorry to take so long to come up with information on the surrogate pairs,
> Henry, but it is both more complicated and simpler than it all appears.
>
> First, the simple part. Unicode has a codespace of 0 to 16b10ffff with a
> gap from 16bd800 to 16bdfff (which is reserved for the surrogate pairs).
> This means that the only valid codepoints that can represent characters are
> 0 to 16bd7ff and 16be000 to 16b10ffff. The three different encoding schemes
> for Unicode are UTF-8, UTF-16 and UTF-32 (the number indicating the number
> of bits in the code unit).
>
> UTF-32 has enough bits to represent all of the Unicode code points as
> single integers, but as mentioned above there are integers that do not
> represent valid codepoints (greater than 16b10ffff or in the surrogate pair
> gap). J's unicode4 seems to violate this by allowing surrogate pairs as
> valid unicode4.
>
>    9 u:  55357 56832 NB. A surrogate pair
> 😀
>    $ 9 u: 55357 56832  NB. UTF-32 is always one integer
> 2
>    3 u: 9 u: 55357 56832
> 55357 56832  NB. Keeps result as a surrogate pair
>    3!:0 [ 9 u:  55357 56832
> 262144  NB. Unicode4
>
>     9 u:  128512  NB. Proper UTF-32 for 😀
> 😀
>    $ 9 u: 128512 NB. result is an atom with empty shape
>
>    3!:0 [ 9 u:  128512  NB. Unicode4 type
> 262144
>
> So, why have surrogate pairs? That is where UTF-16 comes in. In order to
> cover the entire codespace up to 16b10ffff by using at most two code units,
> UTF-16 uses surrogate pairs, integers from 16bd800 to 16bdfff. The first
> integer of the pair is in the range from 16bd800 to 16bdbff and the second
> integer is in the range  from 16bdc00 to 16bdfff. This encoding scheme
> provides confirmation that the surrogate pair is valid and allows a mapping
> to the code points from 16b10000 to 16b10ffff that would not normally be
> within reach of a single 16 bit code unit, but can be reached by two 16 bit
> code units.
>
>    3 u: 7 u: 16bffff   NB. Top of range for one 16 bit code unit
> 65535
>    3 u: 7 u: 16b10000  NB. Maps to surrogate pairs
> 55296 56320
>       7 u: 128512
> 😀
>    3 u: 7 u: 128512
> 55357 56832
>    3!:0 [ 7 u:  128512  NB. Unicode type
> 131072
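
A rough sketch of the arithmetic behind that mapping, for anyone who wants to check it by hand. The names tosurr and fromsurr are hypothetical helpers, not part of the code in this thread, and they assume the argument is already known to be in the supplementary range (no validity checks):

```j
NB. tosurr and fromsurr are hypothetical names used only for illustration.
NB. Split (code point - 16b10000) into a high and a low 10-bit part,
NB. then add the base of each surrogate range.
tosurr   =: 3 : '16bd800 16bdc00 + 0 1024 #: y - 16b10000'
NB. Subtract the two bases, recombine in base 1024, add 16b10000 back.
fromsurr =: 3 : '16b10000 + 1024 #. y - 16bd800 16bdc00'

   tosurr 128512
55357 56832
   tosurr 16b10000
55296 56320
   fromsurr 55357 56832
128512
```

The results agree with what 3 u: 7 u: reports above for 128512 and 16b10000.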
>
> To complete the encoding options, UTF-8 maps the code points using one
> to four 8-bit code units in a pretty clever way. If the first code unit is
> between 0 and 16b7f then the encoding uses only one code unit, and this
> establishes the use of 7-bit ASCII in UTF-8. If the first code unit is between
> 16bc2 and 16bdf then it is always a two code unit encoding and the second
> code unit must be within the range of 16b80 to 16bbf. Three code unit
> encodings are signalled by a first code unit from 16be0 to 16bef and four
> code unit encodings always begin with a code unit between 16bf0 and 16bf4.
>
>     8 u: 65
> A
>    3 u: 8 u: 65  NB. ASCII equivalent
> 65
>    3!:0 [ 8 u: 65  NB. literal
> 2
>    8 u: 295
> ħ
>    3 u: 8 u: 295
> 196 167
>    3!:0 [ 8 u: 295
> 2
>    8 u: 3101
> ఝ
>    3 u: 8 u: 3101
> 224 176 157
>    3!:0 [ 8 u: 3101
> 2
>    8 u: 128512
> 😀
>   3 u: 8 u: 128512
> 240 159 152 128
>    3!:0 [ 8 u: 128512  NB. Literal type
> 2
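
The byte values in the session above can be reproduced with plain arithmetic. This is only a sketch of the bit layout (a marker in the first code unit, then 6 payload bits per following code unit), assuming the code point is already known to need that many code units:

```j
   16bc0 16b80 + 0 64 #: 295                       NB. two code units
196 167
   16be0 16b80 16b80 + 0 64 64 #: 3101             NB. three code units
224 176 157
   16bf0 16b80 16b80 16b80 + 0 64 64 64 #: 128512  NB. four code units
240 159 152 128
   #. (3 }. #: 196) , 2 }. #: 167                  NB. decoding: drop the marker bits, rejoin the payload
295
```

The last line goes the other way for the two code unit case: drop the 3 marker bits of the first code unit and the 2 marker bits of the second, then read the remaining 11 bits as one number.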
>
> Again problems arise when interpreting surrogate pairs, although in this
> case the result is invalid UTF-8 (each surrogate is encoded separately as
> three code units) rather than being interpreted the way unicode4 does.
>    8 u: 55357 56832
> ������
>    3 u: 8 u: 55357 56832
> 237 160 189 237 184 128
>    3!:0 [ 8 u: 55357 56832
> 2
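
Those six bytes are what falls out if each surrogate is pushed through the three code unit layout on its own. A small sketch of that arithmetic, offered as an illustration rather than a description of what the interpreter does internally:

```j
   16be0 16b80 16b80 + 0 64 64 #: 55357   NB. high surrogate through the three code unit layout
237 160 189
   16be0 16b80 16b80 + 0 64 64 #: 56832   NB. low surrogate
237 184 128
   160 184 > 16b9f                        NB. second code units are above the limit allowed after 16bed
1 1
```

The Unicode table of well-formed sequences only allows 16b80 to 16b9f after a first code unit of 16bed, which is exactly what keeps surrogates out of UTF-8.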
>
> So, where does this leave us? Well, we are kind of... sort of... doing
> Unicode, but in making things convenient, we have
> drifted from the actual Unicode standard.
>
> I'll wait to hear your response before I revise the wiki, as there are a
> number of ways to go with explaining this, ranging from 'does not conform
> to unicode spec' to 'it is what it is'
>
> Cheers, bob
>
> I wrote some code that will mirror unicode spec more closely when
> converting unicode code points to the different encodings and from the
> different encodings back to unicode code points. It is not really that
> complicated, aside from the checking for valid ranges of encoded results.
> The references for my process can be found here on pages 125-127
> https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf
>
>
>    utf32 128512
> 128512
>    displayutf32 128512
> 😀
>    utf32 55357 56832  NB. invalid codepoints
> |domain error: utf32
> |       utf32 55357 56832
>
>    utf32ucp 55357 56832 NB. invalid utf-32 encoding
> |domain error: utf32
> |       utf32ucp 55357 56832
>    utf32ucp 128512 NB. valid utf-32 encoding returns unicode code point
> 128512
>
>    utf16 128512
> 55357 56832
>    displayutf16 128512
> 😀
>    utf16 55357 56832  NB. still invalid codepoints
> |rank error: utf16
> |       utf16 55357 56832
>
>    utf16ucp 128512  NB. invalid utf-16 encoding
> |domain error: utf161
> |       utf16ucp 128512
>    utf16ucp 55357 56832  NB. valid utf-16 encoding returns unicode code point
> 128512
>
>    utf8 128512
> 240 159 152 128
>    displayutf8 128512
> 😀
>    utf8 55357 56832  NB. still invalid codepoints
> |rank error: utf8
> |       utf8 55357 56832
>    utf8ucp 128512  NB. invalid utf-8 encoding
> |domain error: utf81
> |       utf8ucp 128512
>    utf8ucp 240 159 152 128  NB. valid utf-8 encoding returns unicode code point
> 128512
>
>
> And here is the code (rough draft and could certainly be made more
> readable). Watch the word wrap on the longer lines.
>
> utf32 =: ]`[:`]`[:@.( 16bd7ff 16bdfff 16b10ffff & I.)"0  NB. accepts code points within code space returns utf-32
> displayutf32 =: 9 u: utf32
> utf32ucp =: utf32  NB.  accepts utf-32 code units - valid utf-32 code units will return a valid code point
>
> utf16 =: ]`[:`]`([: 3&u: 7&u:)`[: @.( 16bd7ff 16bdfff 16bffff 16b10ffff & I.) NB. accepts code points within code space returns utf-16
> displayutf16 =: 7 u: utf16
> utf161 =: [: ^: ((16bd7ff&< *. 16be000&>) +. 16bffff&<)  NB. rejects a lone surrogate or a value above 16bffff
> testutf162 =:((16bd800&<: *. 16bdbff&>:)@{. *. (16bdc00&<: *. 16bdfff&>:)@{:)
> utf162 =: [:`(#. @: ((_20&{. #: 16b10000) + _20&{.@,@:(6&}."1)@#:)) @. testutf162
> utf16ucp =: (utf161)`(utf162) @.( <:@#) NB.  accepts utf-16 code units - valid utf-16 code units will return a valid code point
>
> utf8 =: (3&u:@":@(9&u:))`[:`(3&u:@":@(9&u:))`[: @.( 16bd7ff 16bdfff 16b10ffff & I.)
> displayutf8 =: a. {~ utf8
> utf81 =: [: ^: (16b7f&<)
> testutf82 =:  ((16bc2&<: *. 16bdf&>:)@{. *. (16b80&<: *. 16bbf&>:)@{:)
> utf82 =: [:`(#. @: (_5&{.@#:@{. , _6&{.@#:@{:)) @. testutf82
> testutf83 =: ((16be0&= )@{. *. (16ba0&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16be1&<: *. 16bec&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bed&= )@{. *. (16b80&<: *. 16b9f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bee&<: *. 16bef&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)
> utf83 =: [:`(#. @: (_4&{.@#:@{. , _6&{.@#:@(1&{) ,_6&{.@#:@{:)) @. testutf83
> testutf84 =: ((16bf0&= )@{. *. (16b90&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bf1&<: *. 16bf3&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bf4&= )@{. *. (16b80&<: *. 16b8f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)
> utf84 =: [:`(#. @: (_3&{.@#:@{. , _6&{.@#:@(1&{) , _6&{.@#:@(2&{) ,_6&{.@#:@{:)) @. testutf84
> utf8ucp =: (utf81)`(utf82)`(utf83)`(utf84) @.( <:@#) NB.  accepts utf-8 code units - valid utf-8 code units will return a valid code point
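
For readers following the code above, the range test in utf32 (and utf8) rests on dyadic I. (interval index), which reports which interval each value falls into. A short illustration, with test values chosen here purely as examples:

```j
   16bd7ff 16bdfff 16b10ffff I. 65 16bd7ff 16bd800 16bdfff 16be000 128512 16b110000
0 0 1 1 2 2 3
```

Results 0 and 2 pick ] from the gerund (valid code points on either side of the surrogate gap), while 1 and 3 pick [: and so signal an error (surrogate range, or beyond 16b10ffff).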
>
>
> > On Sep 4, 2019, at 8:23 AM, 'robert therriault' via Programming <
> [email protected]> wrote:
> >
> >>>> On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]> wrote:
> >>>>
> >>>> The introductory page for Unicode
> >>>>
> >>>> https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
> >>>>
> >>>> does not discuss 4-byte characters, or the concept of surrogate pairs with 2-byte characters.
> >>>>
> >>>> 4-byte precision is called unicode4 in NuVoc.  If someone would add discussion of these to the page, they would be a Hero.  I'm just saying.
> >>>>
> >>>> Henry Rich
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>