Re: [Jprogramming] Writing help needed: surrogate pairs

bill lam Fri, 13 Sep 2019 20:04:39 -0700

Bob,

The unicode2 datatype has no concept of surrogates.

This is similar to the wchar_t C type: every 16-bit value is legal,
regardless of any surrogate pair interpretation.

Certainly we need to apply the surrogate pair interpretation when unicode2
is used as a utf16 encoding, and it is highly unlikely that unicode2 is
being used for anything other than utf16.
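
To make the distinction concrete, here is a minimal sketch in Python (not J,
only so that it runs anywhere; the helper names pack_units and
is_well_formed_utf16 are made up for this illustration). Storing arbitrary
16-bit values always succeeds, while a strict utf16 interpretation of the
same values rejects an isolated surrogate.

```python
import struct


def pack_units(units):
    """Store 16-bit values as little-endian bytes; no validity check is made."""
    return struct.pack("<%dH" % len(units), *units)


def is_well_formed_utf16(units):
    """Return True if the units decode under a strict UTF-16 interpretation."""
    try:
        pack_units(units).decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


pair = [0xD83D, 0xDE00]   # 55357 56832, the pair used elsewhere in this thread
lone = [0xD83D]           # an isolated high surrogate

print(is_well_formed_utf16(pair))   # True
print(is_well_formed_utf16(lone))   # False
```

Both lists are perfectly good arrays of 16-bit values; only the second call,
which imposes the utf16 reading, sees a problem.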

IMO the imperfect behavior of isolated surrogates is a deficiency of the
utf16 encoding itself. The same applies to the winapi, where the number of
characters is counted as the number of wchar_t units, not Unicode
codepoints, so isolated surrogates can happen there too. This makes handling
utf16 a bit difficult unless you can assume surrogate pairs never occur, eg
how would you truncate or reverse a utf16 string?
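
One possible answer, as a rough Python sketch rather than a recommendation:
group the code units first so that a high surrogate followed by a low
surrogate stays together, then reverse or truncate by group. The function
names are hypothetical, and an isolated surrogate is simply treated as a
group of its own, which is only one of several policies one could choose.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def group_units(units):
    """Split 16-bit code units into groups, keeping surrogate pairs together.

    Anything else, including an isolated surrogate, becomes its own group.
    """
    groups = []
    i = 0
    while i < len(units):
        if is_high(units[i]) and i + 1 < len(units) and is_low(units[i + 1]):
            groups.append(units[i:i + 2])
            i += 2
        else:
            groups.append(units[i:i + 1])
            i += 1
    return groups


def reverse_utf16(units):
    """Reverse by group, so each pair keeps its high-then-low order."""
    return [u for g in reversed(group_units(units)) for u in g]


def truncate_utf16(units, max_units):
    """Keep at most max_units code units without cutting a pair in half."""
    out = []
    for g in group_units(units):
        if len(out) + len(g) > max_units:
            break
        out.extend(g)
    return out


units = [0x41, 0xD83D, 0xDE00, 0x42]      # A, surrogate pair, B
print(list(reversed(units)))              # naive: [66, 56832, 55357, 65]
print(reverse_utf16(units))               # [66, 55357, 56832, 65]
print(truncate_utf16(units, 2))           # [65]
```

The naive reversal leaves a low surrogate in front of a high one, which is
exactly the isolated surrogate situation described above.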

On Sat, Sep 14, 2019, 8:42 AM 'robert therriault' via Programming <
[email protected]> wrote:

> Thanks Bill,
>
> I think that I can still safely describe the use of surrogate pairs in
> unicode and unicode4 though, because it seems that those encodings do
> follow the utf-16 approach, right?
>
> Also, it seems to me that the byte encoding does follow the utf-8
> approach, where the 128 ASCII characters (0 to 127) are just a special
> case of utf-8 encoding, or is that the wrong way to look at it?
>
>     8 u: 65
> A
>    3 u: 8 u: 65
> 65
>    65 { a.
> A
>     8 u: 162
> ¢
>    3 u: 8 u: 162
> 194 162
>    162 { a.
> �
>    194 162 { a.
> ¢
>
> Cheers, bob
>
> > On Sep 13, 2019, at 3:42 PM, bill lam <[email protected]> wrote:
> >
> > Bob,
> >
> > You are right about the range of valid unicode codepoint for characters.
> >
> > The J documentation perhaps didn't place enough emphasis on the difference
> > between the unicode2/unicode4 J datatypes and the utf16/utf32 encodings.
> >
> > Putting it more explicitly, the unicode2/unicode4 J datatypes have nothing
> > to do with the utf16/utf32 unicode encodings. unicode2 is a 16-bit wide
> > literal and all bit patterns are always valid; similarly for unicode4.
> > However this may make users more confused  😂
> >
> > Automatic type promotion from byte to unicode2 to unicode4 is always done
> > atom by atom.
> >
> >
> > On Sat, Sep 14, 2019, 1:08 AM 'robert therriault' via Programming <
> > [email protected]> wrote:
> >
> >> Sorry to take so long to come up with information on the surrogate pairs,
> >> Henry, but it is both more complicated and simpler than it all appears.
> >>
> >> First, the simple part. Unicode has a codespace of 0 to 16b10ffff with a
> >> gap from 16bd800 to 16bdfff (which is reserved for the surrogate pairs).
> >> This means that the only valid codepoints that can represent characters
> >> are 0 to 16bd7ff and 16be000 to 16b10ffff. The three different encoding
> >> schemes for Unicode are UTF-8, UTF-16 and UTF-32 (the number indicating
> >> the number of bits in the code unit).
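
The range check described in that paragraph can be sketched in a few lines
of Python (is_scalar_value is a made-up name for this illustration, and
Python is used only because it is easy to run):

```python
def is_scalar_value(cp):
    """True for 0..0xD7FF and 0xE000..0x10FFFF, i.e. outside the surrogate gap."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# Size of the codespace minus the 2048 surrogate code points.
valid_count = 0x110000 - (0xDFFF - 0xD800 + 1)

print(is_scalar_value(128512))   # True
print(is_scalar_value(55357))    # False, inside the surrogate gap
print(valid_count)               # 1112064
```
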
> >>
> >> UTF-32 has enough bits to represent all of the Unicode code points as
> >> single integers, but as mentioned above there are integers that do not
> >> represent valid codepoints (greater than 16b10ffff or in the surrogate
> >> pair gap). J's unicode4 seems to violate this by allowing surrogate pairs
> >> as valid unicode4.
> >>
> >>   9 u:  55357 56832 NB. A surrogate pair
> >> 😀
> >>   $ 9 u: 55357 56832  NB. UTF-32 is always one integer
> >> 2
> >>   3 u: 9 u: 55357 56832
> >> 55357 56832  NB. Keeps result as a surrogate pair
> >>   3!:0 [ 9 u:  55357 56832
> >> 262144  NB. Unicode4
> >>
> >>    9 u:  128512  NB. Proper UTF-32 for 😀
> >> 😀
> >>   $ 9 u: 128512 NB. result is an atom with empty shape
> >>
> >>   3!:0 [ 9 u:  128512  NB. Unicode4 type
> >> 262144
> >>
> >> So, why have surrogate pairs? That is where UTF-16 comes in. In order to
> >> cover the entire codespace up to 16b10ffff by using at most two code
> >> units, UTF-16 uses surrogate pairs, integers from 16bd800 to 16bdfff. The
> >> first integer of the pair is in the range from 16bd800 to 16bdbff and the
> >> second integer is in the range from 16bdc00 to 16bdfff. This encoding
> >> scheme provides confirmation that the surrogate pair is valid and allows
> >> a mapping to the code points from 16b10000 to 16b10ffff that would not
> >> normally be within reach of a single 16 bit code unit, but can be reached
> >> by two 16 bit code units.
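
The mapping itself is plain arithmetic: subtract 16b10000, put the top ten
bits on 16bd800 and the bottom ten bits on 16bdc00. A minimal Python sketch
of both directions follows; the function names are invented for this
example and are not part of J or of any library.

```python
def to_surrogates(cp):
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000                      # a 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high, low):
    """Combine a valid (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(to_surrogates(128512))          # (55357, 56832)
print(to_surrogates(0x10000))         # (55296, 56320)
print(from_surrogates(55357, 56832))  # 128512
```

The range checks are what make the pair self-validating: a first unit
outside 16bd800..16bdbff or a second unit outside 16bdc00..16bdfff cannot be
part of a well-formed pair.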
> >>
> >>   3 u: 7 u: 16bffff   NB. Top of range for one 16 bit code unit
> >> 65535
> >>   3 u: 7 u: 16b10000  NB. Maps to surrogate pairs
> >> 55296 56320
> >>      7 u: 128512
> >> 😀
> >>   3 u: 7 u: 128512
> >> 55357 56832
> >>   3!:0 [ 7 u:  128512  NB. Unicode type
> >> 131072
> >>
> >> To complete the encoding options, UTF-8 maps the code points using one
> >> to four 8 bit code units in a pretty clever way. If the first code unit
> >> is between 0 and 16b7f then the encoding uses only one code unit, and
> >> this establishes the use of 7-bit ASCII in UTF-8. If the first code unit
> >> is between 16bc2 and 16bdf then it is always a two code unit encoding and
> >> the second code unit must be within the range of 16b80 and 16bbf. Three
> >> code unit encodings are signalled by a first code unit from 16be0 to
> >> 16bef, and four code unit encodings always begin with a code unit between
> >> 16bf0 and 16bf4.
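
As a rough illustration of those first-unit ranges, here is a Python sketch
with two hypothetical helpers: utf8_length reads the sequence length off the
first code unit, and encode_utf8 builds the code units by hand from a code
point. It is a sketch of the scheme as described above, not of how J
implements 8 u: internally.

```python
def utf8_length(first):
    """Number of code units in a UTF-8 sequence, judged by its first unit."""
    if 0x00 <= first <= 0x7F:
        return 1
    if 0xC2 <= first <= 0xDF:
        return 2
    if 0xE0 <= first <= 0xEF:
        return 3
    if 0xF0 <= first <= 0xF4:
        return 4
    raise ValueError("not a valid first code unit")


def encode_utf8(cp):
    """Encode one code point as a list of UTF-8 code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid code point for encoding")
    if cp <= 0x7F:
        return [cp]
    if cp <= 0x7FF:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp <= 0xFFFF:
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


print(encode_utf8(295))      # [196, 167]
print(encode_utf8(3101))     # [224, 176, 157]
print(encode_utf8(128512))   # [240, 159, 152, 128]
```

Note that encode_utf8 refuses 55357 and 56832 outright instead of producing
three code units for each surrogate.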
> >>
> >>    8 u: 65
> >> A
> >>   3 u: 8 u: 65  NB. ASCII equivalent
> >> 65
> >>   3!:0 [ 8 u: 65  NB. literal
> >> 2
> >>   8 u: 295
> >> ħ
> >>   3 u: 8 u: 295
> >> 196 167
> >>   3!:0 [ 8 u: 295
> >> 2
> >>   8 u: 3101
> >> ఝ
> >>   3 u: 8 u: 3101
> >> 224 176 157
> >>   3!:0 [ 8 u: 3101
> >> 2
> >>   8 u: 128512
> >> 😀
> >>  3 u: 8 u: 128512
> >> 240 159 152 128
> >>   3!:0 [ 8 u: 128512  NB. Literal type
> >> 2
> >>
> >> Again problems arise when interpreting surrogate pairs, although in this
> >> case the result is ill-formed UTF-8 (each surrogate is encoded separately
> >> as three bytes) rather than being interpreted the way unicode4 does.
> >>   8 u: 55357 56832
> >> ������
> >>   3 u: 8 u: 55357 56832
> >> 237 160 189 237 184 128
> >>   3!:0 [ 8 u: 55357 56832
> >> 2
> >>
> >> So, where does this leave us? Well, we are kind of... sort of... doing
> >> unicode, but in the process of making things convenient, we have
> >> drifted from the actual unicode standard.
> >>
> >> I'll wait to hear your response before I revise the wiki, as there are a
> >> number of ways to go with explaining this, ranging from 'does not conform
> >> to unicode spec' to 'it is what it is'.
> >>
> >> Cheers, bob
> >>
> >> I wrote some code that will mirror unicode spec more closely when
> >> converting unicode code points to the different encodings and from the
> >> different encodings back to unicode code points. It is not really that
> >> complicated, aside from the checking for valid ranges of encoded results.
> >> The references for my process can be found here on pages 125-127:
> >> https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf
> >>
> >>
> >>   utf32 128512
> >> 128512
> >>   displayutf32 128512
> >> 😀
> >>   utf32 55357 56832  NB. invalid codepoints
> >> |domain error: utf32
> >> |       utf32 55357 56832
> >>
> >>   utf32ucp 55357 56832 NB. invalid utf-32 encoding
> >> |domain error: utf32
> >> |       utf32ucp 55357 56832
> >>   utf32ucp 128512 NB. valid utf-32 encoding returns unicode code point
> >> 128512
> >>
> >>   utf16 128512
> >> 55357 56832
> >>   displayutf16 128512
> >> 😀
> >>   utf16 55357 56832  NB. still invalid codepoints
> >> |rank error: utf16
> >> |       utf16 55357 56832
> >>
> >>   utf16ucp 128512  NB. invalid utf-16 encoding
> >> |domain error: utf161
> >> |       utf16ucp 128512
> >>   utf16ucp 55357 56832  NB. valid utf-16 encoding returns unicode code point
> >> 128512
> >>
> >>   utf8 128512
> >> 240 159 152 128
> >>   displayutf8 128512
> >> 😀
> >>   utf8 55357 56832  NB. still invalid codepoints
> >> |rank error: utf8
> >> |       utf8 55357 56832
> >>   utf8ucp 128512  NB. invalid utf-8 encoding
> >> |domain error: utf81
> >> |       utf8ucp 128512
> >>   utf8ucp 240 159 152 128  NB. valid utf-8 encoding returns unicode code point
> >> 128512
> >>
> >>
> >> And here is the code (rough draft and could certainly be made more
> >> readable) watch the word wrap on the longer lines.
> >>
> >> utf32 =: ]`[:`]`[:@.( 16bd7ff 16bdfff 16b10ffff & I.)"0  NB. accepts code points within code space returns utf-32
> >> displayutf32 =: 9 u: utf32
> >> utf32ucp =: utf32  NB.  accepts utf-32 code units - valid utf-32 code units will return a valid code point
> >>
> >> utf16 =: ]`[:`]`([: 3&u: 7&u:)`[: @.( 16bd7ff 16bdfff 16bffff 16b10ffff & I.) NB. accepts code points within code space returns utf-16
> >> displayutf16 =: 7 u: utf16
> >> utf161 =: [: ^: ((16bd7ff&< *. 16be000&>) +. 16bffff&<)  NB. error for a lone surrogate or a value above 16bffff
> >> testutf162 =:((16bd800&<: *. 16bdbff&>:)@{. *. (16bdc00&<: *. 16bdfff&>:)@{:)
> >> utf162 =: [:`(#. @: ((_20&{. #: 16b10000) + _20&{.@,@:(6&}."1)@#:)) @. testutf162
> >> utf16ucp =: (utf161)`(utf162) @.( <:@#) NB.  accepts utf-16 code units - valid utf-16 code units will return a valid code point
> >>
> >> utf8 =: (3&u:@":@(9&u:))`[:`(3&u:@":@(9&u:))`[: @.( 16bd7ff 16bdfff 16b10ffff & I.)
> >> displayutf8 =: a. {~ utf8
> >> utf81 =: [: ^: (16b7f&<)
> >> testutf82 =:  ((16bc2&<: *. 16bdf&>:)@{. *. (16b80&<: *. 16bbf&>:)@{:)
> >> utf82 =: [:`(#. @: (_5&{.@#:@{. , _6&{.@#:@{:)) @. testutf82
> >> testutf83 =: ((16be0&= )@{. *. (16ba0&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16be1&<: *. 16bec&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bed&= )@{. *. (16b80&<: *. 16b9f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bee&<: *. 16bef&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@{:)
> >> utf83 =: [:`(#. @: (_4&{.@#:@{. , _6&{.@#:@(1&{) ,_6&{.@#:@{:)) @. testutf83
> >> testutf84 =: ((16bf0&= )@{. *. (16b90&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bf1&<: *. 16bf3&>:)@{. *. (16b80&<: *. 16bbf&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)   +.   ((16bf4&= )@{. *. (16b80&<: *. 16b8f&>:)@(1&{) *. (16b80&<: *. 16bbf&>:)@(2&{) *. (16b80&<: *. 16bbf&>:)@{:)
> >> utf84 =: [:`(#. @: (_3&{.@#:@{. , _6&{.@#:@(1&{) , _6&{.@#:@(2&{) ,_6&{.@#:@{:)) @. testutf84
> >> utf8ucp =: (utf81)`(utf82)`(utf83)`(utf84) @.( <:@#) NB.  accepts utf-8 code units - valid utf-8 code units will return a valid code point
> >>
> >>
> >>> On Sep 4, 2019, at 8:23 AM, 'robert therriault' via Programming <
> >> [email protected]> wrote:
> >>>
> >>>>>> On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]>
> wrote:
> >>>>>>
> >>>>>> The introductory page for Unicode
> >>>>>>
> >>>>>> https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
> >>>>>>
> >>>>>> does not discuss 4-byte characters, or the concept of surrogate
> pairs
> >> with 2-byte characters.
> >>>>>>
> >>>>>> 4-byte precision is called unicode4 in NuVoc.  If someone would add
> >> discussion of these to the page, they would be a Hero.  I'm just saying.
> >>>>>>
> >>>>>> Henry Rich
> >>
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
> >>
