Re: [Jbeta] U8 and unicode

Don Guinn Wed, 29 May 2013 13:22:34 -0700

   'def',2 u: 'abc'

defabc


   'def',7 u: 'abc'

defabc

   3!:0 'de∑',2 u: 'abc'

131072

   'de∑',2 u: 'abc'

deâabc

   3 u: 'de∑',2 u: 'abc'

100 101 226 136 145 97 98 99

   'de∑',7 u: 'abc'

de∑abc

   2 u: 'de∑'

deâ

   7 u: 'de∑'

de∑


For plane ASCII 2&u: and 7&u: are the same. But when wchar and char with U8
are together 2&u: gives strange results. 7&u: works as one would expect.
Notice the sequence 226 136 145. That was a U8 code that has been
destroyed. I'm not asking that 2&u: be changed, just that u: monadic be
7&u: for character and , apply 7&u: to char before catenating with wchar.


True, J doesn't do + and - on characters, but it does do , . That is a
calculation that is being done incorrectly. As to breaking code. All that
code that depended on _128{a. being characters is obsolete. Line drawing
characters are either 16-26 in a. or unicode above U+255, not in _128{.a. .
Is any of it even around any more?


All I'm asking is that J be aware of the character type. Char containing U8
codes should not be blindly concatenated with wchar without conversion any
more than an integer should be blindly added to a floating point number
without conversion. As you say, we have U16 and we have full 32 bit Unicode
coming. All the more reason that J be aware of what needs to be done when
these types get mixed together. But I agree, UTFx is not really character.
It is for transmitting Unicode in a system independent manner and ASCII
compatibility. But its use internally in a computer has its problems.


Treat character data like numeric data. In J we deal with numbers, not
integers different from floating point different from complex. J takes care
of mixing the different ways numbers are represented internally. We are not
forced to worry about it.




On Wed, May 29, 2013 at 12:42 PM, Raul Miller <[email protected]> wrote:

> On Wed, May 29, 2013 at 1:30 PM, Don Guinn <[email protected]> wrote:
> > I don't think so. But maybe. The thing is that everything in J assumes
> that
> > literal (char) is U8 except monadic u: default for char and concatenating
> > literal with unicode (wchar).
>
> I do not think this is accurate.  For example:
>
>    'def',2 u: 'abc'
> defabc
>    3!:0 'abc'
> 2
>    3!:0]2 u: 'abc'
> 131072
>    3!:0 'def',2 u: 'abc'
> 131072
>
> The thing to realize is that J has a concept of a "character literal".
> Conceptually, this is a number which is treated as a character (we do
> not support arithmetic on literals, for better or worse).
> Interpretation of that number belongs to the context where that number
> is delivered.
>
> U8 is an example of something where J delivers a sequence of numbers
> to some external context.
>
> > Just make everything assume that literal is U8.
>
> Why should we make this assumption when it would break existing code?
>
> > I can't think of a case where one would want it otherwise if the char
> > were really text.
>
> Does this mean you cannot think of uses for u16 (wchar) or u32 (not
> yet implemented)? If so, should this lack of ability to think of uses
> be valid justification for breaking backwards compatibility?
>
> > If there is a case where one wants to copy the lower byte
> > and zero the upper byte to wchar, apply 2&u: . Does anybody now use
> > _128{.a. characters for anything other than U8? If not, such a change
> > should not affect anyone. char data which does not contain any U8 would
> not
> > be affected.
>
> It seems to me that box drawing characters are in _128{.a. and that
> they are not u8 characters.
>
> It also seems to me that we could support u16 (and maybe u32) unicode
> box drawing characters here.
>
> > Right now if one has both wchar and U8 in an application, care must be
> > taken to make sure that any char data that might contain U8 is run
> through
> > 7&u: before concatenating it to wchar.
>
> Yes.
>
> One issue here is that most u8 characters are not J characters, but
> are instead a sequence of J characters.
>
> > Optimization may convert wchar to char unbeknownst to the programmer.
> > Not now probably, but who knows in the future.
>
> This, then, should be a documentation issue.
>
> > If I combine an integer with a real I expect the integer to be converted
> to
> > real before combining.
>
> Yes. Note also that here you are converting 1 number to 1 number.
>
> > It is not necessary for me to convert the integer to
> > real. Why not have concatenation of char and wchar work the same way?
>
> That's how it works, IF (and only if) we are talking about individual
> characters. Translating multi-character sequences to individual
> characters necessarily violates some invariants, and so should not be
> done when the programmer does not explicitly call for it.
>
> > Like I showed with z,":z, where the result of ":z is U8 gave unexpected
> results.
>
> I do not understand this example. But I believe that if z is tagged as
> literal then (-: ":)z should be true. (In other words z and ": z
> should be identical when viewed from inside J.)
>
> > Before Unicode _128{.a. was needed for non-ASCII characters. Not any
> more.
> > Do away with the idea that literal and U8 are different.
>
> u16 is also literal. So you seem to be saying that u16 should not be
> different from u8.
>
> --
> Raul
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jbeta] U8 and unicode

Reply via email to