Re: [Haskell-cafe] Re: Roman Numerals and Haskell Syntax abuse

2004-07-07 Thread Kent Karlsson
 now 134 characters
 roman=(!6);n!a|n<1=""|n>=t=s!!a:(n-t)!a|c>=t=s!!(2*e):c!a|1>0=n!(a-1)where(d,m)=a`divMod`2;e=d+m-1;s="ivxlcdm";c=10^e+n;t=10^d*(1+4*m)


Gosh!

Anyway, you missed the roman symbols for 5000 (U+2181) and
10000 (U+2182)... ;-) The ones for 50000 and 100000 aren't in
Unicode yet, nor is the canopy used to write even larger
values (see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2738).
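
For anyone squinting at the 134-character one-liner, here is a readable
sketch of what it appears to compute: the usual walk down a value/symbol
table, with the subtractive forms spelled out.

    romanNumeral :: Int -> String
    romanNumeral n = go n pairs
      where
        pairs = [ (1000,"m"), (900,"cm"), (500,"d"), (400,"cd")
                , (100,"c"), (90,"xc"), (50,"l"), (40,"xl")
                , (10,"x"), (9,"ix"), (5,"v"), (4,"iv"), (1,"i") ]
        go k table@((v,s):rest)
          | k >= v    = s ++ go (k - v) table   -- emit symbol, subtract value
          | otherwise = go k rest               -- try the next smaller value
        go _ [] = ""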

   /Kent K


___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: Why are strings linked lists?

2003-12-08 Thread Kent Karlsson

 GHC 6.2 (shortly to be released) also supports toUpper, toLower, and
 the character predicates isUpper, isLower etc. on the full Unicode
 character set.
 
 There is one caveat: the implementation is based on the C library's
 towupper() and so on, so the support is only as good as the C library
 provides, and it relies on wchar_t being equivalent to Unicode (the
 sensible choice, but not all libcs do this).

Now, why would one want to base this on C's wchar_t and its
w routines? wchar_t is sometimes (isolated) UTF-32 code units,
including in Linux, sometimes it is (isolated) UTF-16 code units,
including in Windows, and sometimes something utterly useless.
The casing data is not reliable (it could be entirely wrong, and even
locale dependent in an erroneous way), nor kept up to date with the
Unicode character database in all implementations (even where
wchar_t is some form of Unicode/10646). wchar_t is best forgotten,
especially for portable programs.

Please instead use ICU's UChar32, which is (isolated) UTF-32, and
Unicode::isUpperCase(cp), Unicode::toUpperCase(cp) (C++ here),
etc. The ICU data is kept up-to-date with Unicode versions. The
case mappings are the simplistic ones, not taking SpecialCasing.txt
into account, just the UnicodeData.txt case mapping data. It is thus
not locale dependent, nor context dependent, and does not case-map
a character to more than one character (so it is not fully appropriate
for strings, but still much, much better than C's wchar_t and its
w-functions).
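
To make that limitation concrete in Haskell terms, a minimal sketch (the
result for 'ß' depends on which case data the implementation uses): a
per-character map can never produce the one-to-many mappings listed in
SpecialCasing.txt.

    import Data.Char (toUpper)

    -- Per-character case mapping: fine for most letters, but it cannot
    -- express 'ß' -> "SS", nor any context- or locale-dependent rule.
    upperSimple :: String -> String
    upperSimple = map toUpper

    main :: IO ()
    main = putStrLn (upperSimple "straße")
    -- likely prints "STRAßE" (simple mapping), not "STRASSE"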

 Proper support for character set conversions in the I/O library has
 been talked about for some time, and there are a couple of implementations

One can base this on the ICU character encoding conversions. I would
very much recommend that over the C locale dependent mb
conversion routines, for the same reasons as above.

/kent k

___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


RE: Language-Independent Arithmetic

2003-06-23 Thread Kent Karlsson


Alastair Reid wrote:
 On Sunday 22 June 2003 6:30 am, Ashley Yakeley wrote:
  From the Haskell 98 Report, sec. 6.4:
 
  The default floating point operations defined by
  the Haskell  Prelude do not  conform to current
  language independent arithmetic (LIA) standards.
  These standards require considerably more complexity
  in the numeric structure and have thus been
  relegated to a library.
 
  Is this true? Which library?
 
 If I recall correctly, the LIA standard requires control over 
 rounding modes, 

No, it does not.  But IEC 60559 (a.k.a. IEEE 754, or IEEE f.p.
arithmetic) does.

(Side remark: the quote above refers only to LIA-1.  LIA-2 (elementary
functions) is now done, and LIA-3 (complex integers and complex
floating point) is in the works.  All three LIA parts are relevant to
Haskell,
since Haskell includes elementary functions and complex floating point
values and operations.)

 requires that you provide several variants of each 
 comparison operation 
 which respond differently to +0,-0,infinity,NaN, etc. 

No it does not.  Nor does IEC 60559, while it informatively suggests
the possibility w.r.t NaNs (not w.r.t. signed zeroes or infinities); I
don't
think that suggestion has been picked up by anyone though.  The C
committee considered it, but apparently rejected it.

 I 
 think some of the 
 obvious type signatures would have to change too.

How?  I haven't been looking into this for quite a while, but I don't
recall any such problem.  There are a few missing constants, and a
few missing operations, though.



A major problem, however, is error handling.  While LIA allows a
"write an error message and terminate" policy, that's rarely the best way of
handling arithmetic errors.  The "recording of indicators" approach,
much like IEC 60559 default error handling, is problematic in
Haskell due to the hidden state.

 What would be needed to conform to LIA would be to add a 
 library providing all 
 the operations.   The default ops (i.e., the Prelude) would 
 still not conform 
 to LIA but that may not be such a big deal.

It is the intent for LIA-1 that most programming languages (and their
implementations) should be able to conform to LIA-1 without too much
trouble.  Even if it means terminate on error in a conforming mode of
operation.

/Kent Karlsson

(Current editor of the LIA series of standards.)


 
 --
 Alastair Reid

___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell


RE: gcd 0 0 = 0

2001-12-19 Thread Kent Karlsson


Let me try again:

greatest - maximum/supremum of a set of integers (plain everyday order)

common - intersection (plain everyday intersection of sets)

divisor (of an integer value v) -
an integer value m, such that v/m is defined and, if so, is an integer

factor (of an integer value v) -
an integer value m, such that there is an integer value n such that 
m*n=v

So (mock Haskell syntax; set expression really):

greatest_common_divisor a b = max (intersection {all divisors of a} {all 
divisors of b})


What is the supremum (result of max in the expression above) if a and b are both 0?
(You're allowed to use values not prescribed by Haskell to exist. ;-)

(You can replace divisors by factors in that expression and still get the same 
result.)
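
A direct, hopelessly inefficient, transcription of that set expression into
Haskell might look like the sketch below; it deliberately has no answer for
0 and 0, since the set of common divisors of 0 and 0 is all non-zero
integers and has no maximum.

    -- Greatest common divisor as "max of the intersection of divisor sets".
    gcdBySets :: Integer -> Integer -> Integer
    gcdBySets 0 0 = error "no greatest common divisor of 0 and 0"
    gcdBySets a b = maximum [ d | d <- [1 .. bound]
                            , a `mod` d == 0
                            , b `mod` d == 0 ]
      where bound = max (abs a) (abs b)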

I may agree that an operation *similar* to gcd, where 0,0 as argument
returns 0, is useful (maybe even more useful than gcd!).  But that operation
is still not the gcd (and might even return other results than gcd also for
other value pairs than 0,0; in particular negatives; depending on what is
found most useful).

If you want to replace gcd by some other, similar, operation, please go ahead.
But call it something else, because it is something else. If you want to generalise
that to polynomials or Gaussian integers (or at least imaginary integers, as opposed
to complex integers), fine (though not for the current standard Haskell library).
(Michael, I am interested in the Gaussian integer variety of this. If you like,
you can expand on what you said in an off-list message, or give me a reference.)

Kind (and somewhat fundamentalist) regards
/kent k


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Jan de Wit
 Sent: 19 December 2001 01:15
 To: [EMAIL PROTECTED]
 Subject: Re: gcd 0 0 = 0
 
 
 Why not define gcd a b as the largest (in 'normal' order) integer d
 such that the set of sums of multiples of a and b {na+mb | n <- Z, m <- Z}
 is equal to the set of multiples of d {nd | n <- Z}? Easy to understand,
 no talk of division, lattices, rings, ideals etcetera, and it covers
 the cases with 0.
 
 Cheers, Jan de Wit
 
 
 
 ___
 Haskell mailing list
 [EMAIL PROTECTED]
 http://www.haskell.org/mailman/listinfo/haskell

___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell



RE: gcd 0 0 = 0

2001-12-18 Thread Kent Karlsson


  "Simon" == Simon Peyton-Jones [EMAIL PROTECTED] writes:
 
 Simon> Christoph does not like this

I still don't like this.  0 has never divided, and will never divide, anything,
in particular not 0.  0 may be a prime factor of 0 (see also below!),
but that is different. It is not the greatest (in the ordinary sense)
divisor of 0.  Indeed, +infinity is a much larger divisor of 0...

I'm not in favour of using a very special-purpose order, not used for
anything else, and that isn't even an order but a preorder, just to
motivate gcd 0 0 = 0.  Even if using this very special-purpose preorder,
an infinity would be included in the 'top' equivalence class, and if we
pick a representative value on the basis of which is 'greater' in the
ordinary sense for integers augmented with infinities(!), then +infinity
should be the representative value.  Thus, in any case, gcd 0 0 = +infinity.

This is easy enough for Integer, where +infinity and -infinity can easily
be made representable (and should be made representable), but harder for
a 'pure hardware' Int datatype.  But in an ideal world, gcd 0 0 = +infinity
with no error or exception.

 It's OK if the definition is clear; it wasn't using
 the words positive or greatest integer.
 
 Stating gcd 0 0 = 0 explicitly is a good thing,
 even if it could be expressed verbatim;
 people may think about the mathematical background,
 but they should not need to think about the
 meaning of the definition.


 Anyway, I'm still against promoting 1 to a prime number :-)

Why?  If EVERY natural number is to have a  prime factorisation, then BOTH
0 AND 1 have to be promoted to prime numbers; otherwise 1 and 0 cannot be
prime factorised; in addition to that 1 is then a prime factor of any number
(that can be excluded from the *minimal* list of prime factors except for 1)...
There is no fundamental reason to except 1 from being a prime number.  But
there is a fundamental reason to say that 0 can never be a divisor (i.e. 0|0
is false; x|y is true iff x is a *non-zero* factor of y; the 'non-zero' part
is often left implicit (e.g. one is only talking about strictly positive
integers), which is part of the reason why we are having this discussion).

If you want something similar to gcd, but that returns 0 for 0 and 0, then
it is the 'product of all common prime factors'; where 1 has the (non-minimal)
prime factorisation [1, 1, ...], 0 has the (non-minimal) prime factorisation
[0, 1, 2, ...], and 1 is included at least once in the (non-minimal) prime
factorisation of any natural number. If you want a parallel to the divides
relation where 0 and 0 are related: 0 is a factor of 0.  A prime number
is a number that has no integer *between* 1 and itself as a factor. People
often say "except" instead of "between", but that does not work for 0, nor
for the non-minimal prime factorisations that people seem to be interested
in, given the interest in having gcd 0 0 = 0 (which isn't the gc*d*!). Again,
the context is often strictly positive integers, and 'between' and 'except'
are then equivalent.  For no apparent reason 1 is usually also excepted, but
that does not work for the prime factorisation of 1, nor for finding the
product of all common prime factors of 1 and another natural number... For
integers, -1 is also a prime number, and for imaginary integers, i is also
a prime number...  I'm sure somebody can give a nice definition of a partial
order (not just preorder) lattice with 1 as the min value and 0 as the max
value (just larger than the infinities), if you absolutely want a lattice
with a gcd-*like* meet and lcm-*like* join for this (the positive-bias
factor-of order).

I'd be happy to support such gcd-*like* (pcf?) and lcm-*like* functions, but
they aren't the gcd, nor the lcm (e.g. pcf (-1) (-1) = -1, not 1, etc.).
If you don't like adding these, then I suggest leaving things completely as
they are.  Squeezing two operations into one just because they have the
same results over the first quadrant is not something I find to be too good.

Odd one out?
/kent k


___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell



RE: GCD

2001-12-11 Thread Kent Karlsson


I don't think preorders of any kind should be involved here.
Just the ordinary order on integers. No divisibility preorder (I'm not
sure how that is even defined, so how it could be natural beats me), no
absolute value.

I find the unaltered text Simon quoted to be fine as is.

But for those who like to be more precise (forgive the TeXese):


% Most of you may wish to stop reading at this point.



% I is the set of integers representable in the integral datatype.
% result_I may return overflow or the argument, as appropriate.

\begin{example}\atab
  $gcd_I : I \times I \rightarrow I \cup \{\overflow, \infinitary\}$
\end{example}
\begin{example}\atab
  $gcd_I(x,y)$
\>$= result_I(\max\{v \in \ZZ ~~|~~ v|x $ and $ v|y\})$\\
\>\> \if $x,y \in I$ and ($x \neq 0$ or $y \neq 0$)\\
\>$= \infinitary(\posinf)$ \if $x = 0$ and $y = 0$
\end{example}

% There is no need to say v>0 above, since there are always positive values in that
% set, and max picks the largest/greatest one.  0 has all integer values except(!) 0
% as divisors. So for gcd 0 0 (maximum, supremum really, of the intersection of the two
% sets of divisors) the result is really positive infinity, which should be the result
% returned when representable (recommendable for Haskell's Integer datatype). gcd will
% overflow for instances like gcd (minBound::Int) (minBound::Int). 

\begin{example}\atab\\
  $lcm_I : I \times I \rightarrow I \cup \{\overflow\}$
\end{example}
\begin{example}\atab
  $lcm_I(x,y)$
\>$= result_I(\min\{v \in \ZZ ~~|~~ x|v $ and $ y|v $ and $ v > 0\})$\\
\>\> \if $x,y \in I$ and $x \neq 0$ and $y \neq 0$\\
\>$= 0$ \if $x,y \in I$ and ($x = 0$ or $y = 0$)
\end{example}

% the v>0 is needed here, since the set here would otherwise always contain
% infinitely many negative values, and then minimum of that...
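
A rough Haskell transcription of the two definitions above, using Maybe to
stand in for the infinitary notification (LIA itself leaves the notification
mechanism to the binding):

    gcdI :: Integer -> Integer -> Maybe Integer
    gcdI 0 0 = Nothing                 -- infinitary(+infinity)
    gcdI x y = Just (gcd x y)          -- the largest common divisor, positive

    lcmI :: Integer -> Integer -> Integer
    lcmI x y
      | x == 0 || y == 0 = 0
      | otherwise        = abs (x * y) `div` gcd x y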




Kind regards
/kent k



 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of S.M.Kahrs
 Sent: 11 December 2001 11:21
 To: [EMAIL PROTECTED]
 Subject: Re: GCD
 
 
 The natural reading of 'greatest' is, of course,
 the greatest in the divisibility preorder (it's a partial order
 on natural numbers but only a preorder on integers).
 Thus, gcd 0 0 = 0.
 
 3 and -3 are equivalent in that preorder.
 
 Thus, an additional comment may be in order.
 
 Stefan
 
 ___
 Haskell mailing list
 [EMAIL PROTECTED]
 http://www.haskell.org/mailman/listinfo/haskell

___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell



Re: Haskell 98 - Standard Prelude - Floating Class

2001-10-15 Thread Kent Karlsson


- Original Message - 
From: Jerzy Karczmarczuk [EMAIL PROTECTED]
...
 Simon Peyton-Jones:
  
  Russell O'Connor suggests:
 
  | but sinh and cosh can easily be defined in terms of exp
  |
  | sinh x = (exp(x) - exp(-x))/2
  | cosh x = (exp(x) + exp(-x))/2
 
  | I suggest removing sinh and cosh from the minimal complete
  | definition, and add the above defaults.
  
  This looks pretty reasonable to me.  We should have default methods
  for anything we can.
  
  Comments?
 
 Three.
 
 1. Actually, I wouldn't even call that default definitions. These ARE
definitions of sinh and cosh.

Mathematically, yes.  Numerically, no.  Even if 'exp' is implemented
with high accuracy, the suggested defaults may return a very inaccurate
(in ulps) result.  Take sinh near zero.  sinh(x) with x very close to 0 should
return x.  With the above 'default', sinh(x) will return exactly 0 for a relatively
wide interval around 0, which is the wrong result except for 0 itself.

In general, this is why LIA-2 (Language Independent Arithmetic, part 2,
Elementary numerical functions, ISO/IEC 10967-2:2001) rarely attempts to
define one numerical operation in terms of other numerical operations.  That
is done only when the relationship is exact (even if the operations themselves
are inexact).  That is not the case for the abovementioned operations. But
it is the case for the relationship between the complex sin operation and the
complex sinh operation, for instance. (Complex will be covered by LIA-3.)
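
A small Double-precision illustration of that accuracy point (a sketch; a
careful implementation would use something like expm1 rather than a Taylor
cut-off):

    sinhDefault :: Double -> Double
    sinhDefault x = (exp x - exp (-x)) / 2      -- the proposed default

    sinhCareful :: Double -> Double
    sinhCareful x
      | abs x < 1e-8 = x + x*x*x/6              -- sinh x ~= x for tiny x
      | otherwise    = (exp x - exp (-x)) / 2

    main :: IO ()
    main = do
      print (sinhDefault 1e-20)   -- 0.0: exp x and exp (-x) both round to 1
      print (sinhCareful 1e-20)   -- 1.0e-20, the correctly rounded result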

Kind regards
/Kent Karlsson



___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell



Re: Unicode support

2001-10-09 Thread Kent Karlsson

Just to clear up any misunderstanding:

- Original Message -
From: Ashley Yakeley [EMAIL PROTECTED]
To: Haskell List [EMAIL PROTECTED]
Sent: Monday, October 01, 2001 12:36 AM
Subject: Re: Unicode support


 At 2001-09-30 07:29, Marcin 'Qrczak' Kowalczyk wrote:

 Some time ago the Unicode Consortium slowly began switching to the
 point of view that abstract characters are denoted by numbers in the
 range U+0000..U+10FFFF.

 It's worth mentioning that these are 'codepoints', not 'characters'.

Yes, but characters are allocated to code points (or rather code positions).

 Sometimes a character will be made up of two codepoints, for instance an
 'a' with a dot above is a single character that can be made from the
 codepoints LATIN SMALL LETTER A and COMBINING DOT ABOVE.

Well, those ARE characters, which together form a GRAPHEME (which is
what Joe User would consider to be a character). Those two happen to
'combine' in NFC to LATIN SMALL LETTER A WITH DOT ABOVE.
But that is just that example. LATIN SMALL LETTER R and COMBINING
SHORT STROKE OVERLAY (yes, this is used in some places, but will never get
a precomposed character) are left as is also for NFC. Both of these examples,
for either normal form, MAY each be handled by one (ligature, if you like) glyph or
by two (overlaid) glyphs by a font.

Further, some code points are permanently reserved for UTF-16 surrogates,
some are permanently reserved as non-characters(!), some are for
private use (which can be used for things not yet formally encoded,
or things that never will be encoded) and quite a lot are reserved for
future standardisation.

The 8, 16, or 32-bit units in the encoding forms are called 'code units'.
E.g. Java's 'char' type is for UTF-16 code units, not characters!
Though a single UTF-16 code unit can represent a character in the BMP
(if that code position has a character allocated to it). In many cases, but
definitely not all, a single character, in its string context, is a grapheme too.

In summary:

code position (=code point): a value between 0000 and 10FFFF.

code unit: a fixed bit-width value used in one of the encoding forms
(often called char in programming languages).

character: hard to give a proper definition (the 10646 one does not
say anything), but in brief roughly a thing deemed worthy of being
added to the repertoire of 10646.

grapheme: a sequence of one or more characters that naïve users
think of as a character (may be language dependent).

glyph: a piece of graphic that may image part of, a whole, or several
characters in context.  It is highly font dependent how the exact mapping
from characters to positioned glyphs is done.  (The partitioning into
subglyphs, if done, need not be tied to Unicode decomposition.)
For most scripts, including Latin, this mapping is rather complex
(and is yet to be implemented in full).

 Perhaps this
 makes the UTF-16 'surrogate' problem a bit less serious, since there
 never was a one-to-one correspondence between any kind of n-bit unit and
 displayed characters.

With that I agree.

Kind regards
/kent k



 --
 Ashley Yakeley, Seattle WA



___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell



Re: Unicode support

2001-10-09 Thread Kent Karlsson


- Original Message -
From: Ashley Yakeley [EMAIL PROTECTED]
To: Kent Karlsson [EMAIL PROTECTED]; Haskell List [EMAIL PROTECTED]; 
Libraries for Haskell List
[EMAIL PROTECTED]
Sent: Tuesday, October 09, 2001 12:27 PM
Subject: Re: Unicode support


 At 2001-10-09 02:58, Kent Karlsson wrote:

 In summary:
 
 code position (=code point): a value between 0000 and 10FFFF.

 Would this be a reasonable basis for Haskell's 'Char' type?

Yes.  It's essentially UTF-32, but without the fixation to 32-bit
(21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited
to 10FFFF instead of 31(!) bits) is the datatype used in some
implementations of C for wchar_t.  As I said in another e-mail,
if one does not have high efficiency concerns, UTF-32 is a rather
straightforward way of representing characters.

 At some point
 perhaps there should be a 'Unicode' standard library for Haskell. For
 instance:

 encodeUTF8 :: String -> [Word8];
 decodeUTF8 :: [Word8] -> Maybe String;
 encodeUTF16 :: String -> [Word16];
 decodeUTF16 :: [Word16] -> Maybe String;

 data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
 getGeneralCategory :: Char -> Maybe GeneralCategory;

There is no real need for a Maybe there.  Yet unallocated code
positions have general category Cn (so do non-characters):
  Cs Other, Surrogate
  Co Other, Private Use
  Cn Other, Not Assigned (yet)


 ...sorting & searching...

 ...canonicalisation...

 etc. Lots of work for someone.

Yes.  And it is lots of work (which is why I'm not volunteering
to make a quick fix: there is no quick fix).
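
As one small starting point for the encodeUTF8 item on that wish list, a
minimal sketch (assuming Char covers code points up to 0x10FFFF; surrogate
code points are not rejected here):

    import Data.Bits (shiftR, (.&.), (.|.))
    import Data.Char (ord)
    import Data.Word (Word8)

    encodeUTF8 :: String -> [Word8]
    encodeUTF8 = concatMap enc
      where
        enc c
          | n < 0x80    = [fromIntegral n]
          | n < 0x800   = [0xC0 .|. hi 6, cont 0]
          | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
          | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
          where
            n      = ord c
            hi s   = fromIntegral (n `shiftR` s)    -- leading-byte payload
            cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)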

Kind regards
/kent k



___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell



Re: Unicode support

2001-10-09 Thread Kent Karlsson


- Original Message -
From: Ketil Malde [EMAIL PROTECTED]
...
  for a long time. 16 bit unicode should be gotten rid of, being the worst
  of both worlds, non backwards compatible with ascii, endianness issues
  and no constant length encoding. utf8 externally and utf32 when
  working with individual characters is the way to go.

  I totally agree with you.

  Now, what are your technical arguments for this position?
  (B.t.w., UTF-16 isn't going to go away, it's very firmly established.)

 What's wrong with the ones already mentioned?

 You have endianness issues, and you need to explicitly type text files
 or insert BOMs.

You have to distinguish between the encoding form (what you use internally)
and encoding scheme (externally).  For the encoding form, there is no endian
issue, just like there is no endian issue for int internally in your program.
For the encoding form there is no BOM either (or rather, it should have been
removed upon reading, if the data is taken in from an external source).

But I agree that the BOM (for all of the Unicode encoding schemes) and
the byte order issue (for the non-UTF-8 encoding schemes; the external ones)
are a pain.  But as I said: they will not go away now, they are too firmly established.
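
As a small illustration on the encoding-scheme side, a reader of the UTF-16
encoding scheme might consume the BOM roughly like this (a sketch; it
defaults to big-endian when no BOM is present):

    import Data.Word (Word8)

    data ByteOrder = BigEndian | LittleEndian

    detectBom :: [Word8] -> (ByteOrder, [Word8])
    detectBom (0xFE:0xFF:rest) = (BigEndian, rest)      -- UTF-16BE BOM
    detectBom (0xFF:0xFE:rest) = (LittleEndian, rest)   -- UTF-16LE BOM
    detectBom bytes            = (BigEndian, bytes)     -- no BOM: assume BE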

 An UTF-8 stream limited to 7-bit ASCII simply is that ASCII stream.

Which is a large portion of the raison d'être for UTF-8.

 When not limited to ASCII, at least it avoids zero bytes and other
 potential problems.  UTF-16 will among other things, be full of
 NULLs.

Yes, and so what?

So will a file filled with image data, video clips, or plainly a list of raw
integers dumped to file (not formatted as strings).  I know, many old
utility programs choke on NULL bytes, but that's not Unicode's fault.
Further, NULL (as a character) is a perfectly valid character code.
Always was.

 I can understand UCS-2 looking attractive when it looked like a
 fixed-length encoding, but that no longer applies.

  So it is not surprising that most people involved do not consider
  UTF-16 a bad idea.  The extra complexity is minimal, and further
  surfaces rarely.

 But it needs to be there.  It will introduce larger programs, more
 bugs

True.  But implementing normalisation, or case mapping for that matter,
is non-trivial too.  In practice, the additional complexity with UTF-16 seems small.


 , lower efficiency.

Debatable.

  BMP characters are still (relatively) easy to process, and it saves
  memory space and cache misses when large amounts of text data
  is processed (e.g. databases).

 I couldn't find anything about the relative efficiencies of UTF-8 and
 UTF-16 on various languages.  Do you have any pointers?  From a
 Scandinavian POV, (using ASCII plus a handful of extra characters)
 UTF-8 should be a big win, but I'm sure there are counter examples.

So, how big is our personal hard disk now? 3GiB? 10GiB? How many images,
mp3 files and video clips do you have?  (I'm sorry, but your argument here
is getting old and stale.  Very few worry about that aspect anymore. Except
when it comes to databases stored in RAM and UTF-16 vs. UTF-32 which
is guaranteed to be wasteful.)


Kind regards
/kent k





___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe



Re: Unicode

2001-10-08 Thread Kent Karlsson


- Original Message -
From: Ketil Malde [EMAIL PROTECTED]
To: Dylan Thurston [EMAIL PROTECTED]
Cc: Andrew J Bromage [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
[EMAIL PROTECTED]
Sent: Monday, October 08, 2001 9:02 AM
Subject: Re: UniCode

(The spelling is 'Unicode' (and none other).)

 Dylan Thurston [EMAIL PROTECTED] writes:

  Right.  In Unicode, the concept of a character is not really so
  useful;

 After reading a bit about it, I'm certainly confused.
 Unicode/ISO-10646 contains a lot of things that aren't really one
 character, e.g. ligatures.

The ligatures that are included are there for compatibility with older
character encodings.  Normally, for modern technology..., ligatures
are (to be) formed automatically through the font.  OpenType (OT,
MS and Adobe) and AAT (Apple) have support for this. There are
often requests to add more ligatures to 10646/Unicode, but they are
rejected since 10646/Unicode encode characters, not glyphs. (With
two well-known exceptions: for compatibility, and certain dingbats.)

  most functions that traditionally operate on characters (e.g.,
  uppercase or display-width) fundamentally need to operate on strings.
  (This is due to properties of particular languages, not any design
  flaw of Unicode.)

 I think an argument could be put forward that Unicode is trying to be
 more than just a character set.  At least at first glance, it seems to

Yes, but:

 try to be both a character set and a glyph map, and incorporate things

not that. See above.

 like transliteration between character sets (or subsets, now that
 Unicode contains them all), directionality of script, and so on.

Unicode (but not 10646) does handle bidirectionality
(see UAX 9: http://www.unicode.org/unicode/reports/tr9/), but not transliteration.
(Transliteration is handled in IBM's ICU, though: 
http://www-124.ibm.com/developerworks/oss/icu4j/index.html)


toUpper, toLower - Not OK.  There are cases where upper casing a
   character yields two characters.

 I thought title case was supposed to handle this.  I'm probably
 confused, though.

The titlecase characters in Unicode are (essentially) only there
for compatibility reasons (originally for transliterating between
certain subsets of Cyrillic and Latin scripts in a 1-1 way).  You're
not supposed to really use them...

The cases where toUpper of a single character gives two characters
are for some (classical) Greek, where a built-in subscript iota turns into
a capital iota, and other cases where there is no corresponding
uppercase letter.

It is also the case that case mapping is context sensitive.  E.g.
mapping capital sigma to small sigma (mostly) or ς (small final sigma)
(at end of word), or the capital i to ı (small dotless i), if Turkish, or insert/
delete combining dot above for i and j in Lithuanian. See UTR 21
and http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt.
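
To give a flavour of that context sensitivity, a toy sketch of just the
final-sigma rule (the real rule in SpecialCasing.txt also looks at what
precedes the sigma, and the Turkish and Lithuanian cases need locale
information on top):

    import Data.Char (isLetter, toLower)

    lowerGreek :: String -> String
    lowerGreek [] = []
    lowerGreek ('\x03A3':rest)                       -- GREEK CAPITAL LETTER SIGMA
      | wordEnds rest = '\x03C2' : lowerGreek rest   -- final small sigma
      | otherwise     = '\x03C3' : lowerGreek rest   -- medial small sigma
      where wordEnds s = null s || not (isLetter (head s))
    lowerGreek (c:rest) = toLower c : lowerGreek rest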


  etc.  Any program using this library is bound to get confused on
  Unicode strings.  Even before Unicode, there is much functionality
  missing; for instance, I don't see any way to compare strings using
  a localized order.

 And you can't really use list functions like length on strings,
 since one item can be two characters (Lj, ij, fi) and several items
 can compose one character (combining characters).

Depends on what you mean by "length" and "character"...
You seem to be after what is sometimes referred to as grapheme,
and counting those.  There is a proposal for a definition of
language independent grapheme (with lexical syntax), but I don't
think it is stable yet.

 And map (==) can't compare two Strings since, e.g. in the presence
 of combining characters.  How are other systems handling this?

I guess it is not very systematic.  Java and XML make the comparisons
directly by equality of the 'raw' characters *when* comparing identifiers/similar,
though for XML there is a proposal for early normalisation essentially to
NFC (normal form C).  I would have preferred comparing the normal forms
of the identifiers instead.  For searches, the recommendation (though I doubt
in practice yet) is to use a collation key based comparison. (Note that collation
keys are usually language dependent. More about collation in UTS 10,
http://www.unicode.org/unicode/reports/tr10/, and ISO/IEC 14651.)

What does NOT make sense is to expose (to a user) the raw ordering (<)
of Unicode strings, though it may be useful internally.  Orders exposed to
people (or other systems, for that matter) that aren't concerned with the
inner workings of a program should always be collation based.  (But that
holds for any character encoding, it's just more apparent for Unicode.)

 It may be that Unicode isn't flawed, but it's certainly extremely
 complex.  I guess I'll have to delve a bit deeper into it.

It's complex, but that is because the scripts of the world are complex (and add
to that politics, as well as compatibility and implementation issues).

Kind regards
/kent k




Re: Unicode support

2001-10-08 Thread Kent Karlsson


- Original Message -
From: Wolfgang Jeltsch [EMAIL PROTECTED]
To: The Haskell Mailing List [EMAIL PROTECTED]
Sent: Thursday, October 04, 2001 8:47 PM
Subject: Re: Unicode support


 On Sunday, 30 September 2001 20:01, John Meacham wrote:
  sorry for the me too post, but this has been a major pet peeve of mine
  for a long time. 16 bit unicode should be gotten rid of, being the worst
  of both worlds, non backwards compatible with ascii, endianness issues
  and no constant length encoding. utf8 externally and utf32 when
  working with individual characters is the way to go.

 I totally agree with you.

Now, what are your technical arguments for this position?
(B.t.w., UTF-16 isn't going to go away, it's very firmly established.)

From what I've seen, those who take the position you seem to
prefer, are people not very involved with Unicode and its implementation.
Whereas people that are so involved strongly prefer UTF-16.

Note that nearly no string operation of interest (and excepting low level
stuff, like buffer sizes, and copying) can be done on a string looking
at individual characters only.  Just about the only thing that sensibly
can be done on isolated characters is property interrogation.You
can't do case mapping of a string (involving Greek or Lithuanian text)
without being sensitive to the context of each character.  And, as
somebody already noted, combining characters have to be taken
into account. E.g. Å (U+212B (deprecated), or U+00C5) must
collate the same as U+0041,U+030A, even when not collating
them among the A's (U+0041).

So it is not surprising that most people involved do not consider
UTF-16 a bad idea.  The extra complexity is minimal, and further
surfaces rarely.  Indeed they think UTF-16 is a good idea since the
supplementary characters will in most cases occur very rarely,
BMP characters are still (relatively) easy to process, and it saves
memory space and cache misses when large amounts of text data
is processed (e.g. databases).

On the other hand, Haskell implementations are probably still
rather wasteful when representing strings, and Haskell isn't used to hold
large databases, so going to UTF-32 is not a big deal for Haskell,
I guess. (Though I don't think that will happen for Java.)

  seeing as how the haskell standard is horribly vague when it comes to
  character set encodings anyway, I would recommend that we just omit any
  reference to the bit size of Char, and just say abstractly that each
  Char represents one unicode character, but the entire range of unicode
  is not guaranteed to be expressible, which must be true, since haskell
  98 implementations can be written now, but unicode can change in the
  future. The only range guaranteed to be expressible in any
  representation are the values 0-127 US ASCII (or perhaps latin1)

 This sounds also very good.

Why?  This is the approach taken by programming languages like C,
where the character encoding *at runtime* (both for char and wchar_t)
is essentially unknown.  This, of course, leads to all sorts of trouble,
which some try to mitigate by *suggesting* to have all sorts of locale
independent stuff in (POSIX) locales. Nobody has worked out any
sufficiently comprehensive set of data for this though, and nobody ever
will, both because it is openended and because nobody is really trying.
Furthermore, this is not the approach of Java, Ada, or Haskell.  And it is
not the approach advocated by people involved with inplementing
support for Unicode (and other things related to internationalisation
and localisation). Even C is (slowly) leaving that approach, having
introduced the __STDC_ISO_10646__ property macro (with it's semantics),
and the \u and \U 'universal character names.

Kind regards
/kent k



___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell



Re: Unicode support

2001-10-08 Thread Kent Karlsson


- Original Message -
From: Dylan Thurston [EMAIL PROTECTED]
To: John Meacham [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Friday, October 05, 2001 5:47 PM
Subject: Re: Unicode support


 On Sun, Sep 30, 2001 at 11:01:38AM -0700, John Meacham wrote:
  seeing as how the haskell standard is horribly vague when it comes to
  character set encodings anyway, I would recommend that we just omit any
  reference to the bit size of Char, and just say abstractly that each
  Char represents one unicode character, but the entire range of unicode
  is not guaranteed to be expressible, which must be true, since haskell
  98 implementations can be written now, but unicode can change in the
  future. The only range guaranteed to be expressible in any
  representation are the values 0-127 US ASCII (or perhaps latin1)

 I agree about the vagueness, but I believe the Unicode consortium has
 explicitly limited itself to 21 bits; if they turn out to have been

In some sense yes, but not quite.  It's better to say that the code space
is from 0000 to 10FFFF, then the encoding forms handle the bits.

 lying about that (which seems unlikely in this millenium), we can

The guesstimate (originally) of less than half a million things to encode
as characters has been stable for over a decade. Even though some
try to argue that Unicode had to go from 16-bit to more to be able
to handle more characters, that was really known from the beginning.
There was a big bump recently, adding 41000 Hàn characters that had
been collected over a long time; though some more Hàn are expected,
no such big bump is expected again.  If you're interested, it's gone beyond a guesstimate
now, see the roadmap:
http://www.evertype.com/standards/iso10646/ucs-roadmap.html
(the official version is at the DKUUG site, but the reference is through a
cryptic document number).  You will see how plane 1 is planned for
a number of historical scripts (mostly). Disregarding the private use
planes (15 and 16) there is nothing planned for planes 3-14, except for
some crap in 14 (what is there is there for political reasons only, DO NOT
USE), and that plane 2 may spill over into plane 3. That leaves ten planes
(of 64K code positions each) completely empty, with nothing planned for them.

Kind regards
/kent k


 hardly be blamed for believing them.  I think all that should be
 required of implementations is that they support 21 bits.

 Best,
 Dylan Thurston

 ___
 Haskell mailing list
 [EMAIL PROTECTED]
 http://www.haskell.org/mailman/listinfo/haskell


___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell



Re: Unicode

2001-10-08 Thread Kent Karlsson


- Original Message -
From: Dylan Thurston [EMAIL PROTECTED]
To: Andrew J Bromage [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Friday, October 05, 2001 6:00 PM
Subject: Re: UniCode


 On Fri, Oct 05, 2001 at 11:23:50PM +1000, Andrew J Bromage wrote:
  G'day all.
 
  On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:
 
   Why Char is 32 bit. UniCode characters is 16 bit.
 
  It's not quite as simple as that.  There is a set of one million
  (more correctly, 1M) Unicode characters which are only accessible
  using surrogate pairs (i.e. two UTF-16 codes).  There are currently
  none of these codes assigned, and when they are, they'll be extremely
  rare.  So rare, in fact, that the cost of strings taking up twice the
  space that they currently do simply isn't worth the cost.

 This is no longer true, as of Unicode 3.1.  Almost half of all
 characters currently assigned are outside of the BMP (i.e., require
 surrogate pairs in the UTF-16 encoding), including many Chinese
 characters.  In current usage, these characters probably occur mainly
 in names, and are rare, but obviously important for the people
 involved.

In plane 2 (one of the supplementary planes) there are about 41000
Hàn characters, in addition to the about 27000 Hàn characters
in the BMP.  And more are expected to be encoded.  However,
IIRC, only about 6000-7000 of them are in modern use.

I don't really want to push for them (since I think they are a major design
mistake), but some people like them: the mathematical alphanumerical
characters in plane 1.  There are also the more likable (IMHO)
musical characters in plane 1 (western, though that attribute was
removed, and Byzantine!). (You cannot set a musical score in
Unicode plain text, it just encodes the characters that you can use IN
a musical score.)

...
   isAscii, isLatin1 - OK
Yes, but why do (or, rather, did) you want them; isLatin1 in particular?
Then what about isCP1252 (THE most common encoding today),
isShiftJis, etc., for several hundred encodings? (I'm not proposing to
remove isAscii, but isLatin1 is dubious.)

   isControl - I don't know about this.
Why do (did) you want it? There are several kinds of control characters
in Unicode: the traditional C0 and (less used) C1 ones, format control
characters (NO, they do NOT control FORMATTING, though they do control
FORMAT, like cursive connections), ...

   isPrint - Dubious.  Is a non-spacing accent a printable character?
A combining character is most definitely printable. (There is a difference
between non-spacing and combining, even though many combining
characters are non-spacing, not all of them are.)

   isSpace - OK, by the comment in the report: The isSpace function
 recognizes only white characters in the Latin-1 range.
Sigh. There are several others, most importantly: LINE SEPARATOR,
PARAGRAPH SEPARATOR, and IDEOGRAPHIC SPACE.  And the
NEL in the C1 range.

   isUpper, isLower - Maybe OK.
This is property interrogation. There are many other properties of interest.

   toUpper, toLower - Not OK.  There are cases where upper casing a
  character yields two characters.
See my other e-mail.

 etc.  Any program using this library is bound to get confused on
 Unicode strings.  Even before Unicode, there is much functionality
 missing; for instance, I don't see any way to compare strings using
 a localized order.

 Is anyone working on honest support for Unicode, in the form of a real
 Unicode library with an interface at the correct level?

Well, IBM's ICU, for one, ...  But they only do it for C/C++/Java, not for Haskell...

Kind regards
/kent k



___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe



SV: Haskell 1.4 and Unicode

1997-11-10 Thread Kent Karlsson [EMAIL PROTECTED]

Hi!

1.  I don't seem to get my messages to this list
echoed back to me...  (Which I consider a bug.)

2.  As I tried to explain in detail in my previous message, 
(later) options 1 and 2 **do not make any sense**.  
Option 3 makes at least some sense, even though it
has some problems.  You could generalize option 4
to make sense too.
The layout rule does not generalise well.  I still
think that one should not give up entirely on it.  One
   way may be to require that "where", and other layout
   starters, are to have only spaces (U+0020),
   no-break spaces (U+00A0) and tabs (U+0009) in
   front of them on the same line, keeping the width
   rule for the tabs relative to the spaces.  (I know,
   present Haskell programs are not written that way.)

3. (In reply to Hans Aberg (Åberg?))
  The easiest way of thinking of Unicode is perhaps as a font
 encoding; a font using this encoding would add such things as typeface
 family, style, size, kerning (but Unicode probably does not have
 ligatures), etc., which

   As everyone (getting) familiar with Unicode should
   know, Unicode is **NOT** a font encoding.
   It is a CHARACTER encoding.  The difference
   shows up mostly for 'complex scripts', such as Arabic
   and Devanagari (used for Hindi), but also in the processing
   of combining characters for 'latin'.  Glyph (at a "font point")
   selection is based also on *neighbouring* characters.

   Unicode does have a number of compatibility characters,
   but the explicit intent is that they should only be used
   for backwards compatibility reasons.

/kent k

PS
B.t.w. Did you know...  that CR and LF should not be used
in "newly produced" Unicode texts.  One should use Line
Separator (U+2028) and Paragraph Separator (U+2029)
instead.  Line Separator is the one expected to be used
in program source files.




 -Original Message-
 From: John C. Peterson [SMTP:[EMAIL PROTECTED]]
 Sent: 8 November 1997 03:25
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: Haskell 1.4 and Unicode
 
 I had option 1 in mind when that part of the report was written.  We
 should clarify this in the next revision.
 
 And thanks for your analysis of the problem!
 
John
 
 






SV: Haskell 1.4 and Unicode

1997-11-10 Thread Kent Karlsson [EMAIL PROTECTED]

Let me reiterate:

Unicode is ***NOT*** a glyph encoding!

Unicode is ***NOT*** a glyph encoding!

and never will be.  The same character can be displayed as
a variety of glyphs, depending not only on the font/style,
but also, and this is the important point, on the characters
surrounding a particular instance of the character.  Also,
a sequence of characters can be displayed as a single glyph,
and a character can be displayed as a sequence of glyphs.
Which will be the case, is often font dependent.

This is not something unique to Unicode.  It is
just that most people are used to ASCII, Latin-1 and similar,
where the distinction between characters and glyphs is
blurred.

I would be interested in knowing why you think
"the idea of it as a character encoding thoroughly
breaks down in a mathematical context".  Deciding
what gets encoded as a character is more an
international social process than a mathematical
process...

/kent k

PS This may be getting too much into Unicode
to fit for the Haskell list...  In particular any argumentation
regarding the last paragraph above should *not* be sent to
the Haskell list, but could be sent to me personally.

PPS I don't know what you mean by "semantics of glyphs".

Hans Aberg wrote:
   I leave it to the experts to figure out what exactly Unicode is. I
 can
 only note that the idea of it as a character encoding thoroughly
 breaks
 down in a mathematical context. I think the safest thing is to only
 regard
 it as a set of glyphs, which are better, because ampler, than other
 encodings. I think figuring out the exact involved semantics of those
 glyphs is a highly complex issue which cannot fully be resolved.
 





Re: Haskell 1.4 and Unicode

1997-11-07 Thread Kent Karlsson

Carl R. Witty wrote:

 1) I assume that layout processing occurs after Unicode preprocessing;
 otherwise, you can't even find the lexemes.  If so, are all Unicode
 characters assumed to be the same width?

Unicode characters ***cannot in any way*** be considered as being of
the same display width.  Many characters have intrinsic width properties,
like "halfwidth Katakana", "fullwidth ASCII", "ideographic space",
"thin space", "zero width space", and so on (most of which are
compatibility characters, i.e. present only for conversion reasons).
But more importantly there are combining characters which "modify"
a "base character". For instance A (A with ring above) can be given
as an A followed by a combining ring above, i.e. two Unicode characters.
(For this and many others there is also a 'precomposed' character.) 
For many scripts vowels are combining characters.  And there may be an
indefinitely long (in principle, but three is a lot) sequence of
combining characters after each non-combining character.
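
A two-line Haskell illustration of that last point: the precomposed and the
combining-sequence spellings render identically but are different lists of
Chars, so a raw comparison tells them apart.

    precomposed, combining :: String
    precomposed = "\x00C5"       -- LATIN CAPITAL LETTER A WITH RING ABOVE
    combining   = "A\x030A"      -- 'A' followed by COMBINING RING ABOVE

    main :: IO ()
    main = print (precomposed == combining)   -- False, without normalisation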

What about bidirectional scripts?  Especially for the Arabic
script which is a cursive (joined) script, where in addition
vowels are combining characters.

Furthermore, Unicode characters in the "extended range" (no characters
allocated yet) are encoded using two *non-character* 16-bit codes
(when using UTF-16, which is the preferred encoding for Unicode).

What would "Unicode preprocessing" be?  UTF-16 decoding?
Java-ish escape sequence decoding?

...
 3) What does it mean that Char can include any Unicode character?

I think it *does not* mean that a Char can hold any Unicode
character.  I think it *does* mean that it can hold any single
(UTF-16) 16-bit value.  Which is something quite different.  To store
an arbitrary Unicode character 'straight off', one would need at
least 21 bits to cover the UTF-16 range.  ISO/IEC 10646-1 allows
for up to 31 bits, but nobody(?) is planning to need all that.
Some use 32-bit values to store Unicode characters.  Perfectly
allowed by 10646, though not by Unicode proper.  Following Unicode
proper one would always use sequence of UTF-16 codes, in order to
be able to treat a "user perceived character" as a single entity
both for UTF-16 reasons, and also for combining sequences reasons,
independently of how the "user perceived character" was given as
Unicode characters.

/kent k

PS
Java gets some Unicode things wrong too.  Including that Java's
UTF-8 encoding is non-conforming (to both Unicode 2.0 and ISO/IEC
10646-1 Amd. 2).






Re: Int overflow

1997-10-30 Thread Kent Karlsson

This is my third resend of this message.  Previous (partial?) failures
appear to be due to the fact that "reply" cannot be used and/or MIME
attachments cannot be used.  Apologies to anyone seeing this message for
the umpteenth time.  (And this is the *only* mailing list that I have
trouble with...)

    /Kent Karlsson


Dave Tweed wrote:

 agree. Surely the best idea is to do something equivalent to the IEEE
 floating point standard which defines certain returned bit patterns to
 mean `over/underflow occurred', etc. The programmer can then handle this either
 in the simple way of calling error, or try to carry on in some suitable
 way. In a similar way there could be a tainted bit pattern for
 overflow, perhaps with testing functions built into the prelude. This
 would be even more useful since tainted bit-patterns in further
 calculations are defined to produce a tainted bit-pattern result, so
 overflow needn't be explicitly tested for each atomic operation.

I would just like to point out that IEEE 754 (a.k.a. IEC 559) does **NOT**
have any "tainted bit pattern"/"special value"/whatever-you-want-to-
call-it for overflow.  What IEC 559 DOES specify is:

1. There should be the values positive and negative infinity
   (it also specifies which bit patterns to use).  These do
   **NOT** mean that there was an overflow.  They may be exact
   values.  Infinity arguments do NOT guarantee infinity results.
   E.g. 1/+infinity returns +0 (without any underflow or other
   'notification').

Haskell: It would thus be an error [sic] to always call 'error' when
an infinity is seen.

2. When rounding to nearest (and only then, other rounding modes
   are available) and overflow is *not* trapped, negative overflow
   returns negative infinity and positive overflow returns positive
   infinity.  The default according to IEEE 754 is non-trapping.
   The default rounding is round-to-nearest.

3. When overflow is not trapped, an overflow sets a "sticky bit".

To get hold of the "sticky bits" (and maybe save or reset them) in Haskell
may be difficult.  They are intended for imperative handling.

4. There are "tainted bit patterns" ((quiet) NaNs) to be returned
   when an invalid operation occurred, e.g. 0/0, unless "invalid"
   is trapped.  NaNs are propagated the way you suggest for almost
   all functions (there are suggestions to ignore NaNs in certain
   circumstances).  B.t.w., 1/(-0), e.g., is not 'invalid', it is
   a 'divide-by-zero' and returns -infinity.

All of this is for floating point types and is commonly implemented. 
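
For Haskell readers, these points are easy to observe on Double, which
follows the hardware's IEC 559 behaviour on common platforms:

    main :: IO ()
    main = do
      print (1 / 0     :: Double)   -- Infinity    (divide-by-zero)
      print (1 / (-0)  :: Double)   -- -Infinity   (divide-by-zero, not invalid)
      print (0 / 0     :: Double)   -- NaN         (invalid operation)
      print (1 / (1/0) :: Double)   -- 0.0         (exact, no notification)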

No *similar* standard exists for integer types.  What does exist
in terms of standards for int(eger) arithmetic on computers, 
ISO/IEC 10967-1:1994, Language Independent Arithmetic, part 1
(LIA-1), only specifies what is currently commonly implemented,
and does not attempt to impose "new" requirements on integer
types. Overflow checking is, unfortunately, optional, and there
are no specifications for integer NaNs or integer infinities.
(There is nothing stopping them either, but without hardware
support, their implementation is likely to be comparatively slow.)

That said, I think for "int" in Haskell overflow checking should
be done for all the present "int" functions that "can" overflow
(+, -, *, ^, ...).  Special wrapping functions for some of these
(call them, say: +:, -:, *:, ...), should be added (in some library)
for those rare instances where wrapping is what is desired.
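
A sketch of how such a checked/wrapping split could look on a fixed-width
Int (the operator name +: is taken from the paragraph above; the overflow
test assumes (+) wraps around in two's complement, as GHC's Int does):

    addChecked :: Int -> Int -> Int
    addChecked x y
      | (y > 0 && s < x) || (y < 0 && s > x) = error "Int overflow in addition"
      | otherwise                            = s
      where s = x + y           -- wrapping machine addition

    (+:) :: Int -> Int -> Int   -- explicitly wrapping addition
    (+:) = (+)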

Ada, by comparison, has separate types for "overflowing" integers
(like 'Integer' or 'type Foo is range 0..2**16-1;'), where +, -, etc.
overflow when appropriate, and "modulo" integers (like 'type Bar
is mod 2**16;'), where the result of doing +, -, etc. is computed
modulo the "size" of the type.

R.
/kent k


PS

There are no *signed* "wrapping" integer types in Ada, a.f.a.I.k.

By "functions" in point 4, I was referring to certain "standard" functions
that take floating point argument(s) and returns a floating point result.





Re: Polymorphic recursion

1993-12-10 Thread Kent Karlsson



 Dear people interested in Haskell 1.3,
Disclaimer: I'm *not* a member of any "Haskell 1.3" committee,
if any such committee has been formed.

 One modest extension we could make to the Haskell type system is

   to permit polymorphic recursion if 

   a type signature is provided

I agree that this would be a good idea!  Both for the reason you give
and the reason below:

  Having done this change, one could (should!) remove section 4.5.1
(Dependency Analysis). This would have the consequence that some
more type signatures may sometimes be required when using version 1.3
compared to using version 1.2.  I don't think that would be too bad...
   To get consistency between implementations one should (instead!)
require that a declaration group is type checked in its entirety,
*not* splitting it up into smaller declaration groups, even
when possible.
   The reason for removing section 4.5.1 (except that I don't like it) is:
  

 Even though the split up of let-expressions into declaration cliques
 can be expressed as a source code transformation, the same cannot be
 done for where-declarations (modules, classes, instances, value-
 declaration clauses, case-clauses).  The latter two can be expressed
 as source code transformations, but only after doing other source
 transformations making then into let-expressions (including trans-
 forming away guards).  These transformations may not be desirable in
 all implementations. In particular it may make it hard to produce good
 type error messages.  For modules, classes (with default declarations),
 and instances the split into declaration cliques cannot be expressed
 as a *source* transformation.

(Note that classes and instances already have type signatures, so there
would be no need to add any extra type signatures in these cases.)

   So if we (you!) permit the use of a polymorphic function at different,
smaller or equal (rather than just equal), instances of the type
of the function within the declaration group *if* a type signature
is provided, then we can actually get a *simpler* type system!
That is, the requirement to transform to declaration cliques before
type checking can be removed, a transformation that cannot always be
expressed as a source transformation.  We get the added benefit
of being able to write recursive functions for which the trans-
formation to declaration cliques does not do the trick.
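
For concreteness, the classic kind of definition that needs this: the
recursive call is at a different instance of the type, so it is rejected
without a signature but accepted once the signature is given.

    data Nested a = Nil | Cons a (Nested [a])

    -- The recursive call is at type Nested [a], not Nested a:
    -- polymorphic recursion, type-checkable only with the signature.
    nlength :: Nested a -> Int
    nlength Nil         = 0
    nlength (Cons _ xs) = 1 + nlength xs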

   Also, giving type signatures and checking that the types are a
fixed point is more obviously correct than deriving some (non-greatest!)
fixed-point types for a recursive declaration group.  Deriving the
*greatest* fixed point types would of course be ideal, if that had
been decidable.  But since it isn't (the greatest fixed point types
may be infinite), I support Simon's proposal.

/kent k




Re: re. 1.3 cleanup: patterns in list comprehensions

1993-10-15 Thread Kent Karlsson



   Patterns and
expressions can look very much alike.  Could one possibly expand "exp" to
"if exp" in Haskell 1.3 list comprehensions?  Only to make deterministic
parsing easier...
 

 One should not make the parsing method too much influence the language
 design, PASCAL is a bad example for this.

True.

 I once had the same problem when writing a Miranda-ish compiler.
 The simple solution is to use "exp" instead of "pat" in the parsing
 grammar (in qualifiers), and when it later (e.g. encountering - )
 becomes clear that the beast has got to be a pattern,
 you check it semantically.  This works, because patterns
 are syntactically a subclass of expressions.

False.  Patterns (in Haskell) also have "_", "~", and "@" which are
not allowed in expressions.  Using exp instead of pat (and adding
"_", "~", and "@" to expressions) is a hack, not a proper solution.
I don't like hacks.  So, I either have to massage the grammar into
deterministic LR parsable form (difficult) or use a nondeterministic
LR parser (not readily available).

 This extra effort in parser hacking is a small, one-time effort,
 compared to the really hard stuff in the compiler!

True. I'm not going to insist on a change.

/kent k




Re: re. 1.3 cleanup: patterns in list comprehensions

1993-10-14 Thread Kent Karlsson


 On the other hand, I think that the  pat=expr  syntax was something of a
 design error and it may not be supported in future releases.  Judging from
 email that I've received, the similarity of == and = does cause confusion.
 In fact, it has also caught me on at least one occasion!  (So yes, my
 experience is somewhat at odds with Nikhil's here.)  As a result, Gofer 2.28
 supports an alternative (and more general) syntax, with qualifiers of the
 form  let {decls}  and a semantics given by:

   [ e | let { decls } ]   =   [ let { decls } in e ]

 Parsing this doesn't cause any conflicts with standard Haskell syntax (as far
 as I can tell), and the braces can usually be omitted so there isn't a big
 syntactic overhead.
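
For reference, a small use of the let qualifier quoted above, in the form
that later made it into Haskell:

    pythag :: [(Int, Int, Int)]
    pythag = [ (a, b, c) | c <- [1..20], b <- [1..c], a <- [1..b]
             , let s = a*a + b*b
             , s == c*c ]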

Parsing Haskell list comprehensions deterministically ((LA)LR) is currently very
hard, since both "pat <- exp" (or also "pat gd <- exp", as suggested by Thomas)
and "exp" are allowed as qualifiers in a list comprehension. Patterns and
expressions can look very much alike.  Could one possibly expand "exp" to
"if exp" in Haskell 1.3 list comprehensions?  Only to make deterministic
parsing easier...

/kent k




Re: + and -: syntax wars!

1993-05-27 Thread Kent Karlsson


Oops, PreludeCore cannot be hidden.  I guess I've made a fool of myself
(but that happens often :-).

 Can't we find anything more interesting to discuss than the syntax??
You are welcome to! :-)   But sweeping syntax matters under the carpet
does not improve anything. 


 |  ... But what I find a bit strange is that even when + and -
 | are overridden locally n+k and prefix - still have their old meanings.
 | Well, it's just one more exception to the rule to remember about Haskell.
 Yes, but we need to emphasize that rebinding such operators is a Bad Idea.
 (Maybe Phil is right, that we should simply forbid it.)

   I agree that it should be forbidden, not for the love of prohibitions,
but in order to detect more errors in programs statically, and to avoid
some quite unnecessary ways to muddle a Haskell program.  But there are
several degrees to which rebinding could be forbidden. Here are some
of the alternatives (sorry if you find this confusing/confused :-):

1. Forbidding rebinding + and -.
2. Forbidding rebinding operators/function names exported from
   classes in PreludeCore.
(Except in instance declarations, of course.)
3. Forbidding rebinding operators/function names declared by
   classes in scope.
(Except...)
4. Forbidding rebinding any name exported by PreludeCore.
5. Forbidding rebinding any name in scope.

I don't like singling out +, -, and PreludeCore more than necessary, so
alternative 3 (plus remark below) or 5 are good candidates in my opinion.

   I still think that Lennart's quiz declaration should be illegal at least on the
grounds Paul gave (i.e., even if the name (+) is replaced by some other name):
Names bound by the "lhs"es (in each let/where declaration part)
should not be allowed to be rebound by some argument pattern
within one of the "funlhs"es in the declaration. 


Syntactically confused
/kent k




Re: Division, remainder, and rounding functions

1992-02-17 Thread Kent Karlsson


Thanks Joe!  I still don't know why anyone would want
the 'divTruncateRem' function and its derivatives, but ok,
leave them there.  Why not add division with "rounding"
AWAY from zero as well. :-)

/kent k

(I've sent some detail comments directly to Joe.)