Re: [Haskell-cafe] Re: Roman Numerals and Haskell Syntax abuse
Now 134 characters:

roman=(!6);n!a|n<1=""|n>=t=s!!a:(n-t)!a|c>=t=s!!(2*e):c!a|1>0=n!(a-1)where(d,m)=a`divMod`2;e=d+m-1;s="ivxlcdm";c=10^e+n;t=10^d*(1+4*m)

Gosh! Anyway, you missed the roman symbols for 5000 (U+2181) and 10000 (U+2182)... ;-) The ones for 50000 and 100000 aren't in Unicode yet, nor is the canopy used to write even larger values (see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2738). /Kent K

___ Haskell-Cafe mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell-cafe
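For readers who do not enjoy decoding the golfed version: the same greedy algorithm can be written straightforwardly. This is a sketch; the subtractive-pairs table and the name roman' are mine, not from the thread.

```haskell
-- De-golfed sketch of the same greedy roman-numeral conversion.
-- The pair list and naming are my own, not the original poster's.
roman' :: Int -> String
roman' n = go n pairs
  where
    pairs = [ (1000, "m"), (900, "cm"), (500, "d"), (400, "cd")
            , (100,  "c"), (90,  "xc"), (50,  "l"), (40,  "xl")
            , (10,   "x"), (9,   "ix"), (5,   "v"), (4,   "iv")
            , (1,    "i") ]
    go 0 _ = ""
    go k pps@((v, s) : rest)
      | k >= v    = s ++ go (k - v) pps   -- emit symbol, subtract its value
      | otherwise = go k rest             -- try the next smaller symbol
    go _ [] = ""
```

For example, roman' 1984 yields "mcmlxxxiv", the same output as the golfed roman.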
RE: Why are strings linked lists?
GHC 6.2 (shortly to be released) also supports toUpper, toLower, and the character predicates isUpper, isLower etc. on the full Unicode character set. There is one caveat: the implementation is based on the C library's towupper() and so on, so the support is only as good as the C library provides, and it relies on wchar_t being equivalent to Unicode (the sensible choice, but not all libcs do this).

Now, why would one want to base this on C's wchar_t and its w routines? wchar_t is sometimes (isolated) UTF-32 code units, including on Linux, sometimes (isolated) UTF-16 code units, including on Windows, and sometimes something utterly useless. The casing data is not reliable (it could be entirely wrong, and even locale dependent in an erroneous way), nor kept up to date with the Unicode character database in all implementations (even where wchar_t is some form of Unicode/10646). wchar_t is best forgotten, especially for portable programs.

Please instead use ICU's UChar32, which is (isolated) UTF-32, and Unicode::isUpperCase(cp), Unicode::toUpperCase(cp) (C++ here), etc. The ICU data is kept up-to-date with Unicode versions. The case mappings are the simplistic ones, taking only the UnicodeData.txt case mapping data into account, not SpecialCasing.txt. They are thus not locale dependent, nor context dependent, and do not case-map a character to more than one character (so they are not fully appropriate for strings, but still much, much better than C's wchar_t and its w-functions).

Proper support for character set conversions in the I/O library has been talked about for some time, and there are a couple of implementations. One can base this on the ICU character encoding conversions. I would very much recommend that over the C locale dependent mb conversion routines, for the same reasons as above.

/kent k

___ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell
RE: Language-Independent Arithmetic
Alastair Reid wrote:

On Sunday 22 June 2003 6:30 am, Ashley Yakeley wrote: From the Haskell 98 Report, sec. 6.4: "The default floating point operations defined by the Haskell Prelude do not conform to current language independent arithmetic (LIA) standards. These standards require considerably more complexity in the numeric structure and have thus been relegated to a library." Is this true? Which library?

If I recall correctly, the LIA standard requires control over rounding modes,

No, it does not. But IEC 60559 (a.k.a. IEEE 754, or IEEE f.p. arithmetic) does. (Side remark: the quote above refers only to LIA-1. LIA-2 (elementary functions) is now done, and LIA-3 (complex integers and complex floating point) is in the works. All three LIA parts are relevant to Haskell, since Haskell includes elementary functions and complex floating point values and operations.)

requires that you provide several variants of each comparison operation which respond differently to +0, -0, infinity, NaN, etc.

No, it does not. Nor does IEC 60559, while it informatively suggests the possibility w.r.t. NaNs (not w.r.t. signed zeroes or infinities); I don't think that suggestion has been picked up by anyone, though. The C committee considered it, but apparently rejected it.

I think some of the obvious type signatures would have to change too.

How? I haven't been looking into this for quite a while, but I don't recall any such problem. There are a few missing constants, and a few missing operations, though. A major problem, however, is error handling. While LIA allows for "write error message and terminate", that's rarely the best way of handling arithmetic errors. The "recording of indicators" approach, much like IEC 60559 default error handling, is problematic in Haskell due to the hidden state.

What would be needed to conform to LIA would be to add a library providing all the operations. The default ops (i.e., the Prelude) would still not conform to LIA, but that may not be such a big deal.
It is the intent for LIA-1 that most programming languages (and their implementations) should be able to conform to LIA-1 without too much trouble. Even if it means "terminate on error" in a conforming mode of operation. /Kent Karlsson (Current editor of the LIA series of standards.) -- Alastair Reid
RE: gcd 0 0 = 0
Let me try again:

greatest - maximum/supremum of a set of integers (plain everyday order)
common - intersection (plain everyday intersection of sets)
divisor (of an integer value v) - an integer value m, such that v/m is defined and, if so, is an integer
factor (of an integer value v) - an integer value m, such that there is an integer value n such that m*n=v

So (mock Haskell syntax; set expression really):

greatest_common_divisor a b = max (intersection {all divisors of a} {all divisors of b})

What is the supremum (result of max in the expression above) if a and b are both 0? (You're allowed to use values not prescribed by Haskell to exist. ;-) (You can replace divisors by factors in that expression and still get the same result.)

I may agree that an operation *similar* to gcd, where 0,0 as argument returns 0, is useful (maybe even more useful than gcd!). But that operation is still not the gcd (and might even return other results than gcd also for other value pairs than 0,0; in particular negatives; depending on what is found most useful). If you want to replace gcd by some other, similar, operation, please go ahead. But call it something else, because it is something else. If you want to generalise that to polynomials or Gaussian integers (or at least imaginary integers, as opposed to complex integers), fine (though not for the current standard Haskell library). (Michael, I am interested in the Gaussian integer variety of this. If you like, you can expand on what you said in an off-list message, or give me a reference.)

Kind (and somewhat fundamentalist) regards /kent k

-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jan de Wit Sent: den 19 december 2001 01:15 To: [EMAIL PROTECTED] Subject: Re: gcd 0 0 = 0

Why not define gcd a b as the largest (in 'normal' order) integer d such that the set of sums of multiples of a and b {na+mb | n <- Z, m <- Z} is equal to the set of multiples of d {nd | n <- Z}?
Easy to understand, no talk of division, lattices, rings, ideals etcetera, and it covers the cases with 0. Cheers, Jan de Wit
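Jan de Wit's characterization can be checked mechanically by brute force on a finite window of integers. This is a sketch; the window size and all names are mine, not from the thread.

```haskell
import Data.List (nub, sort)

-- Brute-force check of the characterization: {n*a+m*b} = {n*d} as sets,
-- restricted to a finite window of integers. The window size is arbitrary.
window :: Int
window = 48

-- All multiples of d inside the window (for d = 0 this is just {0}).
multiplesOf :: Int -> [Int]
multiplesOf 0 = [0]
multiplesOf d = sort [v | v <- [-window .. window], v `mod` d == 0]

-- All values n*a + m*b inside the window, for n, m ranging over a window.
combinations :: Int -> Int -> [Int]
combinations a b =
  sort (nub [v | n <- [-window .. window], m <- [-window .. window]
               , let v = n*a + m*b, abs v <= window])

-- Does d satisfy Jan's definition for a and b (within the window)?
matches :: Int -> Int -> Int -> Bool
matches a b d = combinations a b == multiplesOf d
```

Note that d = 0 is the unique nonnegative solution when a = b = 0, which is exactly Jan's point.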
RE: gcd 0 0 = 0
Simon == Simon Peyton-Jones [EMAIL PROTECTED] writes: Simon Christoph does not like this I still don't like this. 0 has never, and will never, divide anything, in particular not 0. 0 may be a prime factor of 0 (see also below!), but that is different. It is not the greatest (in the ordinary sense) divisor of 0. Indeed, +infinity is a much larger divisor of 0... I'm not in favour of using a very special-purpose order, not used for anything else, and that isn't even an order but a preorder, just to motivate gcd 0 0 = 0. Even if using this very special-purpose preorder, an infinity would be included in the 'top' equivalence class, and if we pick a representative value on the basis of which is 'greater' in the ordinary sense for integers augmented with infinities(!), then +infinity should be the representative value. Thus, in any case, gcd 0 0 = +infinity. This is easy enough for Integer, where +infinity and -infinity can easily be made representable (and should be made representable), but harder for a 'pure hardware' Int datatype. But in an ideal world, gcd 0 0 = +infinity with no error or exception. It's OK if the definition is clear; it wasn't using the words positive or greatest integer. Stating gcd 0 0 = 0 explicitly is a good thing, even if it could be expressed verbatim; people may think about the mathematical background, but they should not need to think about the meaning of the definition. Anyway, I'm still against promoting 1 to a prime number :-) Why? If EVERY natural number is to have a prime factorisation, then BOTH 0 AND 1 have to be promoted to prime numbers; otherwise 1 and 0 cannot be prime factorised; in addition to that 1 is then a prime factor of any number (that can be excluded from the *minimal* list of prime factors except for 1)... There is no fundamental reason to except 1 from being a prime number. But there is a fundamental reason to say that 0 can never be a divisor (i.e. 
0|0 is false; x|y is true iff x is a *non-zero* factor of y; the 'non-zero' part is often left implicit (e.g. one is only talking about strictly positive integers), which is part of the reason why we are having this discussion). If you want something similar to gcd, but that returns 0 for 0 and 0, then it is the 'product of all common prime factors'; where 1 has the (non-minimal) prime factorisation [1, 1, ...], 0 has the (non-minimal) prime factorisation [0, 1, 2, ...], and 1 is included at least once in the (non-minimal) prime factorisation of any natural number. If you want a parallel to the divides relation where 0 and 0 are related: 0 is a factor of 0. A prime number is a number that has no integer *between* 1 and itself as factors. People often say except instead of between, but that does not work for 0, nor for the non-minimal prime factorisations that people seem to be interested in, given the interest in having gcd 0 0 = 0 (which isn't the gc*d*!). Again, the context is often strictly positive integers, and 'between' and 'except' are then equivalent. For no apparent reason 1 is usually also excepted, but that does not work for the prime factorisation of 1, nor for finding the product of all common prime factors of 1 and another natural number... For integers, -1 is also a prime number, and for imaginary integers, i is also a prime number... I'm sure somebody can give a nice definition of a partial order (not just preorder) lattice with 1 as the min value and 0 as the max value (just larger than the infinities), if you absolutely want a lattice with a gcd-*like* meet and lcm-*like* join for this (the, positive bias, factor-of order). I'd be happy to support such gcd-*like* (pcf?) and lcm-*like* functions, but they aren't the gcd, nor the lcm (e.g. pcf (-1) (-1) = -1, not 1, etc.). If you don't like adding these, then I suggest leaving things completely as they are. 
Squeezing two operations into one just because they have the same results over the first quadrant is not something I find to be too good. Odd one out? /kent k
RE: GCD
I don't think preorders of any kind should be involved here. Just the ordinary order on integers. No divisibility preorder (I'm not sure how that is even defined, so how it could be natural beats me), no absolute value. I find the unaltered text Simon quoted to be fine as is. But for those who like to be more precise (forgive the TeXese):

% Most of you may wish to stop reading at this point.
% I is the set of integers representable in the integral datatype.
% result_I may return overflow or the argument, as appropriate.
\begin{example}\atab
$gcd_I : I \times I \rightarrow I \cup \{\overflow, \infinitary\}$
\end{example}
\begin{example}\atab
$gcd_I(x,y)$\\
\>$= result_I(\max\{v \in \ZZ ~~|~~ v|x $ and $ v|y\})$\\
\>\>\if $x,y \in I$ and ($x \neq 0$ or $y \neq 0$)\\
\>$= \infinitary(\posinf)$ \if $x = 0$ and $y = 0$
\end{example}
% There is no need to say $v > 0$ above, since there are always positive values in that
% set, and max picks the largest/greatest one. 0 has all integer values except(!) 0
% as divisors. So for gcd 0 0 (maximum, supremum really, of the intersection of the two
% sets of divisors) the result is really positive infinity, which should be the result
% returned when representable (recommendable for Haskell's Integer datatype). gcd will
% overflow for instances like gcd (minBound::Int) (minBound::Int).
\begin{example}\atab\\
$lcm_I : I \times I \rightarrow I \cup \{\overflow\}$
\end{example}
\begin{example}\atab
$lcm_I(x,y)$\\
\>$= result_I(\min\{v \in \ZZ ~~|~~ x|v $ and $ y|v $ and $ v > 0\})$\\
\>\>\if $x,y \in I$ and $x \neq 0$ and $y \neq 0$\\
\>$= 0$ \if $x,y \in I$ and ($x = 0$ or $y = 0$)
\end{example}
% the $v > 0$ is needed here, since the set here would otherwise always contain
% infinitely many negative values, and then minimum of that...
Kind regards /kent k

-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of S.M.Kahrs Sent: den 11 december 2001 11:21 To: [EMAIL PROTECTED] Subject: Re: GCD

The natural reading of 'greatest' is, of course, the greatest in the divisibility preorder (it's a partial order on natural numbers but only a preorder on integers). Thus, gcd 0 0 = 0. 3 and -3 are equivalent in that preorder. Thus, an additional comment may be in order. Stefan
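An executable reading of the gcd_I / lcm_I spec above can be sketched as follows. Maybe stands in for the exceptional values, and the names gcdI / lcmI are mine; overflow is not modelled.

```haskell
-- Sketch: Nothing models the exceptional result of the spec,
-- i.e. infinitary(+infinity) for gcd_I 0 0. Overflow is not modelled
-- (Integer is unbounded), which is why lcmI never returns Nothing here.
gcdI :: Integer -> Integer -> Maybe Integer
gcdI 0 0 = Nothing               -- supremum of all divisors of 0: +infinity
gcdI x y = Just (gcd x y)        -- max of the common divisors, always positive

lcmI :: Integer -> Integer -> Maybe Integer
lcmI x y
  | x == 0 || y == 0 = Just 0    -- the spec's explicit zero case
  | otherwise        = Just (abs (x * y) `div` gcd x y)  -- least positive common multiple
```

For example, gcdI 0 0 is Nothing (the +infinity case), while lcmI 0 7 is Just 0, matching the two explicit branches of the spec.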
Re: Haskell 98 - Standard Prelude - Floating Class
----- Original Message ----- From: Jerzy Karczmarczuk [EMAIL PROTECTED] ...

Simon Peyton-Jones: Russell O'Connor suggests: | but sinh and cosh can easily be defined in terms of exp | | sinh x = (exp(x) - exp(-x))/2 | cosh x = (exp(x) + exp(-x))/2 | I suggest removing sinh and cosh from the minimal complete | definition, and add the above defaults. This looks pretty reasonable to me. We should have default methods for anything we can. Comments?

Three. 1. Actually, I wouldn't even call that default definitions. These ARE definitions of sinh and cosh.

Mathematically, yes. Numerically, no. Even if 'exp' is implemented with high accuracy, the suggested defaults may return a very inaccurate (in ulps) result. Take sinh near zero. sinh(x) with x very close to 0 should return x. With the above 'default', sinh(x) will return exactly 0 for a relatively wide interval around 0, which is the wrong result except for 0 itself.

In general, this is why LIA-2 (Language Independent Arithmetic, part 2, Elementary numerical functions, ISO/IEC 10967-2:2001) rarely attempts to define one numerical operation in terms of other numerical operations. That is done only when the relationship is exact (even if the operations themselves are inexact). That is not the case for the abovementioned operations. But it is the case for the relationship between the complex sin operation and the complex sinh operation, for instance. (Complex will be covered by LIA-3.)

Kind regards /Kent Karlsson
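The cancellation described here is easy to demonstrate in GHC. This is a sketch: betterSinh is my illustration of an expm1-based formulation, not a proposal from the thread.

```haskell
import GHC.Float (expm1)  -- e^x - 1, computed accurately near zero

-- The proposed default: catastrophic cancellation near zero,
-- since exp x and exp (-x) both round to 1.0 for tiny x.
naiveSinh :: Double -> Double
naiveSinh x = (exp x - exp (-x)) / 2

-- The same function via u = expm1 x, using the exact identity
-- sinh x = (u + u/(1+u)) / 2, which does not cancel near zero.
betterSinh :: Double -> Double
betterSinh x = let u = expm1 x in (u + u / (1 + u)) / 2
```

For x = 1e-300, naiveSinh returns 0 while betterSinh returns x, the correctly rounded result; both agree closely away from zero.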
Re: Unicode support
Just to clear up any misunderstanding:

----- Original Message ----- From: Ashley Yakeley [EMAIL PROTECTED] To: Haskell List [EMAIL PROTECTED] Sent: Monday, October 01, 2001 12:36 AM Subject: Re: Unicode support

At 2001-09-30 07:29, Marcin 'Qrczak' Kowalczyk wrote: Some time ago the Unicode Consortium slowly began switching to the point of view that abstract characters are denoted by numbers in the range U+0000..10FFFF. It's worth mentioning that these are 'codepoints', not 'characters'.

Yes, but characters are allocated to code points (or rather code positions).

Sometimes a character will be made up of two codepoints, for instance an 'a' with a dot above is a single character that can be made from the codepoints LATIN SMALL LETTER A and COMBINING DOT ABOVE.

Well, those ARE characters, which together form a GRAPHEME (which is what Joe User would consider to be a character). Those two happen to 'combine' in NFC to LATIN SMALL LETTER A WITH DOT ABOVE. But that is just that example. LATIN SMALL LETTER R and COMBINING SHORT STROKE OVERLAY (yes, this is used in some places, but will never get a precomposed character) are left as is also for NFC. Both of these examples, for either normal form, MAY each be handled by one (ligature, if you like) glyph or by two (overlaid) glyphs by a font.

Further, some code points are permanently reserved for UTF-16 surrogates, some are permanently reserved as non-characters(!), some are for private use (which can be used for things not yet formally encoded, or things that never will be encoded), and quite a lot are reserved for future standardisation.

The 8, 16, or 32-bit units in the encoding forms are called 'code units'. E.g. Java's 'char' type is for UTF-16 code units, not characters! Though a single UTF-16 code unit can represent a character in the BMP (if that code position has a character allocated to it). In many cases, but definitely not all, a single character, in its string context, is a grapheme too.
In summary:

code position (= code point): a value between 0000 and 10FFFF.
code unit: a fixed bit-width value used in one of the encoding forms (often called char in programming languages).
character: hard to give a proper definition (the 10646 one does not say anything), but in brief roughly a thing deemed worthy of being added to the repertoire of 10646.
grapheme: a sequence of one or more characters that naïve users think of as a character (may be language dependent).
glyph: a piece of graphic that may image part of, a whole, or several characters in context. It is highly font dependent how the exact mapping from characters to positioned glyphs is done. (The partitioning into subglyphs, if done, need not be tied to Unicode decomposition.) For most scripts, including Latin, this mapping is rather complex (and is yet to be implemented in full).

Perhaps this makes the UTF-16 'surrogate' problem a bit less serious, since there never was a one-to-one correspondence between any kind of n-bit unit and displayed characters.

With that I agree.

Kind regards /kent k -- Ashley Yakeley, Seattle WA
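The code point vs. grapheme distinction above shows up directly in Haskell terms, since a String is a list of code points, not graphemes. A sketch:

```haskell
-- One grapheme, two code points:
-- LATIN SMALL LETTER A (U+0061) + COMBINING DOT ABOVE (U+0307).
decomposed :: String
decomposed = "a\x0307"

-- The same grapheme as one precomposed code point:
-- LATIN SMALL LETTER A WITH DOT ABOVE (U+0227).
precomposed :: String
precomposed = "\x0227"
```

length decomposed is 2 while length precomposed is 1, and the two strings are not (==) even though they are canonically equivalent: equality up to normalisation needs a library, not the Prelude.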
Re: Unicode support
----- Original Message ----- From: Ashley Yakeley [EMAIL PROTECTED] To: Kent Karlsson [EMAIL PROTECTED]; Haskell List [EMAIL PROTECTED]; Libraries for Haskell List [EMAIL PROTECTED] Sent: Tuesday, October 09, 2001 12:27 PM Subject: Re: Unicode support

At 2001-10-09 02:58, Kent Karlsson wrote: In summary: code position (= code point): a value between 0000 and 10FFFF. Would this be a reasonable basis for Haskell's 'Char' type?

Yes. It's essentially UTF-32, but without the fixation to 32 bits (21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited to 10FFFF instead of 31(!) bits) is the datatype used in some implementations of C for wchar_t. As I said in another e-mail, if one does not have high efficiency concerns, UTF-32 is a rather straightforward way of representing characters.

At some point perhaps there should be a 'Unicode' standard library for Haskell. For instance: encodeUTF8 :: String -> [Word8]; decodeUTF8 :: [Word8] -> Maybe String; encodeUTF16 :: String -> [Word16]; decodeUTF16 :: [Word16] -> Maybe String; data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ... getGeneralCategory :: Char -> Maybe GeneralCategory;

There isn't really any Maybe there. Yet unallocated code positions have general category Cn (so do non-characters): Cs Other, Surrogate; Co Other, Private Use; Cn Other, Not Assigned (yet).

...sorting, searching... ...canonicalisation... etc. Lots of work for someone.

Yes. And it is lots of work (which is why I'm not volunteering to make a quick fix: there is no quick fix).

Kind regards /kent k
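A minimal encodeUTF8 along the lines proposed here can be sketched as follows. It assumes well-formed input (no surrogate code points); the decoder and error handling are omitted.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Sketch of encodeUTF8 :: String -> [Word8]. Assumes the String
-- contains no surrogate code points; a full version should check.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap enc
  where
    enc ch
      | c < 0x80    = [byte c]                                      -- 1 byte
      | c < 0x800   = [0xC0 .|. byte (c `shiftR` 6), cont c]        -- 2 bytes
      | c < 0x10000 = [0xE0 .|. byte (c `shiftR` 12),               -- 3 bytes
                       cont (c `shiftR` 6), cont c]
      | otherwise   = [0xF0 .|. byte (c `shiftR` 18),               -- 4 bytes
                       cont (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      where c = ord ch
    cont c = 0x80 .|. (byte c .&. 0x3F)   -- continuation byte: 10xxxxxx
    byte :: Int -> Word8
    byte = fromIntegral
```

Note how an ASCII-only String encodes to the identical byte sequence, which is the backward-compatibility property discussed elsewhere in this thread.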
Re: Unicode support
----- Original Message ----- From: Ketil Malde [EMAIL PROTECTED] ...

for a long time. 16 bit unicode should be gotten rid of, being the worst of both worlds, non backwards compatible with ascii, endianness issues and no constant length encoding. utf8 externally and utf32 when working with individual characters is the way to go.

I totally agree with you.

Now, what are your technical arguments for this position? (B.t.w., UTF-16 isn't going to go away, it's very firmly established.)

What's wrong with the ones already mentioned? You have endianness issues, and you need to explicitly type text files or insert BOMs.

You have to distinguish between the encoding form (what you use internally) and the encoding scheme (externally). For the encoding form, there is no endian issue, just like there is no endian issue for int internally in your program. For the encoding form there is no BOM either (or rather, it should have been removed upon reading, if the data is taken in from an external source). But I agree that the BOM (for all of the Unicode encoding schemes) and the byte order issue (for the non-UTF-8 encoding schemes; the external ones) are a pain. But as I said: they will not go away now, they are too firmly established.

An UTF-8 stream limited to 7-bit ASCII simply is that ASCII stream. Which is a large portion of the raison d'être for UTF-8. When not limited to ASCII, at least it avoids zero bytes and other potential problems. UTF-16 will, among other things, be full of NULLs.

Yes, and so what? So will a file filled with image data, video clips, or plainly a list of raw integers dumped to file (not formatted as strings). I know, many old utility programs choke on NULL bytes, but that's not Unicode's fault. Further, NULL (as a character) is a perfectly valid character code. Always was.

I can understand UCS-2 looking attractive when it looked like a fixed-length encoding, but that no longer applies.

So it is not surprising that most people involved do not consider UTF-16 a bad idea.
The extra complexity is minimal, and further surfaces rarely.

But it needs to be there. It will introduce larger programs, more bugs

True. But implementing normalisation, or case mapping for that matter, is non-trivial too. In practice, the additional complexity with UTF-16 seems small.

, lower efficiency.

Debatable. BMP characters are still (relatively) easy to process, and it saves memory space and cache misses when large amounts of text data are processed (e.g. databases).

I couldn't find anything about the relative efficiencies of UTF-8 and UTF-16 on various languages. Do you have any pointers? From a Scandinavian POV (using ASCII plus a handful of extra characters), UTF-8 should be a big win, but I'm sure there are counter examples.

So, how big is your personal hard disk now? 3GiB? 10GiB? How many images, mp3 files and video clips do you have? (I'm sorry, but your argument here is getting old and stale. Very few worry about that aspect anymore. Except when it comes to databases stored in RAM, and UTF-16 vs. UTF-32, which is guaranteed to be wasteful.)

Kind regards /kent k
Re: Unicode
----- Original Message ----- From: Ketil Malde [EMAIL PROTECTED] To: Dylan Thurston [EMAIL PROTECTED] Cc: Andrew J Bromage [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Monday, October 08, 2001 9:02 AM Subject: Re: UniCode

(The spelling is 'Unicode' (and none other).)

Dylan Thurston [EMAIL PROTECTED] writes: Right. In Unicode, the concept of a character is not really so useful;

After reading a bit about it, I'm certainly confused. Unicode/ISO-10646 contains a lot of things that aren't really one character, e.g. ligatures.

The ligatures that are included are there for compatibility with older character encodings. Normally, for modern technology..., ligatures are (to be) formed automatically through the font. OpenType (OT, MS and Adobe) and AAT (Apple) have support for this. There are often requests to add more ligatures to 10646/Unicode, but they are rejected since 10646/Unicode encode characters, not glyphs. (With two well-known exceptions: for compatibility, and certain dingbats.)

most functions that traditionally operate on characters (e.g., uppercase or display-width) fundamentally need to operate on strings. (This is due to properties of particular languages, not any design flaw of Unicode.)

I think an argument could be put forward that Unicode is trying to be more than just a character set. At least at first glance, it seems to

Yes, but:

try to be both a character set and a glyph map, and incorporate things

not that. See above.

like transliteration between character sets (or subsets, now that Unicode contains them all), directionality of script, and so on.

Unicode (but not 10646) does handle bidirectionality (see UAX 9: http://www.unicode.org/unicode/reports/tr9/), but not transliteration. (Transliteration is handled in IBM's ICU, though: http://www-124.ibm.com/developerworks/oss/icu4j/index.html)

toUpper, toLower - Not OK. There are cases where upper casing a character yields two characters.

I thought title case was supposed to handle this.
I'm probably confused, though.

The titlecase characters in Unicode are (essentially) only there for compatibility reasons (originally for transliterating between certain subsets of Cyrillic and Latin scripts in a 1-1 way). You're not supposed to really use them... The cases where toUpper of a single character gives two characters are for some (classical) Greek, where a built-in subscript iota turns into a capital iota, and other cases where there is no corresponding uppercase letter. It is also the case that case mapping is context sensitive. E.g. mapping capital sigma to small sigma (mostly) or ς (small final sigma) (at end of word), or the capital I to ı (small dotless i), if Turkish, or insert/delete combining dot above for i and j in Lithuanian. See UTR 21 and http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt.

etc. Any program using this library is bound to get confused on Unicode strings. Even before Unicode, there is much functionality missing; for instance, I don't see any way to compare strings using a localized order. And you can't really use list functions like length on strings, since one item can be two characters (Lj, ij, fi) and several items can compose one character (combining characters).

Depends on what you mean by length and character... You seem to be after what is sometimes referred to as grapheme, and counting those. There is a proposal for a definition of language independent grapheme (with lexical syntax), but I don't think it is stable yet.

And map (==) can't compare two Strings, e.g. in the presence of combining characters. How are other systems handling this?

I guess it is not very systematic. Java and XML make the comparisons directly by equality of the 'raw' characters *when* comparing identifiers/similar, though for XML there is a proposal for early normalisation, essentially to NFC (normal form C). I would have preferred comparing the normal forms of the identifiers instead.
For searches, the recommendation (though I doubt it is followed in practice yet) is to use a collation key based comparison. (Note that collation keys are usually language dependent. More about collation in UTS 10, http://www.unicode.org/unicode/reports/tr10/, and ISO/IEC 14651.) What does NOT make sense is to expose (to a user) the raw ordering (<) of Unicode strings, though it may be useful internally. Orders exposed to people (or other systems, for that matter) that aren't concerned with the inner workings of a program should always be collation based. (But that holds for any character encoding, it's just more apparent for Unicode.)

It may be that Unicode isn't flawed, but it's certainly extremely complex. I guess I'll have to delve a bit deeper into it.

It's complex, but that is because the scripts of the world are complex (and add to that politics, as well as compatibility and implementation issues).

Kind regards /kent k
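Haskell's own Data.Char illustrates the limitation discussed in this exchange: its mappings are the simple, context-free 1:1 ones from UnicodeData.txt, so none of the SpecialCasing behaviour is expressible. A sketch:

```haskell
import Data.Char (toLower, toUpper)

-- The Char -> Char type cannot express one-to-many or
-- context-sensitive case mappings:

-- U+00DF (ß) has no 1:1 uppercase mapping in UnicodeData.txt, so
-- toUpper leaves it unchanged; the full mapping "SS" would need
-- a Char -> String (or String -> String) signature.
sharpS :: Char
sharpS = toUpper '\xDF'

-- U+03A3 (capital sigma) always lowercases to U+03C3 (σ) here; the
-- word-final form U+03C2 (ς) would require looking at string context.
smallSigma :: Char
smallSigma = toLower '\x3A3'
```

This is why functions like uppercase fundamentally need to operate on strings, as the quoted text argues.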
Re: Unicode support
----- Original Message ----- From: Wolfgang Jeltsch [EMAIL PROTECTED] To: The Haskell Mailing List [EMAIL PROTECTED] Sent: Thursday, October 04, 2001 8:47 PM Subject: Re: Unicode support

On Sunday, 30 September 2001 20:01, John Meacham wrote: sorry for the me too post, but this has been a major pet peeve of mine for a long time. 16 bit unicode should be gotten rid of, being the worst of both worlds, non backwards compatible with ascii, endianness issues and no constant length encoding. utf8 externally and utf32 when working with individual characters is the way to go.

I totally agree with you.

Now, what are your technical arguments for this position? (B.t.w., UTF-16 isn't going to go away, it's very firmly established.) From what I've seen, those who take the position you seem to prefer are people not very involved with Unicode and its implementation, whereas people who are so involved strongly prefer UTF-16.

Note that nearly no string operation of interest (and excepting low level stuff, like buffer sizes, and copying) can be done on a string looking at individual characters only. Just about the only thing that sensibly can be done on isolated characters is property interrogation. You can't do case mapping of a string (involving Greek or Lithuanian text) without being sensitive to the context of each character. And, as somebody already noted, combining characters have to be taken into account. E.g. Å (U+212B (deprecated), or U+00C5) must collate the same as U+0041,U+030A, even when not collating them among the A's (U+0041).

So it is not surprising that most people involved do not consider UTF-16 a bad idea. The extra complexity is minimal, and further surfaces rarely. Indeed they think UTF-16 is a good idea since the supplementary characters will in most cases occur very rarely, BMP characters are still (relatively) easy to process, and it saves memory space and cache misses when large amounts of text data are processed (e.g. databases).
On the other hand, Haskell implementations are probably still rather wasteful when representing strings, and Haskell isn't used to hold large databases, so going to UTF-32 is not a big deal for Haskell, I guess. (Though I don't think that will happen for Java.)

seeing as how the haskell standard is horribly vague when it comes to character set encodings anyway, I would recommend that we just omit any reference to the bit size of Char, and just say abstractly that each Char represents one unicode character, but the entire range of unicode is not guaranteed to be expressable, which must be true, since haskell 98 implementations can be written now, but unicode can change in the future. The only range guaranteed to be expressable in any representation are the values 0-127 US ASCII (or perhaps latin1)

This sounds also very good.

Why? This is the approach taken by programming languages like C, where the character encoding *at runtime* (both for char and wchar_t) is essentially unknown. This, of course, leads to all sorts of trouble, which some try to mitigate by *suggesting* to have all sorts of locale independent stuff in (POSIX) locales. Nobody has worked out any sufficiently comprehensive set of data for this though, and nobody ever will, both because it is open-ended and because nobody is really trying. Furthermore, this is not the approach of Java, Ada, or Haskell. And it is not the approach advocated by people involved with implementing support for Unicode (and other things related to internationalisation and localisation). Even C is (slowly) leaving that approach, having introduced the __STDC_ISO_10646__ property macro (with its semantics), and the \u and \U 'universal character names'.

Kind regards /kent k
Re: Unicode support
- Original Message - From: Dylan Thurston [EMAIL PROTECTED] To: John Meacham [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Friday, October 05, 2001 5:47 PM Subject: Re: Unicode support

On Sun, Sep 30, 2001 at 11:01:38AM -0700, John Meacham wrote: seeing as how the haskell standard is horribly vague when it comes to character set encodings anyway, I would recommend that we just omit any reference to the bit size of Char, and just say abstractly that each Char represents one unicode character, but the entire range of unicode is not guaranteed to be expressible, which must be true, since haskell 98 implementations can be written now, but unicode can change in the future. The only range guaranteed to be expressible in any representation is the values 0-127 US ASCII (or perhaps latin1)

I agree about the vagueness, but I believe the Unicode consortium has explicitly limited itself to 21 bits; if they turn out to have been

In some sense yes, but not quite. It's better to say that the code space runs from 0 to 10FFFF (hexadecimal); the encoding forms then handle the bits.

lying about that (which seems unlikely in this millennium), we can

The guesstimate (originally) of less than half a million things to encode as characters has been stable for over a decade. Even though some try to argue that Unicode had to go from 16 bits to more to be able to handle more characters, that was really known from the beginning. There was recently a big bump, adding 41000 Hàn characters that had been collected over a long time; though some more Hàn are expected, no such big bump is expected again. If you're interested, it's gone beyond a guesstimate now, see the roadmap: http://www.evertype.com/standards/iso10646/ucs-roadmap.html (the official version is at the DKUUG site, but the reference is through a cryptic document number). You will see how plane 1 is planned for a number of (mostly) historical scripts.
Disregarding the private use planes (15 and 16), there is nothing planned for planes 3-14, except for some crap in 14 (what is there is there for political reasons only, DO NOT USE), and plane 2 may spill over into plane 3. That leaves ten planes (of 64K code positions each) completely empty, with nothing planned for them. Kind regards /kent k

[quoted text continues:] hardly be blamed for believing them. I think all that should be required of implementations is that they support 21 bits. Best, Dylan Thurston
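The plane arithmetic described above is simple enough to state as code. A sketch (the example code points are mine, picked to match the discussion):

```haskell
-- The code space 0x0..0x10FFFF divides into 17 planes of 0x10000
-- code positions each:
plane :: Int -> Int
plane cp = cp `div` 0x10000

-- plane 0x0041  == 0   (BMP: Latin capital A)
-- plane 0x1D11E == 1   (a musical symbol, plane 1)
-- plane 0x20000 == 2   (CJK extension B, the Hàn "bump")
-- plane 0xF0000 == 15  (private use)
```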
Re: Unicode
- Original Message - From: Dylan Thurston [EMAIL PROTECTED] To: Andrew J Bromage [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Friday, October 05, 2001 6:00 PM Subject: Re: UniCode

On Fri, Oct 05, 2001 at 11:23:50PM +1000, Andrew J Bromage wrote: G'day all. On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote: Why is Char 32 bit? Unicode characters are 16 bit. It's not quite as simple as that. There is a set of one million (more correctly, 1M) Unicode characters which are only accessible using surrogate pairs (i.e. two UTF-16 codes). There are currently none of these codes assigned, and when they are, they'll be extremely rare. So rare, in fact, that the cost of strings taking up twice the space that they currently do simply isn't worth it.

This is no longer true, as of Unicode 3.1. Almost half of all characters currently assigned are outside of the BMP (i.e., require surrogate pairs in the UTF-16 encoding), including many Chinese characters. In current usage, these characters probably occur mainly in names, and are rare, but obviously important for the people involved.

In plane 2 (one of the supplementary planes) there are about 41000 Hàn characters, in addition to the about 27000 Hàn characters in the BMP. And more are expected to be encoded. However, IIRC, only about 6000-7000 of them are in modern use. I don't really want to push for them (since I think they are a major design mistake), but some people like them: the mathematical alphanumeric characters in plane 1. There are also the more likable (IMHO) musical characters in plane 1 (western, though that attribute was removed, and Byzantine!). (You cannot set a musical score in Unicode plain text; it just encodes the characters that you can use IN a musical score.)

... isAscii, isLatin1 - OK

Yes, but why do (or, rather, did) you want them; isLatin1 in particular? Then what about isCP1252 (THE most common encoding today), isShiftJis, etc., for several hundred encodings?
(I'm not proposing to remove isAscii, but isLatin1 is dubious.) isControl - I don't know about this. Why do (did) you want it? There are several kinds of control characters in Unicode: the traditional C0 and (less used) C1 ones, format control characters (NO, they do NOT control FORMATTING, though they do control FORMAT, like cursive connections), ... isPrint - Dubious. Is a non-spacing accent a printable character? A combining character is most definitely printable. (There is a difference between non-spacing and combining, even though many combining characters are non-spacing, not all of them are.) isSpace - OK, by the comment in the report: The isSpace function recognizes only white characters in the Latin-1 range. Sigh. There are several others, most importantly: LINE SEPARATOR, PARAGRAPH SEPARATOR, and IDEOGRAPHIC SPACE. And the NEL in the C1 range. isUpper, isLower - Maybe OK. This is property interrogation. There are many other properties of interest. toUpper, toLower - Not OK. There are cases where upper casing a character yields two characters. See my other e-mail. etc. Any program using this library is bound to get confused on Unicode strings. Even before Unicode, there is much functionality missing; for instance, I don't see any way to compare strings using a localized order. Is anyone working on honest support for Unicode, in the form of a real Unicode library with an interface at the correct level? Well, IBM's ICU, for one, ... But they only do it for C/C++/Java, not for Haskell... Kind regards /kent k ___ Haskell-Cafe mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell-cafe
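The toUpper objection above can be illustrated directly. A Char-to-Char function cannot express one-to-many case mappings, so a string-level function is needed; this sketch special-cases only the classic example, U+00DF (ß), which uppercases to "SS". The name `upcaseString` is mine, and a real implementation would use the full Unicode case mapping data (SpecialCasing.txt).

```haskell
import Data.Char (toUpper)

-- Char -> Char case mapping cannot express one-to-many mappings.
-- This handles only U+00DF (ß) -> "SS"; everything else falls back to
-- the simple per-character mapping.
upcaseString :: String -> String
upcaseString = concatMap up
  where
    up '\x00DF' = "SS"
    up c        = [toUpper c]
```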
SV: Haskell 1.4 and Unicode
Hi! 1. I don't seem to get my messages to this list echoed back to me... (Which I consider a bug.) 2. As I tried to explain in detail in my previous message, (later) options 1 and 2 **do not make any sense**. Option 3 makes at least some sense, even though it has some problems. You could generalise option 4 to make sense too. The layout rule does not generalise well. I still think that one should not give up entirely on it. One way may be to require that "where", and other layout starters, have only spaces (U+0020), no-break spaces (U+00A0) and tabs (U+0009) in front of them on the same line, keeping the width rule for the tabs relative to the spaces. (I know, present Haskell programs are not written that way.) 3. (In reply to Hans Åberg) The easiest way of thinking of Unicode is perhaps as a font encoding; a font using this encoding would add such things as typeface family, style, size, kerning (but Unicode probably does not have ligatures), etc. As everyone (getting) familiar with Unicode should know, Unicode is **NOT** a font encoding. It is a CHARACTER encoding. The difference shows up mostly for 'complex scripts', such as Arabic and Devanagari (used for Hindi), but also in the processing of combining characters for 'latin'. Glyph (at a "font point") selection is based also on *neighbouring* characters. Unicode does have a number of compatibility characters, but the explicit intent is that they should only be used for backwards compatibility reasons. /kent k PS B.t.w. Did you know... that CR and LF should not be used in "newly produced" Unicode texts? One should use Line Separator (U+2028) and Paragraph Separator (U+2029) instead. Line Separator is the one expected to be used in program source files. -----Original Message----- From: John C.
Peterson [SMTP:[EMAIL PROTECTED]] Sent: 8 November 1997 03:25 To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: Haskell 1.4 and Unicode I had option 1 in mind when that part of the report was written. We should clarify this in the next revision. And thanks for your analysis of the problem! John
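The whitespace restriction proposed in point 2 above is easy to state as a predicate. A sketch (the name `onlyLayoutBlanks` is hypothetical, not part of any Haskell report): it checks that the text preceding a layout keyword on its line contains only spaces, no-break spaces and tabs.

```haskell
-- Allowed characters before "where" (and other layout starters) on
-- the same line, under the proposal above:
onlyLayoutBlanks :: String -> Bool
onlyLayoutBlanks = all (`elem` [' ', '\x00A0', '\t'])
```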
SV: Haskell 1.4 and Unicode
Let me reiterate: Unicode is ***NOT*** a glyph encoding! Unicode is ***NOT*** a glyph encoding! and never will be. The same character can be displayed as a variety of glyphs, depending not only on the font/style, but also, and this is the important point, on the characters surrounding a particular instance of the character. Also, a sequence of characters can be displayed as a single glyph, and a character can be displayed as a sequence of glyphs. Which is the case is often font dependent. This is not something unique to Unicode. It is just that most people are used to ASCII, Latin-1 and similar, where the distinction between characters and glyphs is blurred. I would be interested in knowing why you think "the idea of it as a character encoding thoroughly breaks down in a mathematical context". Deciding what gets encoded as a character is more an international social process than a mathematical process... /kent k PS This may be getting too much into Unicode for the Haskell list... In particular, any argumentation regarding the last paragraph above should *not* be sent to the Haskell list, but could be sent to me personally. PPS I don't know what you mean by "semantics of glyphs". Hans Åberg wrote: I leave it to the experts to figure out what exactly Unicode is. I can only note that the idea of it as a character encoding thoroughly breaks down in a mathematical context. I think the safest thing is to only regard it as a set of glyphs, which are better, because ampler, than other encodings. I think figuring out the exact involved semantics of those glyphs is a highly complex issue which cannot fully be resolved.
Re: Haskell 1.4 and Unicode
Carl R. Witty wrote: 1) I assume that layout processing occurs after Unicode preprocessing; otherwise, you can't even find the lexemes. If so, are all Unicode characters assumed to be the same width?

Unicode characters ***cannot in any way*** be considered as being of the same display width. Many characters have intrinsic width properties, like "halfwidth Katakana", "fullwidth ASCII", "ideographic space", "thin space", "zero width space", and so on (most of which are compatibility characters, i.e. present only for conversion reasons). But more importantly there are combining characters which "modify" a "base character". For instance Å (A with ring above) can be given as an A followed by a combining ring above, i.e. two Unicode characters. (For this and many others there is also a 'precomposed' character.) For many scripts vowels are combining characters. And there may be an indefinitely long (in principle; but three is a lot) sequence of combining characters after each non-combining character. What about bidirectional scripts? Especially the Arabic script, which is a cursive (joined) script, where in addition vowels are combining characters. Furthermore, Unicode characters in the "extended range" (no characters allocated yet) are encoded using two *non-character* 16-bit codes (when using UTF-16, which is the preferred encoding for Unicode). What would "Unicode preprocessing" be? UTF-16 decoding? Java-ish escape sequence decoding? ...

3) What does it mean that Char can include any Unicode character?

I think it *does not* mean that a Char can hold any Unicode character. I think it *does* mean that it can hold any single (UTF-16) 16-bit value. Which is something quite different. To store an arbitrary Unicode character 'straight off', one would need at least 21 bits to cover the UTF-16 range. ISO/IEC 10646-1 allows for up to 31 bits, but nobody(?) is planning to need all that. Some use 32-bit values to store Unicode characters.
Perfectly allowed by 10646, though not by Unicode proper. Following Unicode proper, one would always use a sequence of UTF-16 codes, in order to be able to treat a "user perceived character" as a single entity, both for UTF-16 reasons and for combining sequence reasons, independently of how the "user perceived character" was given as Unicode characters. /kent k PS Java gets some Unicode things wrong too, including that Java's UTF-8 encoding is non-conforming (to both Unicode 2.0 and ISO/IEC 10646-1 Amd. 2).
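The surrogate-pair arithmetic behind the UTF-16 discussion above is short enough to write out. A sketch per the Unicode standard's UTF-16 definition (the function names are mine); it handles supplementary code points in the range 0x10000..0x10FFFF:

```haskell
import Data.Bits (shiftR, (.&.))

-- Encode a supplementary code point (0x10000..0x10FFFF) as a UTF-16
-- surrogate pair (high surrogate in 0xD800..0xDBFF, low in 0xDC00..0xDFFF):
toSurrogates :: Int -> (Int, Int)
toSurrogates cp = (0xD800 + (v `shiftR` 10), 0xDC00 + (v .&. 0x3FF))
  where v = cp - 0x10000

-- And back again:
fromSurrogates :: Int -> Int -> Int
fromSurrogates hi lo = 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
```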
Re: Int overflow
This is my third resend of this message. Previous (partial?) failures appear to be due to the fact that "reply" cannot be used and/or MIME attachments cannot be used. Apologies to anyone seeing this message for the umpteenth time. (And this is the *only* mailing list that I have trouble with...) /Kent Karlsson

Dave Tweed wrote: agree. Surely the best idea is to do something equivalent to the IEEE floating point standard which defines certain returned bit patterns to mean `over/underflow occurred', etc. The programmer can then handle this either in the simple way of calling error, or try to carry on in some suitable way. In a similar way there could be a tainted bit pattern for overflow, perhaps with testing functions built into the prelude. This would be even more useful since tainted bit-patterns in further calculations are defined to produce a tainted bit-pattern result, so overflow needn't be explicitly tested for each atomic operation.

I would just like to point out that IEEE 754 (a.k.a. IEC 559) does **NOT** have any "tainted bit pattern"/"special value"/whatever-you-want-to-call-it for overflow. What IEC 559 DOES specify is: 1. There should be the values positive and negative infinity (it also specifies which bit patterns to use). These do **NOT** mean that there was an overflow. They may be exact values. Infinity arguments do NOT guarantee infinity results. E.g. 1/+infinity returns +0 (without any underflow or other 'notification'). Haskell: It would thus be an error [sic] to always call 'error' when an infinity is seen. 2. When rounding to nearest (and only then; other rounding modes are available) and overflow is *not* trapped, negative overflow returns negative infinity and positive overflow returns positive infinity. The default according to IEEE 754 is non-trapping. The default rounding is round-to-nearest. 3. When overflow is not trapped, an overflow sets a "sticky bit".
To get hold of the "sticky bits" (and maybe save or reset them) in Haskell may be difficult. They are intended for imperative handling. 4. There are "tainted bit patterns" ((quiet) NaNs) to be returned when an invalid operation occurred, e.g. 0/0, unless "invalid" is trapped. NaNs are propagated the way you suggest for almost all functions (there are suggestions to ignore NaNs in certain circumstances). B.t.w., 1/(-0), e.g., is not 'invalid'; it is a 'divide-by-zero' and returns -infinity. All of this is for floating point types and is commonly implemented. No *similar* standard exists for integer types. What does exist in terms of standards for int(eger) arithmetic on computers, ISO/IEC 10967-1:1994, Language Independent Arithmetic, part 1 (LIA-1), only specifies what is currently commonly implemented, and does not attempt to impose "new" requirements on integer types. Overflow checking is, unfortunately, optional, and there are no specifications for integer NaNs or integer infinities. (There is nothing stopping them either, but without hardware support, their implementation is likely to be comparatively slow.) That said, I think for "int" in Haskell overflow checking should be done for all the present "int" functions that "can" overflow (+, -, *, ^, ...). Special wrapping functions for some of these (call them, say: +:, -:, *:, ...) should be added (in some library) for those rare instances where wrapping is what is desired. Ada, by comparison, has separate types for "overflowing" integers (like 'Integer' or 'type Foo is range 0..2**16-1;'), where +, -, etc. overflow when appropriate, and "modulo" integers (like 'type Bar is mod 2**16;'), where the result of doing +, -, etc. is computed modulo the "size" of the type. /kent k PS There are no *signed* "wrapping" integer types in Ada, a.f.a.i.k. By "functions" in point 4, I was referring to certain "standard" functions that take floating point argument(s) and return a floating point result.
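The overflow-checked addition being argued for can be sketched in Haskell today, without library or hardware support. The name `checkedAdd` is mine; note that GHC's built-in `+` on `Int` silently wraps, which is exactly the behaviour the proposal would relegate to a separate `+:`.

```haskell
-- Overflow-checked addition on the fixed-width Int type:
checkedAdd :: Int -> Int -> Maybe Int
checkedAdd x y
  | y > 0 && x > maxBound - y = Nothing   -- positive overflow
  | y < 0 && x < minBound - y = Nothing   -- negative overflow
  | otherwise                 = Just (x + y)
```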
Re: Polymorphic recursion
Dear people interested in Haskell 1.3, Disclaimer: I'm *not* a member of any "Haskell 1.3" committee, if any such committee has been formed.

One modest extension we could make to the Haskell type system is to permit polymorphic recursion if a type signature is provided

I agree that this would be a good idea! Both for the reason you give and the reason below: Having done this change, one could (should!) remove section 4.5.1 (Dependency Analysis). This would have the consequence that some more type signatures may sometimes be required when using version 1.3 compared to using version 1.2. I don't think that would be too bad... To get consistency between implementations one should (instead!) require that a declaration group is type checked in its entirety, *not* splitting it up into smaller declaration groups, even when possible. The reason for removing section 4.5.1 (apart from the fact that I don't like it) is: Even though the split-up of let-expressions into declaration cliques can be expressed as a source code transformation, the same cannot be done for where-declarations (modules, classes, instances, value-declaration clauses, case-clauses). The latter two can be expressed as source code transformations, but only after doing other source transformations making them into let-expressions (including transforming away guards). These transformations may not be desirable in all implementations. In particular it may make it hard to produce good type error messages. For modules, classes (with default declarations), and instances the split into declaration cliques cannot be expressed as a *source* transformation. (Note that classes and instances already have type signatures, so there would be no need to add any extra type signatures in these cases.) So if we (you!)
permit the use of a polymorphic function at different, smaller or equal (rather than just equal), instances of the type of the function within the declaration group *if* a type signature is provided, then we can actually get a *simpler* type system! That is, the requirement to transform to declaration cliques before type checking can be removed, a transformation that cannot always be expressed as a source transformation. We get the added benefit of being able to write recursive functions for which the transformation to declaration cliques does not do the trick. Also, giving type signatures and checking that the types are a fixed point is more obviously correct than deriving some (non-greatest!) fixed-point types for a recursive declaration group. Deriving the *greatest* fixed point types would of course be ideal, if that had been decidable. But since it isn't (the greatest fixed point types may be infinite), I support Simon's proposal. /kent k
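The extension under discussion can be illustrated with the classic nested-datatype example (the names `Nested` and `size` are mine). The recursive call is at type `Nested [a]`, not `Nested a`, so plain Hindley-Milner inference cannot type it; with the signature provided, checking that the types form a fixed point is straightforward. This is exactly the signature-gated polymorphic recursion that later became standard in Haskell 98.

```haskell
-- A nested datatype: each level's elements are lists of the previous
-- level's element type.
data Nested a = Nil | Cons a (Nested [a])

-- The signature is *required*: the recursive call is at type Nested [a].
size :: Nested a -> Int
size Nil         = 0
size (Cons _ ns) = 1 + size ns
```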
Re: re. 1.3 cleanup: patterns in list comprehensions
Patterns and expressions can look very much alike. Could one possibly expand "exp" to "if exp" in Haskell 1.3 list comprehensions? Only to make deterministic parsing easier...

One should not make the parsing method too much influence the language design; PASCAL is a bad example for this.

True.

I once had the same problem when writing a Miranda-ish compiler. The simple solution is to use "exp" instead of "pat" in the parsing grammar (in qualifiers), and when it later (e.g. on encountering <-) becomes clear that the beast has got to be a pattern, you check it semantically. This works, because patterns are syntactically a subclass of expressions.

False. Patterns (in Haskell) also have "_", "~", and "@", which are not allowed in expressions. Using exp instead of pat (and adding "_", "~", and "@" to expressions) is a hack, not a proper solution. I don't like hacks.

So, I either have to massage the grammar into deterministic LR parsable form (difficult) or use a nondeterministic LR parser (not readily available).

This extra effort in parser hacking is a small, one-time effort, compared to the really hard stuff in the compiler!

True. I'm not going to insist on a change. /kent k
Re: re. 1.3 cleanup: patterns in list comprehensions
On the other hand, I think that the pat=expr syntax was something of a design error and it may not be supported in future releases. Judging from email that I've received, the similarity of == and = does cause confusion. In fact, it has also caught me on at least one occasion! (So yes, my experience is somewhat at odds with Nikhil's here.) As a result, Gofer 2.28 supports an alternative (and more general) syntax, with qualifiers of the form let {decls} and a semantics given by: [ e | let { decls } ] = [ let { decls } in e ] Parsing this doesn't cause any conflicts with standard Haskell syntax (as far as I can tell), and the braces can usually be omitted, so there isn't a big syntactic overhead.

Parsing Haskell list comprehensions deterministically ((LA)LR) is currently very hard, since both "pat <- exp" (or also "pat gd <- exp", as suggested by Thomas) and "exp" are allowed as qualifiers in a list comprehension. Patterns and expressions can look very much alike. Could one possibly expand "exp" to "if exp" in Haskell 1.3 list comprehensions? Only to make deterministic parsing easier... /kent k
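For the record, all three qualifier forms discussed here survive into modern Haskell, including Gofer's let-qualifier. A small sketch (function names are mine): a failing pattern in a generator filters rather than crashes, and `let` binds a local name for the rest of the comprehension.

```haskell
-- Generator with a (possibly failing) pattern: Nothing values are
-- silently skipped, per the list-comprehension semantics.
justs :: [Maybe a] -> [a]
justs xs = [ y | Just y <- xs ]

-- The let-qualifier, with semantics [ e | let decls ] = [ let decls in e ]:
squaresOver4 :: [Int] -> [Int]
squaresOver4 xs = [ sq | x <- xs, let sq = x * x, sq > 4 ]
```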
Re: + and -: syntax wars!
Oops, PreludeCore cannot be hidden. I guess I've made a fool of myself (but that happens often :-).

Can't we find anything more interesting to discuss than the syntax??

You are welcome to! :-) But sweeping syntax matters under the carpet does not improve anything.

| ... But what I find a bit strange is that even when + and - | are overridden locally, n+k and prefix - still have their old meanings. | Well, it's just one more exception to the rule to remember about Haskell.

Yes, but we need to emphasize that rebinding such operators is a Bad Idea. (Maybe Phil is right, that we should simply forbid it.)

I agree that it should be forbidden, not for the love of prohibitions, but in order to detect more errors in programs statically, and to avoid some quite unnecessary ways to muddle a Haskell program. But there are several degrees to which rebinding could be forbidden. Here are some of the alternatives (sorry if you find this confusing/confused :-): 1. Forbidding rebinding + and -. 2. Forbidding rebinding operators/function names exported from classes in PreludeCore. (Except in instance declarations, of course.) 3. Forbidding rebinding operators/function names declared by classes in scope. (Except...) 4. Forbidding rebinding any name exported by PreludeCore. 5. Forbidding rebinding any name in scope. I don't like singling out +, -, and PreludeCore more than necessary, so alternatives 3 (plus the remark below) and 5 are good candidates in my opinion. I still think that Lennart's quiz declaration should be illegal, at least on the grounds Paul gave (i.e., even if the name (+) is replaced by some other name): names bound by the "lhs"es (in each let/where declaration part) should not be allowed to be rebound by some argument pattern within one of the "funlhs"es in the declaration. Syntactically confused /kent k
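Two legal-but-muddling rebindings of the kind being argued against can be written down directly. This toy code is mine, not Lennart's original quiz; the second example is precisely the lhs-name-rebound-by-argument-pattern situation described above.

```haskell
-- Shadowing Prelude's (+) with a let binding:
shadowed :: Int
shadowed = let (+) = (*) in 2 + 3    -- 6, not 5

-- Rebinding via an argument pattern: inside f's equation, '+' names
-- f's argument, so x + x here actually means x * x.
rebound :: Int -> Int
rebound x = f (*)
  where f (+) = x + x
```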
Re: Division, remainder, and rounding functions
Thanks Joe! I still don't know why anyone would want the 'divTruncateRem' function and its derivatives, but ok, leave them there. Why not add division with "rounding" AWAY from zero as well? :-) /kent k (I've sent some detail comments directly to Joe.)
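For context on the rounding variants being discussed, Haskell's two built-in division pairs already differ in rounding direction: `div`/`mod` round toward negative infinity, while `quot`/`rem` truncate toward zero (the "divTruncateRem" behaviour). They only differ for negative operands; the example values here are mine.

```haskell
-- div/mod floor; quot/rem truncate.  Compare on a negative dividend:
divVsQuot :: (Int, Int, Int, Int)
divVsQuot = ((-7) `div` 2, (-7) `quot` 2, (-7) `mod` 2, (-7) `rem` 2)
-- = (-4, -3, 1, -1)
```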