Hi all,

Thomas Schilling wrote:

> OK, I agree that breaking text books is a big deal. On the other
> hand, the lack of a good Text data type forced text books to teach bad
> approaches to dealing with strings. Haskell should do better.

As far as I know, none of the introductory Haskell text books has the ambition of teaching serious text processing in Haskell. And what they do for simple text processing for the purpose of illustration is no worse than what one typically would do in, say, an introduction to programming using any language, like C or Java. So I don't buy that argument per se.

But I do agree, of course, that a good library for text processing, with adequate language support for making it convenient to use, is important.

> Johan mentioned both semantic and performance problems with Strings.
> A part he didn't stress is that Strings are also a horribly
> memory-inefficient way of storing strings. On 64 bit GHC systems a
> single ASCII character needs 16 bytes of memory (i.e., an overhead of
> 16x). A non-ASCII character (ord c > 255) actually requires 32 bytes.
> (This is due to a de-duplication optimisation in the GHC GC). Other
> implementations may do better, but an abstract type would still be
> better to enable more freedom for implementors.

Sure it's inefficient. I doubt the above is news to anyone on this list. The point, though, is that once we're at the level of applications, in most cases, this inefficiency is negligible. And in the cases where it is not, the programmer will be well aware of this and pick a better representation, or will learn about it the hard way and be forced to pick a better representation. Just as with processing of significant amounts of *any* data.

It simply isn't the case that the Haskell world magically would be significantly better off in terms of performance if only everyone was forced to use something like Text instead of String = [Char].

Moreover, the above analysis is unnecessarily pessimistic for one (somewhat important) case: string literals.
Thanks to Haskell being lazy, it is very easy for an implementor, if one really worries, to arrange that string literals are stored very compactly in a binary, only to be materialized when (and if) actually used. (I did just that years ago in the Freja compiler: memory was significantly smaller in those days, so I did worry :-)

> Correct handling of unicode strings is a Hard Problem and String =
> [Char] is only better if you ignore all the issues (which is certainly
> fine in a teaching environment).

Yes. Unicode is unfortunately (partly but not exclusively out of necessity) very complicated. I doubt one would want to discuss this in depth in any introductory programming course.

My point was that String = [Char] is fine as far as it goes. Not that it should be the basis for serious string processing libraries.

> I would be happy to have a simplistic String = [Char] coexist with a
> Text type if it weren't for the problem that so many things are biased
> towards String. E.g., error takes a String,

Yes. That's a bias. But is it a problem? Here we're just talking about getting a sequence of (possibly unicode) characters to stderr.

> Show is used everywhere and produces strings,

Show and Read are mainly used for simplistic serialisation and deserialisation. When people really care, they tend to use more refined approaches, e.g. proper scanners and parsers, or binary I/O. So again, while there is certainly a bias, it doesn't seem like a genuine problem in most cases.

I can possibly see issues for conversion between e.g. built-in numeric types and various string representations, but I can't see why solving those would necessitate getting rid of String = [Char]. Read and Show could be overloaded on the string type, for example (at least given multi-parameter type classes), and/or a bit of compiler optimization ought to be enough to dispatch such uses of "read" and "show" to appropriate primitives of e.g. the Text library anyway.

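To make the idea of overloading Show on the string type a bit more concrete, here is a minimal sketch. The class name ShowAs and its method showAs are made up for illustration and are not part of any standard or library; the sketch assumes GHC with the text package available, and it simply goes via the existing show rather than dispatching to any Text primitive.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}
module Main where

import qualified Data.Text as T

-- Hypothetical class (not in any library): 'ShowAs s a' renders a
-- value of type 'a' in the string representation 's'.
class ShowAs s a where
  showAs :: a -> s

-- Rendering to the ordinary String = [Char] just reuses 'show'.
instance ShowAs String Int where
  showAs = show

-- Rendering to Text. This sketch goes via String; a real library
-- could dispatch to a primitive that builds the Text directly.
instance ShowAs T.Text Int where
  showAs = T.pack . show

main :: IO ()
main = do
  -- The expected result type selects the instance.
  putStrLn (showAs (42 :: Int))             -- prints 42
  putStrLn (T.unpack (showAs (42 :: Int)))  -- prints 42
```

The result type at the use site picks the representation, so existing String-based code would not have to change.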
> the pretty printing library uses Strings,

But that is a library issue, not a language issue.

> Read parses Strings.

See above. The special status of Read and Show is questionable anyway. It will hopefully be possible at some point to implement those completely as libraries.

So I'm not overly swayed by the argument of language bias in those cases.

> As I said, while I'm not a huge fan of having two String types
> co-exist, I could accept it as a necessary trade-off to keep text
> books valid and preserve backwards compatibility.

While an undue proliferation of string types would be unfortunate, compared with the plethora of other representational choices one is faced with when it comes to e.g. numeric types, arrays, maps, etc., a couple of string types doesn't seem like a big deal, especially not if one is designated the default choice for any program that will do non-trivial text processing or aims at doing internationalisation properly.

> (There are also other issues with String. For example, you can't
> write an instance MyClass String in Haskell2010, and even with GHC
> extensions it seems wrong and you often end up writing instances that
> overlap with MyClass [a].) I'm using Data.Text a lot, so I can work
> around the issue, but unfortunately you run into a lot of issues
> where the standard library forces the use of String, and that, I
> believe, is wrong.
>
> If changing the standard library is the bigger issue, however, then
> I'm not sure whether this discussion needs to take place on the
> haskell-prime list or on the libraries list.

Indeed. Maybe all that's really needed at the language level is to standardize overloading of string literals? (In a way that avoids issues like the ones described above.)

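For reference, GHC already offers overloaded string literals as an extension (OverloadedStrings, built on the IsString class from Data.String). A minimal sketch of what that looks like follows; the Name type and the greet function are made up for illustration and merely stand in for a more compact text type.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.String (IsString (..))

-- Hypothetical stand-in for a non-list text type.
newtype Name = Name String
  deriving (Eq, Show)

-- With OverloadedStrings, a literal "..." has type
-- 'IsString a => a' and is converted via 'fromString'.
instance IsString Name where
  fromString = Name

greet :: Name -> String
greet (Name n) = "Hello, " ++ n

main :: IO ()
main = putStrLn (greet "Henrik")  -- the literal is a Name here
```

Whether this particular design is the right one to standardize is of course exactly the kind of question that would need discussing.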
Best,

/Henrik

--
Henrik Nilsson
School of Computer Science
The University of Nottingham
n...@cs.nott.ac.uk

_______________________________________________
Haskell-prime mailing list
Haskell-prime@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-prime