Ketil Malde wrote:

> I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
> 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
> RAM, UTF-16 will be slower than UTF-8...
I don't think the genome is typical text. And I doubt that is true if that
text is in a CJK language.

> I think that *IF* we are aiming for a single, grand, unified text
> library to Rule Them All, it needs to use UTF-8.

Given the growth rate of China's economy, if CJK isn't already the majority
of text being processed in the world, it will be soon. I have seen media
reports claiming CJK is now a majority of text data going over the wire on
the web, though I haven't seen anything scientific backing up those claims.
It certainly seems reasonable. I believe Google's measurements based on
their own web index showing wide adoption of UTF-8 are very badly skewed
due to a strong Western bias.

In that case, if we have to pick one encoding for Data.Text, UTF-16 is
likely to be a better choice than UTF-8, especially if the cost is fairly
low even for the special case of Western languages.

Also, UTF-16 has become by far the dominant internal text format for most
software and for most user platforms. Except on desktop Linux - and whether
we like it or not, Linux desktops will remain a tiny minority for the
foreseeable future.

> Alternatively, we
> can have different libraries with different representations for
> different purposes, where you'll get another few percent of juice by
> switching to the most appropriate.
>
> Currently the latter approach looks to be in favor, so if we can't have
> one single library, let us at least aim for a set of libraries with
> consistent interfaces and optimal performance. Data.Text is great for
> UTF-16, and I'd like to have something similar for UTF-8. Is all I'm
> trying to say.

I agree.

Thanks,
Yitz
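
P.S. To put rough numbers on the size argument above, here is a minimal
sketch (my own illustration, not anything from Data.Text's internals) that
uses the public Data.Text.Encoding API to count how many bytes the same
text occupies when serialised as UTF-8 and as UTF-16. The sample strings
are just examples I picked.

    {-# LANGUAGE OverloadedStrings #-}

    import qualified Data.ByteString    as B
    import qualified Data.Text          as T
    import qualified Data.Text.Encoding as TE

    -- Byte counts of a Text value when encoded as UTF-8 and as UTF-16LE.
    encodedSizes :: T.Text -> (Int, Int)
    encodedSizes t = (B.length (TE.encodeUtf8 t), B.length (TE.encodeUtf16LE t))

    main :: IO ()
    main = do
      -- Western/ASCII text: 1 byte per char in UTF-8, 2 in UTF-16.
      print (encodedSizes "hello world")   -- (11, 22)
      -- CJK text (BMP code points): 3 bytes per char in UTF-8, 2 in UTF-16.
      print (encodedSizes "你好世界")       -- (12, 8)

So for CJK-heavy data UTF-16 is the more compact of the two, while for
ASCII-heavy data UTF-8 wins, which is the trade-off being argued about here.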