Hi Michael, here is a website: http://zh.wikipedia.org/zh-cn/. It is the Chinese-language Wikipedia.
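In case it helps with the comparison you proposed, here is a minimal Python sketch of the measurement: encode the same text both ways and compare byte counts. The function name `encoded_sizes` and the sample string are made up for illustration. Fetching and decoding a real page is left out, so this only shows the counting step.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for the same text."""
    utf8_len = len(text.encode("utf-8"))
    # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" would prepend.
    utf16_len = len(text.encode("utf-16-le"))
    return utf8_len, utf16_len


# Hypothetical sample: ASCII markup wrapped around a short Chinese phrase.
sample = '<p class="intro">维基百科</p>'

utf8_len, utf16_len = encoded_sizes(sample)
print(f"UTF-8: {utf8_len} bytes, UTF-16: {utf16_len} bytes")
```

For a real test, the decoded body of a page would be passed in place of `sample`. The result depends on how much of the page is ASCII markup versus CJK text.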
-Andrew

On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman <mich...@snoyman.com> wrote:

> On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale <g...@sefer.org> wrote:
>
>> Ketil Malde wrote:
>> > I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
>> > 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
>> > RAM, UTF-16 will be slower than UTF-8...
>>
>> I don't think the genome is typical text. And
>> I doubt that is true if that text is in a CJK language.
>>
>> > I think that *IF* we are aiming for a single, grand, unified text
>> > library to Rule Them All, it needs to use UTF-8.
>>
>> Given the growth rate of China's economy, if CJK isn't
>> already the majority of text being processed in the world,
>> it will be soon. I have seen media reports claiming CJK is
>> now a majority of text data going over the wire on the web,
>> though I haven't seen anything scientific backing up those claims.
>> It certainly seems reasonable. I believe Google's measurements
>> based on their own web index showing wide adoption of UTF-8
>> are very badly skewed due to a strong Western bias.
>>
>> In that case, if we have to pick one encoding for Data.Text,
>> UTF-16 is likely to be a better choice than UTF-8, especially
>> if the cost is fairly low even for the special case of Western
>> languages. Also, UTF-16 has become by far the dominant internal
>> text format for most software and for most user platforms.
>> Except on desktop Linux - and whether we like it or not, Linux
>> desktops will remain a tiny minority for the foreseeable future.
>
> I think you are conflating two points here, and ignoring some important
> data. Regarding the data: you haven't actually quoted any statistics about
> the prevalence of CJK data, but even if the majority of web pages served are
> in those three languages, a fairly high percentage of the content will
> *still* be ASCII, due simply to the HTML, CSS and Javascript overhead. I'd
> hate to make up statistics on the spot, especially when I don't have any
> numbers from you to compare them with.
>
> As far as the conflation, there are two questions with regard to the
> encoding choice: encoding/decoding time and space usage. I don't think
> *anyone* is asserting that UTF-16 is a common encoding for files anywhere,
> so by using UTF-16 we are simply incurring an overhead in every case. We
> can't consider a CJK encoding for text, so its prevalence is irrelevant to
> this topic. What *is* relevant is that a very large percentage of web pages
> *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by
> default UTF-8.
>
> As far as space usage, you are correct that CJK data will take up more
> memory in UTF-8 than UTF-16. The question still remains whether the overall
> document size will be larger: I'd be interested in taking a random sampling
> of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
> think simply talking about this in the vacuum of data is pointless. If
> anyone can recommend a CJK website which would be considered representative
> (or a few), I'll do the test myself.
>
> Michael
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe