On Tue, Aug 17, 2010 at 3:23 PM, Yitzchak Gale <g...@sefer.org> wrote:
> Michael Snoyman wrote:
> > Regarding the data: you haven't actually quoted any
> > statistics about the prevalence of CJK data
>
> True, I haven't seen any - except for Google, which
> I don't believe is accurate. I would like to see some
> good unbiased data.
>
> Right now we just have our intuitions based on anecdotal
> evidence and whatever years of experience we have in IT.
>
> For the anecdotal evidence, I really wish that people from
> CJK countries were better represented in this discussion.
> Unfortunately, Haskell is less prevalent in CJK countries,
> and there is somewhat of a language barrier.
>
> > I'd hate to make up statistics on the spot, especially when
> > I don't have any numbers from you to compare them with.
>
> I agree, I wish we had better numbers.
>
> > even if the majority of web pages served are
> > in those three languages, a fairly high percentage
> > of the content will *still* be ASCII, due simply to the HTML,
> > CSS and Javascript overhead...
> > As far as space usage, you are correct that CJK data will take up
> > more memory in UTF-8 than UTF-16. The question still remains whether
> > the overall document size will be larger: I'd be interested in taking
> > a random sampling of CJK-encoded pages and comparing their UTF-8 and
> > UTF-16 file sizes. I think simply talking about this in the vacuum of
> > data is pointless. If anyone can recommend a CJK website which would
> > be considered representative (or a few), I'll do the test myself.
>
> Again, I agree that some real data would be great.
>
> The problem is, I'm not sure if there is anyone in this discussion
> who is qualified to come up with anything even close to a fair
> random sampling or a CJK website that is representative.
> As far as I can tell, most of us participating in this discussion
> have absolutely zero perspective of what computing is like
> in CJK countries.

I won't call this a scientific study by any stretch of the imagination,
but I did a quick test on the www.qq.com homepage. The original file
encoding was GB2312; here are the file sizes:

GB2312: 193014
UTF-8:  200044
UTF-16: 371938

> > As far as the conflation, there are two questions
> > with regard to the encoding choice: encoding/decoding time
> > and space usage.
>
> No, there is a third: using an API that results in robust, readable
> and maintainable code even in the face of changing encoding
> requirements. Unless you have proof that the difference in
> performance between that API and an API with a hard-wired
> encoding is the factor that is causing your particular application
> to fail to meet its requirements, the hard-wired approach
> is guilty of aggravated premature optimization.
>
> So for example, UTF-8 is an important option
> to have in a web toolkit. But if that's the only option, that
> web toolkit shouldn't be considered a general-purpose one
> in my opinion.

I'm not talking about API changes here; the topic at hand is the internal
representation of the stream of characters used by the text package. That
representation is currently UTF-16; I would argue for switching to UTF-8.
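To make the trade-off concrete - and to let anyone repeat the kind of size
comparison I did on the qq.com page for sites of their own choosing - here
is a rough, untested sketch. It assumes the input file has already been
converted to UTF-8 (the GB2312 original would first need to go through
iconv or text-icu), and it only measures encoded stream sizes, not heap
overhead:

import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE
import System.Environment (getArgs)

-- Print the size of a document when encoded as UTF-8 and as UTF-16.
-- Assumes the file named on the command line is valid UTF-8.
main :: IO ()
main = do
  [path] <- getArgs
  bytes <- B.readFile path
  let t = TE.decodeUtf8 bytes
  putStrLn $ "UTF-8:  " ++ show (B.length (TE.encodeUtf8 t)) ++ " bytes"
  putStrLn $ "UTF-16: " ++ show (B.length (TE.encodeUtf16LE t)) ++ " bytes"

Fed the UTF-8 version of a mostly-CJK page, it should show the same general
pattern as the numbers above.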
> > I don't think *anyone* is asserting that
> > UTF-16 is a common encoding for files anywhere,
> > so by using UTF-16 we are simply incurring an overhead
> > in every case.
>
> Well, to start with, all MS Word documents are in UTF-16.
> There are a few of those around I think. Most applications -
> in some sense of "most" - store text in UTF-16
>
> Again, without any data, my intuition tells me that
> most of the text data stored in the world's files are in
> UTF-16. There is currently not much Haskell code
> that reads those formats directly, but I think that will
> be changing as usage of Haskell in the real world
> picks up.

I was referring to text files, not binary files with text embedded within
them. While we might use the text package to deal with the data from a
Word doc once in memory, we would almost certainly need to use ByteString
(or binary perhaps) to actually parse the file. But at the end of the day,
you're right: there would be an encoding penalty at a certain point, just
not on the entire file.

> > We can't consider a CJK encoding for text,
>
> Not as a default, certainly not as the only option. But
> nice to have as a choice.

I think you're missing the point at hand: I don't think *anyone* is opposed
to offering encoders/decoders for all the multitude of encoding types out
there. In fact, I believe the text-icu package already supports every
encoding type under discussion. The question is the internal representation
for text, for which a language-specific encoding is *not* a choice, since
it does not support all Unicode code points.

Michael
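P.S. Since I mentioned text-icu: getting from a legacy CJK encoding into
text's internal representation (whatever that representation ends up being)
is already only a couple of lines. A rough, untested sketch from memory -
the file name is made up, and I'm assuming text-icu's Data.Text.ICU.Convert
interface (open / toUnicode):

import qualified Data.ByteString as B
import qualified Data.Text.IO as TIO
import qualified Data.Text.ICU.Convert as ICU  -- from the text-icu package

-- Decode GB2312 bytes into a Text value and print it. The internal
-- representation of the Text (UTF-16 today, UTF-8 if that ever changes)
-- is invisible at this level.
main :: IO ()
main = do
  conv  <- ICU.open "GB2312" Nothing      -- Nothing = default fallback behaviour
  bytes <- B.readFile "qq-homepage.html"  -- hypothetical saved copy of the page
  TIO.putStrLn (ICU.toUnicode conv bytes)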
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe