Hi,

Thanks Hans, Qifan, Eric, and Dave for your responses; very detailed and professional answers.
Firstly, I think I made a mistake in saying that 'abc你我' takes 15 bytes of space if the column type is varchar(n chars) character set utf8. I did some tests and they proved that I was totally wrong. Thanks for correcting my understanding.

>>create table test5( a varchar(20 bytes) character set utf8, b varchar(20
>>chars) character set utf8);
>>insert into test5 values('abcde','abcde');
>>insert into test5 values('abc你我','abc你我');
>>select char_length(a),char_length(b) from test5;

(EXPR)      (EXPR)
----------  ----------
         5           5
         9           9

Secondly, as you said, for char(x) character set utf8 the disk storage is actually 4 bytes per character, not 3. However, in the test above each Chinese character consumes only 3 bytes, so why is the disk storage designed for up to 4 bytes per character? Is the extra byte used for NULL values?

Best regards,
Yuan

-----Original Message-----
From: Hans Zeller [mailto:hans.zel...@esgyn.com]
Sent: Thursday, December 07, 2017 2:02 AM
To: dev@trafodion.incubator.apache.org
Subject: RE: Is varchar(300 bytes) character set utf8 better than varchar(100 chars) character set utf8?

Hi,

Yes, I would say that "varchar(300 bytes) character set utf8" is preferable to "varchar(100 chars) character set utf8", for two reasons. First, the former can store more information: it can fit 300 ASCII characters, while the latter can fit only 100. Second, Trafodion actually assumes a maximum length of 4 bytes for a UTF-8 character, so a varchar(100 chars) column has a maximum length of 400 bytes. That's different from Oracle, which assumes a maximum length of 3 bytes. Trafodion supports the full Unicode character set, not just the Basic Multilingual Plane (BMP). This is especially important when you define key columns, because in the HBase key, a varchar column is encoded at its maximal length.
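Hans's 4-byte figure can be checked outside Trafodion. A minimal Python sketch (plain `str.encode`, not Trafodion code): characters inside the BMP, such as '你', encode to at most 3 bytes in UTF-8, but characters outside the BMP need 4, which is why a varchar(100 chars) column must allow for 400 bytes.

```python
# Sketch (plain Python, not Trafodion): the UTF-8 byte lengths behind
# the discussion above.
s = 'abc你我'
print(len(s))                  # 5 characters
print(len(s.encode('utf-8')))  # 9 bytes: 3 x 1 (ASCII) + 2 x 3 (BMP Chinese)

bmp_char = '\u4f60'       # '你', inside the Basic Multilingual Plane
supp_char = '\U00020000'  # U+20000, CJK Extension B, outside the BMP
print(len(bmp_char.encode('utf-8')))   # 3 bytes
print(len(supp_char.encode('utf-8')))  # 4 bytes: the worst case that
                                       # varchar(n chars) must be sized for
```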
Thanks,
Hans

-----Original Message-----
From: Dave Birdsall [mailto:dave.birds...@esgyn.com]
Sent: Wednesday, December 6, 2017 9:05 AM
To: dev@trafodion.incubator.apache.org
Subject: RE: Is varchar(300 bytes) character set utf8 better than varchar(100 chars) character set utf8?

Hi,

My understanding is that we store varchar(whatever) character set utf8 as UTF-8 characters, so English characters take 1 byte while Chinese characters take 3 bytes. In the old days, working with Japanese Kanji character sets, I remember there were "wide" versions of the English characters (in the SJIS character set, for example, there was a single-byte 'A' character as well as a two-byte wide 'A' character; these were different characters from the point of view of the character set). I don't know whether UTF-8 has "wide" versions of English characters; I suspect not.

So the maximum length of a string in both varchar(100 chars) character set utf8 and varchar(300 bytes) character set utf8 is 300 bytes. However, there are some overheads with varchar(100 chars) that you don't have with varchar(300 bytes). Suppose you have a string of 200 English characters (say 'aaaaaa...', the character 'a' repeated 200 times). You should get an error when you try to store this in varchar(100 chars), but not when you store it in varchar(300 bytes). Raising the error in the first case involves counting characters, which requires a scan of the whole string; in the second case, we do not need to count characters.

Dave

-----Original Message-----
From: Liu, Yuan (Yuan) [mailto:yuan....@esgyn.cn]
Sent: Tuesday, December 5, 2017 11:48 PM
To: dev@trafodion.incubator.apache.org
Subject: Is varchar(300 bytes) character set utf8 better than varchar(100 chars) character set utf8?

Hi Trafodioneers,

I have a column that needs to store 1 to 100 characters. The characters may include both English characters (such as 'a', 'b', ...) and Chinese characters (such as '你', '我', ...).
I could define the column type as varchar(100 chars) character set utf8; in that case, each character, no matter whether it is English or Chinese, will take 3 bytes of space. For example, 'abc你我' would consume 15 bytes. So I considered changing the column type to varchar(100 bytes) character set utf8, in which case 'abc你我' would consume only 9 bytes. However, strings with more than 33 Chinese characters could not be inserted, because 34 Chinese characters take more than 100 bytes. To cover the worst case, I would have to define the column type as varchar(300 bytes) character set utf8.

In general, I think varchar(300 bytes) character set utf8 is better than varchar(100 chars) character set utf8 when not all characters are Chinese. Is that true?

Best regards,
Yuan
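The byte arithmetic in this thread, and Dave's observation that a bytes limit is cheaper to enforce than a chars limit, can both be sketched in Python (an illustration only; the two helper functions are hypothetical, not Trafodion's actual checks):

```python
# Sketch (plain Python): why 34 BMP Chinese characters overflow
# varchar(100 bytes), and why a byte limit is cheaper to check than a
# character limit.
def fits_byte_limit(encoded: bytes, max_bytes: int) -> bool:
    # Needs only the stored byte length; no scan of the content.
    return len(encoded) <= max_bytes

def fits_char_limit(encoded: bytes, max_chars: int) -> bool:
    # Must decode (i.e. scan) the whole string to count characters.
    return len(encoded.decode('utf-8')) <= max_chars

print(len(('你' * 33).encode('utf-8')))  # 99 bytes: fits varchar(100 bytes)
print(len(('你' * 34).encode('utf-8')))  # 102 bytes: rejected

ascii_200 = ('a' * 200).encode('utf-8')
print(fits_byte_limit(ascii_200, 300))  # True: 200 bytes <= 300
print(fits_char_limit(ascii_200, 100))  # False: 200 characters > 100
```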