Hi,

Thanks Hans, Qifan, Eric, and Dave for your responses; they were very detailed 
and professional.

Firstly, I made a mistake in claiming that 'abc你我' would take 15 bytes of space 
if the column type is varchar(n chars) character set utf8. I did some testing 
and it proved that I was wrong. Thanks for correcting my understanding.

>>create table test5( a varchar(20 bytes) character set utf8, b varchar(20 
>>chars) character set utf8);
>>insert into test5 values('abcde','abcde');
>>insert into test5 values('abc你我','abc你我');
>>select char_length(a),char_length(b) from test5;
(EXPR)      (EXPR)
----------  ----------
         5           5
         9           9

Secondly, as you said, for char(x) character set utf8 the disk storage is 
actually 4 bytes per character, not 3. However, in the test above each Chinese 
character consumed only 3 bytes, so why do we allow up to 4 bytes of disk 
storage per character? Is the extra byte used for NULL values?


Best regards,
Yuan

-----Original Message-----
From: Hans Zeller [mailto:hans.zel...@esgyn.com] 
Sent: Thursday, December 07, 2017 2:02 AM
To: dev@trafodion.incubator.apache.org
Subject: RE: Is varchar(300 bytes) character set utf8 better than varchar(100 
chars) character set utf8?

Hi,

Yes, I would say that "varchar(300 bytes) character set utf8" is preferable 
over "varchar(100 chars) character set utf8".

Two reasons: First, the former can store more information; you can fit 300 
ASCII characters, while the latter can fit only 100. Second, Trafodion actually 
assumes a maximum length of 4 bytes for a UTF-8 character, so a varchar(100 
chars) has a maximum length of 400 bytes. That's different from Oracle, which 
assumes a maximum length of 3 bytes. Trafodion supports the full Unicode 
character set, not just the Basic Multilingual Plane (BMP).
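That 4-byte assumption is easy to verify: BMP characters such as '你' need at 
most 3 bytes in UTF-8, but supplementary-plane characters (for example, many 
emoji) need 4. A quick illustrative Python check (not Trafodion code):

```python
# UTF-8 byte lengths by code point range:
#   U+0000..U+007F    -> 1 byte  (ASCII)
#   U+0080..U+07FF    -> 2 bytes
#   U+0800..U+FFFF    -> 3 bytes (rest of the BMP, incl. most Chinese)
#   U+10000..U+10FFFF -> 4 bytes (supplementary planes)
for ch in ("a", "é", "你", "\U0001F600"):  # the last is an emoji, U+1F600
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s)")
```

So any system that reserves the maximum space per character for full Unicode 
support must budget 4 bytes, even though BMP text never uses the fourth.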

This is especially important when you define key columns, because in the HBase 
key, a varchar column is encoded with its maximum length.

Thanks,

Hans

-----Original Message-----
From: Dave Birdsall [mailto:dave.birds...@esgyn.com] 
Sent: Wednesday, December 6, 2017 9:05 AM
To: dev@trafodion.incubator.apache.org
Subject: RE: Is varchar(300 bytes) character set utf8 better than varchar(100 
chars) character set utf8?

Hi,

My understanding is that we store varchar(whatever) character set utf8 as utf8 
characters. So, English characters would take 1 byte while Chinese characters 
would take 3 bytes. 
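As a quick illustration of those per-character sizes (plain Python, not 
Trafodion code):

```python
# ASCII letters encode to 1 byte each in UTF-8; common Chinese
# characters encode to 3 bytes each.
for ch in ("a", "你"):
    print(ch, len(ch.encode("utf-8")))

# A mixed string is the sum of its parts: 3*1 + 2*3 = 9 bytes.
print(len("abc你我".encode("utf-8")))  # 9
```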

Though in the old days working with Japanese Kanji character sets, I remember 
that there were "wide" versions of the English characters (so, in the SJIS 
character set, for example, there was a single-byte 'A' character as well as a 
two-byte wide 'A' character; these were different characters from the point of 
view of the character set); I don't know if UTF-8 has "wide" versions of 
English characters. I suspect not.

So, the maximum length of a string in both varchar(100 chars) character set 
utf8 and varchar(300 bytes) character set utf8 is 300 bytes.

There are some overheads with varchar(100 chars) that you don't have with 
varchar(300 bytes). Suppose you have a string of 200 English characters (say, 
'aaaaaa....', 200 'a' characters repeated). You should get an error when you 
try to store this in varchar(100 chars) but not if you store it in varchar(300 
bytes). Getting an error in the first case involves counting characters, which 
requires a scan of the whole string. In the second case, we do not need to 
count characters.
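A rough Python sketch of that difference (illustrative only; Trafodion's actual 
length checks live in its executor): the byte-limit check is a plain length 
comparison, while the char-limit check has to walk the whole string.

```python
s = "a" * 200             # 200 ASCII characters
data = s.encode("utf-8")  # 200 bytes

# Byte-limit check: compare the stored byte length, no scan needed.
print(len(data) <= 300)   # True: fits varchar(300 bytes)

# Char-limit check: the bytes must be decoded (a full scan) to count
# characters before the limit can be enforced.
print(len(data.decode("utf-8")) <= 100)  # False: rejected by varchar(100 chars)
```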

Dave



-----Original Message-----
From: Liu, Yuan (Yuan) [mailto:yuan....@esgyn.cn] 
Sent: Tuesday, December 5, 2017 11:48 PM
To: dev@trafodion.incubator.apache.org
Subject: Is varchar(300 bytes) character set utf8 better than varchar(100 
chars) character set utf8?

Hi Trafodioneers,

I have a column that needs to store 1 to 100 characters. The characters may 
include both English characters (such as 'a', 'b', ...) and Chinese characters 
(such as '你', '我', ...).
So I could define the column type as varchar(100 chars) character set utf8; in 
that case each character, no matter whether it is English or Chinese, would 
take 3 bytes of space. For example, 'abc你我' would consume 15 bytes.
So I considered changing the column type to varchar(100 bytes) character set 
utf8; in that case 'abc你我' would consume only 9 bytes. However, a string of 34 
or more Chinese characters could not be inserted, because 34 Chinese characters 
take more than 100 bytes. To cover the worst case, I would have to define the 
column type as varchar(300 bytes) character set utf8.
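The 100-byte cutoff can be verified quickly (plain Python, illustrative only):

```python
# Each of these Chinese characters takes 3 bytes in UTF-8.
print(len(("你" * 33).encode("utf-8")))  # 99  -> fits varchar(100 bytes)
print(len(("你" * 34).encode("utf-8")))  # 102 -> too long, insert rejected
```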

In the general case, I think varchar(300 bytes) character set utf8 is better 
than varchar(100 chars) character set utf8 when not all characters are Chinese. 
Is that true?

Best regards,
Yuan
