Hermann Himmelbauer wrote:
>
> On Wednesday 27 October 2004 08:35, Zabach, Elke wrote:
> > Hermann Himmelbauer wrote:
> > > Hi,
> > > I need to store multibyte characters into my database (MaxDB). To my
> > > mind, there are two approaches:
> > >
> > > 1) Store everything in UTF-8 format in normal varchars. I don't know
> > > if MaxDB will have problems with this; probably there will be errors,
> > > such as the string length being reported incorrectly.
> > > 2) Store everything in the 2-byte Unicode format - but this will
> > > nearly double the size of my database; moreover, I don't know if I
> > > will have to convert every string from/to UTF-8 to/from Unicode.
> > >
> > > What's the appropriate solution?
> >
> > It is (as always) depending
> > - on the number of characters not being ASCII-7-bit, but being stored
> >   with 3 or more bytes in UTF8. You said that storing in UCS2 will
> >   nearly double the size, so we can assume ASCII-7-bit is the main
> >   thing and only few characters will need more than one byte even in
> >   UTF8.
> > - on the functions / qualifications you need. With UTF8 stored in
> >   (VAR)CHAR(n) BYTE you will not be able to handle LENGTH, SUBSTR,
> >   (x)FILL, (x)TRIM, and so on, or LIKE correctly. Sorting could do,
> >   perhaps even in a way you can live with.
> > - on the knowledge about the data you have. If you want to store at
> >   maximum n characters (no matter if ASCII or UNICODE), you will have
> >   to specify (VAR)CHAR(n). But if you want to store n UTF8 characters
> >   in a (VAR)CHAR(x) BYTE column, which value is needed for x (not to
> >   be too small if several characters need more than just one byte)?
> >   BTW: the database then will not be able to check if more than n
> >   characters are stored in this value.
> > - (VAR)CHAR(n) BYTE cannot be compared to (VAR)CHAR(n) ASCII/UNICODE;
> >   but on the other hand, storing UTF8 in (VAR)CHAR(n) ASCII is not the
> >   recommended way of using it, although I do not know any knockout
> >   criteria against it.
> > - For storing as UTF8 no _UNICODE=YES instance is needed, meaning that
> >   no unicode storage of the catalog (info about tables, columns, views,
> >   users, ...) is done.
> > - If a record needs, for example, half the space in UTF8 that it does
> >   in UCS2, double the number of records can be stored in one page,
> >   meaning fewer pages are needed for the same number of records,
> >   meaning less I/O and a smaller data cache needed for all records if
> >   your search strategies are so bad as to need a scan.
> >
> > ==> If you are sure that you now and ever will only need a stupid
> > container for your data and all handling (functions and so on, LIKE)
> > will be done in the application, then maybe (although storage is not
> > too expensive) storing as UTF8 is a good decision. If you really know
> > this...
>
> Many thanks for your quick answer!
>
> So it seems best to store my strings in the Unicode (UCS-2) format.
> Storage space is not a big issue.
>
> To explain my situation, I already have a MaxDB database running in ASCII
> mode. Characters are now encoded using the latin-1 character set. As our
> application will become international, we have to switch to Unicode.
>
> At first I need to convert the data from latin-1 to unicode - can this
> be done with MaxDB (e.g. with the loader utility) or do I have to do
> the conversion from 1-byte ASCII to 2-byte UCS by hand (e.g. python
> script)?
The loader will do the job.
>
> Moreover I access the database via ODBC (Windows and Linux) and via
> the python interface. Is unicode supported on these call interfaces?
Yes for ODBC.
> I read somewhere
> that there are problems with Unicode on Linux ODBC - is this true?
Yes. The problem is the size of SQLWCHAR. We understand it as 2 bytes,
but on Unix/Linux wchar_t usually has 4 bytes. You have to translate
between both encodings when using MaxDB ODBC if you want to use
wchar_t.
On Windows the Unicode driver is sqlod32w.dll, usually registered as
"MaxDB (Unicode)". It supports the full range of the ODBC Unicode API.
Under Linux/Unix both APIs are in libsqlod.[a|so]. To get it right you
should always use SQLConnectW to establish a Unicode-ready connection
to the DB.
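The byte-width translation described above can be sketched in Python; the helper names below are hypothetical (they are not part of the MaxDB ODBC driver) and only illustrate what an application has to do on Linux before handing buffers to the driver:

```python
# Sketch of the wchar_t <-> SQLWCHAR translation (hypothetical helpers):
# on Linux a wchar_t buffer holds 4-byte units (UTF-32), while the MaxDB
# ODBC driver expects 2-byte SQLWCHAR units.

def wchar_to_sqlwchar(buf: bytes) -> bytes:
    """UTF-32-LE wchar_t buffer -> UTF-16-LE SQLWCHAR bytes."""
    return buf.decode("utf-32-le").encode("utf-16-le")

def sqlwchar_to_wchar(buf: bytes) -> bytes:
    """UTF-16-LE SQLWCHAR bytes -> UTF-32-LE wchar_t buffer."""
    return buf.decode("utf-16-le").encode("utf-32-le")

wide = "Ö%".encode("utf-32-le")    # 8 bytes, as in a Linux wchar_t[]
narrow = wchar_to_sqlwchar(wide)   # 4 bytes, as SQLWCHAR[] expects
```

Going through the codecs this way also catches invalid sequences, which a plain byte-narrowing memcpy would silently corrupt.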
> Moreover I read that it's possible to read/write Unicode strings with
> python but it's impossible to use unicode strings for query strings -
> is this still true?
Query strings in unicode still aren't possible, but you can use
parameters for most cases:

session.sqlX ("select * from tableA where columnX like ?", ['Ö%'])

or, if you use the same statement several times with changing values:

insert = session.prepare ('insert into tableA values (?, ?, ?)')
for v1, v2, v3 in iterator:
    insert.execute ([v1, v2, v3])
> I need simple select statements like "select * from tableA where
> columnX like 'Ö%'" and this should be possible also from python.
>
> What about sorting tables in unicode? Will e.g. "Ö" be interpreted as
> "OE" and be sorted accordingly?
Sorting is done as with ASCII: in hex sequence. Therefore Ö will be
somewhere behind Z. If you want/need Ö to be ordered as OE/Oe, then
please use MAPCHAR sets.
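For illustration, Python's default string comparison behaves like the hex-sequence sort just described; the `fold` helper below is only a hypothetical stand-in for the OE mapping a MAPCHAR set would apply inside the database:

```python
# Code-point comparison, analogous to MaxDB's hex-sequence sort without
# a MAPCHAR set: Ö (U+00D6) compares greater than Z (U+005A).
names = ["Abel", "Öhler", "Zabel"]
plain = sorted(names)              # ['Abel', 'Zabel', 'Öhler']

# Hypothetical stand-in for the Ö -> OE mapping of a MAPCHAR set:
def fold(s: str) -> str:
    return s.replace("Ö", "OE").replace("ö", "oe")

mapped = sorted(names, key=fold)   # ['Abel', 'Öhler', 'Zabel']
```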
Elke
SAP Labs Berlin
>
> Best Regards,
> Hermann
>
> --
> [EMAIL PROTECTED]
> GPG key ID: 299893C7 (on keyservers)
> FP: 0124 2584 8809 EF2A DBF9 4902 64B4 D16B 2998 93C7
--
MaxDB Discussion Mailing List
For list archives: http://lists.mysql.com/maxdb
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]