Re: Question regarding UTF-8 data and "C" collation on definition of field of table

Dionisis Kontominas Sun, 05 Feb 2023 16:48:41 -0800

Hi Tom,

   1.  Regarding the different languages in the same column, that is normal
   if the column is a UTF-8 one, i.e. should be able to hold for example
   English, Greek and Chinese characters. In this case what is the best
   approach to define the collation and lctype of the column? Either C
   or ucs_basic maybe or something else or it does not matter, the characters
   would be stored correctly as long as the database is UTF8 encoding?
   2. Is there a real difference between C and ucs_basic collation/lctype?
   I have looked at the documentation and on the internet and most information
   converges that they are actually the same as behavior is concerned;
   actually no difference. What I suspect though as a non-similar is that C
   after the ASCII part sorts the characters according to byte code point,
   whereas the ucs_basic sorts those characters according to the Unicode code
   point which may be different I suppose. Can you confirm?
   3. In case the previous is correct I think that the most successful
   collation for unicode columns would e to set the collation to
   ucs_basic collation type and lctype accordingly if exists.



Regards,
Dionisis

On Mon, 6 Feb 2023 at 01:19, Tom Lane <[email protected]> wrote:

> Dionisis Kontominas <[email protected]> writes:
> >    I suppose that affects the outcome of ORDER BY clauses on the field,
> > along with the content of the indexes. Is this right?
>
> Yeah.
>
> >    Assuming that the requirement exists, to store UTF-8 characters on a
> > field that can be from multiple languages, and the database default
> > encoding is UTF8 which is the right thing I suppose (please verify), what
> > do you think should be the values of the Collation and Ctype for the
> > database to behave correctly?
>
> Um ... so define "correct".  If you have a mishmash of languages in the
> same column, it's likely that they have conflicting rules about sorting,
> and there may be no ordering that's not surprising to somebody.
>
> If there's a predominant language in the data, selecting a collation
> matching that seems like your best bet.  Otherwise, maybe you should
> just shrug your shoulders and stick with C collation.  It's likely
> to be faster than any alternative.
>
>                         regards, tom lane
>

Re: Question regarding UTF-8 data and "C" collation on definition of field of table

Reply via email to