Hi Björn,

The time for more than 4 bytes in UTF-8 will never come. Even if emoji expand so
far that more than 1,112,064 “characters” are needed, the new encoding will not
be called UTF-8 anymore, and I doubt it will even be called Unicode.

UTF-8 does not go up to 7 bytes per character. While the encoding scheme with
leading/trailing bytes could allow for more than 4 bytes, this was explicitly
clarified and forbidden in RFC 3629 (https://tools.ietf.org/html/rfc3629#section-4),
along with the encoding of unpaired “surrogate” code points from UTF-16. So
basically UTF-8 can encode everything that UTF-16 can, and not more than that.
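
To make the limits concrete, here is a minimal Python sketch of the byte-length
rule. The function name utf8_length and the constants are only illustrative
(they are not from any library); the ranges are the ones RFC 3629 describes, and
Python's built-in codec is used only as a cross-check.

```python
# Sketch of the UTF-8 length rule as restricted by RFC 3629.
# utf8_length, MAX_CODE_POINT and SURROGATES are illustrative names.
MAX_CODE_POINT = 0x10FFFF            # highest Unicode code point
SURROGATES = range(0xD800, 0xE000)   # reserved for UTF-16 pairs

def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code point."""
    if code_point in SURROGATES or not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"U+{code_point:04X} cannot be encoded in UTF-8")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4

# Everything UTF-8 may encode: all code points minus the surrogates.
scalar_values = (MAX_CODE_POINT + 1) - len(SURROGATES)

if __name__ == "__main__":
    print(scalar_values)                              # 1112064
    print(utf8_length(MAX_CODE_POINT))                # 4
    print(len(chr(MAX_CODE_POINT).encode("utf-8")))   # 4
```

The last two lines agree: even the highest code point fits in 4 bytes, which is
why nothing longer is ever needed.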

The utf8mb4 story is this: there was a discussion, IIRC during MySQL 5.5
development, about whether to keep using the UTF8 name or to create a new name
for the Unicode (2.0+) conforming charset. As you noticed, MySQL’s traditional
version of UTF8 is truncated (at most 3 bytes per character). On the other hand,
reusing a name for something different could lead to compatibility problems with
existing applications. The conservative decision was to give the real (in the
Unicode sense) UTF8 a new name. The “utf8mb4” name is not pretty and is
confusing, but no compatibility problems were reported.


From: Björn Keil
Sent: Friday, 11 October 2019 12:10
To: [email protected]
Subject: Re: [Maria-discuss] Limited Unicode Support?

Thanks for the replies. I've tried to just replace all occurrences of "utf8" in 
my example with "utf8mb4" and it works.

Inconveniently, this will require major conversions and downtime for my
application, but at least I know what I must do to make it work.

The "mb4" sounds a little suspicious, though. While there are no Unicode code
points numbered high enough yet to make such a measure necessary, the UTF-8
encoding allows for characters up to seven bytes long, if I am not mistaken.
Does utf8mb4 allow for characters longer than four bytes, if and when the time
comes?

Am Do., 10. Okt. 2019 um 17:18 Uhr schrieb Diego Dupin 
<[email protected]>:
Hi Björn,

🙋 is a character encoded in 4 bytes (0xF0 0x9F 0x99 0x8B).

"utf8" is a 3-Byte UTF-8 Unicode encoding. 
You have to configure the charset "utf8mb4", which permits full UTF-8 support.
https://jira.mariadb.org/browse/MDEV-8334 in 10.5 is the first step towards
making utf8mb4 the default for 'utf8'.
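
As a rough illustration of the point above, the following Python sketch checks
the byte sequence of 🙋 and flags strings that would not fit into the 3-byte
"utf8" charset. The helper name needs_utf8mb4 is hypothetical and not part of
MariaDB or any connector; it only tests whether a character lies above U+FFFF.

```python
# Sketch: which strings need more than 3 bytes per character in UTF-8?
# needs_utf8mb4 is a hypothetical helper, not a MariaDB or connector API.

def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF, i.e. takes 4 UTF-8 bytes."""
    return any(ord(ch) > 0xFFFF for ch in text)

def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))

if __name__ == "__main__":
    waving = "\U0001F64B"                      # the 🙋 character
    print(utf8_hex(waving))                    # f0 9f 99 8b
    print(needs_utf8mb4(waving + " Huhu."))    # True
    print(needs_utf8mb4("Björn"))              # False
```

A check like this could run in the application before an INSERT, to find out
which columns actually have to be converted to utf8mb4.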

regards,
diego.


On Thu, Oct 10, 2019 at 3:53 PM Björn Keil <[email protected]> wrote:
Hello,

I hope this is the proper mailing list to ask such questions, I apologise if it 
isn't.

I am having some problems with unusual Unicode characters in my MariaDB 
database.

$ mariadb --version
mariadb  Ver 15.1 Distrib 10.3.17-MariaDB, for debian-linux-gnu (x86_64) using 
readline 5.2
$ sudo ./mariadb.php
[sudo] Passwort für bjoern: 
Query: INSERT INTO `test` SET `string` = '🙋 Huhu. wie geht es dir?'
Inserted: '🙋 Huhu. wie geht es dir?'
Returned: '???? Huhu. wie geht es dir?'

SHOW VARIABLES LIKE 'character%':
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/

As you can see here, MariaDB does not take the character '🙋' ( 
https://www.fileformat.info/info/unicode/char/1f64b/index.htm ) and instead 
replaces it with four question marks and I have no idea why.

I've attached the PHP code for the example.

I would be most grateful for any suggestion.

Regards,
Björn Keil
_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp
