potiuk commented on issue #17897: URL: https://github.com/apache/airflow/issues/17897#issuecomment-914664135
> PS: solution works for both mariadb and mysql Great you checked - I also already saw that Our CI tests pass with utf8mb3_bin. Great that you found it before we made it a default for mysql! > @potiuk I would highly recommend to also add test case for such scenario... Just incase. I do not even know what KIND of test we want to start with there? What logic would you recommend? ** BEGINING OF RANT ** MySQL and MariaDB are very bad about the collation/encoding and it's really painful to workaround this - they have a long history of problems with those. First of all latin1 default encoding (with latin1_swedish_ci as default collation - note the case insensivity here as well) up until MySQL8 (Hey MySQL is from Sweden ;) ). Then they changed it to `utf8` in Mysql8. But hey.. When you use `utf8` in MySQL8 it's actually utf8mb3 which is (listen to that!) deprecated in the very same MySQL8 and it will be replaced at some point of time The message below is ridiculous, yet it is official documentation of MySQL: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html > The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 is expected to become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8. Now the problem is that they have also index size problem. Size of index is very limited and you blow up past the max index size once you have utf8mb4 (hey - it takes 4 times the number of characters you can fit). This is the reason why we had to specifically set utf8mb3_bin as collation for ids, because otherwise you would not be able to create Airflow DB. And yet the utf8mb3 is deprecated, but we have to use it. For worse - the encoding can be specified in 4 places. In 3 places in the database and separately in the client. And the one on the client might be completely diferent than the one on the server (it is by default derived from the LANG settings you have on the client and NOT from what your server's encoding/collation is - so, if by chance you have wrong settings of LANG on your client and you will not force it properly in the URL of the connection, then you are out of luck and if you happen to use an accented character (and only then) you will get random unicode/decode errors because the client will try to decode the characters encoded in different charset. Also here you might find a very nice tutorial - where you will learn how to deal with collation on MySQL https://www.mysqltutorial.org/mysql-collation/ - when you go the end of it, you will see: > Now, the c1 column has the latin1 character set, but what about its collation? Is it inheriting the latin1_german1_ci collation from the table’s collation? > The answer is no. The reason is that the default collation of the latin1 character set is latin1_swedish_ci, therefore, the c1 column will have the latin1_swedish_ci collation. In an official nice tutorial they mention that "even if you think that the encoding is this, it is completely counter-intuitive and it's something different".... C'mon. ** END OF RANT ** No. Honestly @dimon222 - I do not even know where to start and how to write any kind of tests that would deal with that madness. When I started my work with Airflow, I was warned that MySQL is a poor database. Well. YES. IT IS TERRIBLE DATABASE.. DON'T USE IT. USE POSTGRES Switch as soon as you can. Google Composer team switched to Postgres in Airflow 2 from MySQL in 1.10 because they realised that this is the only way they can deal with MySQL. For now we have tests that create and run the MySQL database with 'utf8mb4' encoding and `utf8mb3_bin` collation for ids. That's it. This is the only supported setting. No more. (Sorry I had to vent my frustration . I hope you will find it entertaining nevertheless. I find it amusing how they managed to screw the encoding stuff up such badly). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
