[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

GitBox Tue, 07 Sep 2021 15:17:30 -0700


potiuk commented on issue #17897:
URL: https://github.com/apache/airflow/issues/17897#issuecomment-914664135



   > PS: solution works for both mariadb and mysql
   
   Great you checked - I also already saw that Our CI tests pass with 
utf8mb3_bin. Great that you found it before we made it a default for mysql! 
   
   > @potiuk I would highly recommend to also add test case for such 
scenario... Just incase.
   
   I do not even know what KIND of test we want to start with there? What logic 
would you recommend?
   
   ** BEGINING OF RANT **
   
   MySQL and MariaDB are very bad about the collation/encoding and it's really 
painful to workaround this - they have a long history of problems with those. 
First of all latin1 default encoding (with latin1_swedish_ci as default 
collation - note the case insensivity here as well) up until MySQL8 (Hey MySQL 
is from Sweden ;) ). 
   
   Then they changed it to `utf8` in Mysql8. But hey.. When you use `utf8` in 
MySQL8 it's actually utf8mb3 which is (listen to that!) deprecated in the very 
same MySQL8 and  it will be replaced at some point of time 
   
   The message below is ridiculous, yet it is official documentation of MySQL: 
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html
   
   > The utf8mb3 character set is deprecated and you should expect it to be 
removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is 
currently an alias for utf8mb3, at some point utf8 is expected to become a 
reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider 
specifying utf8mb4 explicitly for character set references instead of utf8.
    
   Now the problem is that they have also index size problem. Size of index is 
very limited and you blow up past the max index size once you have utf8mb4 (hey 
- it takes 4 times the number of characters you can fit). This is the reason 
why we had to specifically set utf8mb3_bin as collation for ids, because 
otherwise you would not be able to create Airflow DB. And yet the utf8mb3 is 
deprecated, but we have to use it.
   
   For worse - the encoding can be specified in 4 places. In 3 places in the 
database and separately in the client. And the one on the client might be 
completely diferent than the one on the server (it is by default derived from 
the LANG settings you have on the client and NOT from what your server's 
encoding/collation is - so, if by chance you have wrong settings of LANG on 
your client and you will not force it properly in the URL of the connection, 
then you are out of luck and if you happen to use an accented character (and 
only then) you will get random unicode/decode errors because the client will 
try to decode the characters encoded in different charset. 
   
   Also here you might find a very nice tutorial - where you will learn how to 
deal with collation on MySQL https://www.mysqltutorial.org/mysql-collation/ - 
when you go the end of it, you will see:
   
   > Now, the c1 column has the latin1  character set, but what about its 
collation? Is it inheriting the latin1_german1_ci collation from the table’s 
collation?
   > The answer is no. The reason is that the default collation of the latin1 
character set is latin1_swedish_ci, therefore, the c1 column will have the 
latin1_swedish_ci collation.
   
   In an official nice tutorial they mention that "even if you think that the 
encoding is this, it is completely counter-intuitive and it's something 
different".... C'mon.
   
   ** END OF RANT **
   
   No. Honestly @dimon222  - I do not even know where to start and how to write 
any kind of tests that would deal with that madness. When I started my work 
with Airflow, I was warned that MySQL is a poor database. Well. 
   
   YES. IT IS TERRIBLE DATABASE.. DON'T USE IT. USE POSTGRES
   
   Switch as soon as you can. Google Composer team switched to Postgres in 
Airflow 2 from MySQL in 1.10 because they realised that this is the only way 
they can deal with MySQL.
   
   For now we have tests that create and run the MySQL database with 'utf8mb4' 
encoding and `utf8mb3_bin` collation for ids. That's it. This is the only 
supported setting. No more.
   
   (Sorry I had to vent my frustration . I hope you will find it entertaining 
nevertheless. I find it amusing how they managed to screw the encoding stuff up 
such badly).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [airflow] potiuk commented on issue #17897: Dag tags are not refreshing if case sensitivity changed

Reply via email to