Jarek Potiuk created AIRFLOW-6947:
-------------------------------------

             Summary: UTF8mb4 encoding for mysql does not work in Airflow 2.0
                 Key: AIRFLOW-6947
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6947
             Project: Apache Airflow
          Issue Type: Improvement
          Components: mysql, database
    Affects Versions: 2.0.0
            Reporter: Jarek Potiuk


The problem is with how MySQL handles different encodings, especially UTF8. 
The default utf8 encoding in MySQL does not handle all UTF-8 characters 
(only those encoded in up to 3 bytes). 4-byte characters do not work: there is 
an error "Incorrect string value: '\\xF0....' for column 'description' at row 
1" when you try to insert a DAG containing a 4-byte Unicode character.
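
To make the 3-byte limit concrete, here is a minimal Python sketch that checks whether a string would fit in a utf8mb3 column. The helper names (fits_utf8mb3, four_byte_chars) and the sample descriptions are made up for illustration and are not part of Airflow or MySQL.

```python
def fits_utf8mb3(text: str) -> bool:
    """Return True if every character encodes to at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


def four_byte_chars(text: str) -> list:
    """Return the characters that need 4 bytes in UTF-8 (e.g. most emojis)."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


# Hypothetical DAG descriptions, used only for illustration.
plain = "Daily sales report"
with_emoji = "Daily sales report \U0001F680"  # U+1F680 ROCKET

print(fits_utf8mb3(plain))
print(fits_utf8mb3(with_emoji))
# The 4-byte encoding starts with 0xF0, matching the '\xF0....' in the error.
print("\U0001F680".encode("utf-8"))
```

A check like this could be used to spot values that a utf8 (utf8mb3) column would reject before they reach the database.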

This is a problem, for example, with the DAG description, which is stored in 
the database. One of our customers had this very issue with their database, 
whose encoding is utf8. The current behaviour is that utf8 is an alias for 
utf8mb3 (https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html), 
which means it does not handle all characters (mostly emojis). In some future 
version of MySQL, utf8 will become an alias for utf8mb4 
(https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html), which 
supports the full range of UTF-8-encoded characters. It is strongly advised to 
use utf8mb4 directly as the default encoding.

I decided to see how it works with utf8mb4 encoding and, unfortunately, it 
turns out that if we switch to it, the migration scripts for Airflow fail 
because the key size for at least one of the indexes exceeds the maximum key 
length:

'Specified key was too long; max key length is 3072 bytes' when the XCOM 
primary key is created.

ALTER TABLE xcom ADD CONSTRAINT pk_xcom PRIMARY KEY (dag_id, task_id, `key`, 
execution_date)

Apparently the increased size of some columns (key?) makes the index key too 
big for utf8mb4 (in utf8mb4 encoding, text fields take up to 4 bytes per 
character, so an indexed text column reserves 4x its character length in 
bytes).
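
The arithmetic can be sketched roughly as follows. The column widths (250 characters for dag_id and task_id, 512 for key) and the 7 bytes for execution_date are assumptions about the xcom schema, not values taken from this report, and the estimate ignores any per-column storage overhead. Only the 3072-byte limit comes from the error message above.

```python
# Limit quoted in the MySQL error message above.
MAX_KEY_BYTES = 3072

# ASSUMPTION: character widths of the text columns in the xcom primary key.
XCOM_PK_VARCHAR_LENGTHS = {"dag_id": 250, "task_id": 250, "key": 512}

# ASSUMPTION: execution_date stored as a fractional-second timestamp (7 bytes).
EXECUTION_DATE_BYTES = 7

# Maximum bytes reserved per character for each character set.
BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}


def estimated_key_bytes(charset: str) -> int:
    """Rough upper bound on the primary key size for a given character set."""
    chars = sum(XCOM_PK_VARCHAR_LENGTHS.values())
    return chars * BYTES_PER_CHAR[charset] + EXECUTION_DATE_BYTES


for charset in BYTES_PER_CHAR:
    size = estimated_key_bytes(charset)
    verdict = "fits" if size <= MAX_KEY_BYTES else "too long"
    print(f"{charset}: {size} bytes ({verdict})")
```

Under these assumed widths the key stays below 3072 bytes with latin1 and utf8mb3 but exceeds it with utf8mb4, which would be consistent with the failure only appearing after the charset switch.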

In our CI we have so far used the default MySQL encoding (which, for the 
uninitiated, is latin1 with the latin1_swedish_ci collation (!)). I switched 
it to utf8mb4 so that you can see the behaviour. The PR is here: 
https://github.com/apache/airflow/pull/7570 and the failed test is here:

https://travis-ci.org/apache/airflow/jobs/655733996?utm_medium=notification&utm_source=github_status
 

Note that a similar problem occurs in 1.10 with MySQL 5.6: if I change the 
charset to utf8mb4 and choose MySQL 5.6, it fails because there the max key 
length was half the size (1536 characters).

There is even an issue for it in our JIRA: 
https://issues.apache.org/jira/browse/AIRFLOW-3786. The workaround was to use 
utf8 (utf8mb3) or to switch to MySQL 5.7.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
