Jarek Potiuk created AIRFLOW-6947:
-------------------------------------
Summary: UTF8mb4 encoding for mysql does not work in Airflow 2.0
Key: AIRFLOW-6947
URL: https://issues.apache.org/jira/browse/AIRFLOW-6947
Project: Apache Airflow
Issue Type: Improvement
Components: mysql, database
Affects Versions: 2.0.0
Reporter: Jarek Potiuk
The problem lies in how MySQL handles different encodings, especially UTF8.
The default utf8 encoding in MySQL does not handle all UTF-8 characters (only
those encoded in up to 3 bytes). 4-byte characters do not work: inserting a DAG
that contains a 4-byte Unicode character fails with the error "Incorrect string
value: '\\xF0....' for column 'description' at row 1".
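
To illustrate which characters are affected, here is a minimal Python sketch
that lists the characters in a string needing 4 bytes in UTF-8. The function
names and the sample description are hypothetical and are not part of Airflow.

```python
from typing import List


def four_byte_chars(text: str) -> List[str]:
    """Return the characters of `text` that take 4 bytes when encoded as UTF-8."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


def fits_utf8mb3(text: str) -> bool:
    """True if every character of `text` encodes to at most 3 bytes."""
    return not four_byte_chars(text)


# Hypothetical DAG description containing a rocket emoji (U+1F680).
description = "Daily load \U0001F680"
print(four_byte_chars(description))
print(fits_utf8mb3(description))
```

The UTF-8 encoding of the emoji above starts with the byte 0xF0, which matches
the '\\xF0....' prefix in the error message.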
This is a problem, for example, with the DAG description, which is stored in
the database. One of our customers hit this very issue with their database,
whose encoding is utf8. Currently utf8 is an alias for utf8mb3
(https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html), which means
it does not handle all characters (mostly emojis). In some future version of
MySQL, utf8 will become an alias for utf8mb4
(https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html), which
supports the full range of UTF-8-encoded characters. It is strongly advised to
use utf8mb4 directly as the default encoding.
I decided to see how Airflow works with utf8mb4 encoding and, unfortunately, it
turns out that if we switch to it, the Airflow migration scripts fail because
the key length of at least one of the indexes exceeds the maximum key length:
"Specified key was too long; max key length is 3072 bytes" when the XCOM
primary key is created:
ALTER TABLE xcom ADD CONSTRAINT pk_xcom PRIMARY KEY (dag_id, task_id, `key`,
execution_date)
Apparently the increased size of some columns (key?) makes the primary key too
big for utf8mb4 (in utf8mb4 encoding, text fields reserve 4 bytes per
character instead of 3).
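
The arithmetic can be sketched roughly in Python. The column lengths below are
assumptions made for illustration only (they are not stated in this report),
the timestamp size is a rough allowance, and the estimate ignores any
per-column overhead MySQL may add. Only the 3072-byte limit comes from the
error message.

```python
# Assumed VARCHAR lengths in characters; NOT taken from this report.
ASSUMED_VARCHAR_LENGTHS = {"dag_id": 250, "task_id": 250, "key": 512}
# Rough allowance in bytes for the execution_date timestamp column (assumption).
ASSUMED_FIXED_BYTES = {"execution_date": 8}
# Limit quoted in the error message.
MAX_KEY_BYTES = 3072


def estimated_key_bytes(bytes_per_char: int) -> int:
    """Estimate the primary key size for a given maximum bytes per character."""
    varchar_bytes = sum(
        length * bytes_per_char for length in ASSUMED_VARCHAR_LENGTHS.values()
    )
    return varchar_bytes + sum(ASSUMED_FIXED_BYTES.values())


for charset, width in (("utf8mb3", 3), ("utf8mb4", 4)):
    size = estimated_key_bytes(width)
    verdict = "fits" if size <= MAX_KEY_BYTES else "too long"
    print(f"{charset}: {size} bytes ({verdict})")
```

Under these assumed lengths the key stays just below the limit with 3 bytes
per character and exceeds it with 4, which would be consistent with the
failure described above.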
In our CI we have so far used the default MySQL encoding (which, for the
uninitiated, is latin1 with the latin1_swedish_ci collation (!)). I switched it
to utf8mb4 so that you can see the behaviour. The PR is here:
https://github.com/apache/airflow/pull/7570 and the failed test is here:
https://travis-ci.org/apache/airflow/jobs/655733996?utm_medium=notification&utm_source=github_status
Note that a similar problem occurs in 1.10 with MySQL 5.6: if I change the
charset to utf8mb4 and choose MySQL 5.6, it fails because there the max key
length was half the size (1536 characters).
There is even an issue for it in our JIRA:
https://issues.apache.org/jira/browse/AIRFLOW-3786. The workaround was to use
utf8 (utf8mb3) or to switch to MySQL 5.7.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)