[jira] [Updated] (AIRFLOW-6947) UTF8mb4 encoding for mysql does not work in Airflow 2.0

2020-02-27 Thread Jarek Potiuk (Jira)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarek Potiuk updated AIRFLOW-6947:
--
Description: 
The problem is with how MySQL handles different encodings, especially UTF8. 
The default utf8 encoding in MySQL does not handle all UTF8 characters, only 
those encoded in up to 3 bytes. The 4-byte ones do not work: you get the error 
"Incorrect string value: '\xF0' for column 'description' at row 1" when you try 
to insert a DAG with a 4-byte unicode character.

This is a problem mainly with the DAG description that is stored in the 
database. One of our customers had this very issue with their database, whose 
encoding is utf8. The current utf8 behaviour is that it is an alias for utf8mb3 
[https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html], which means 
it does not handle all characters (mostly emojis, but also some Chinese 
characters 
[https://stackoverflow.com/questions/17680237/mysql-four-byte-chinese-characters-support]).
In some future version of MySQL, utf8 will become an alias for utf8mb4 
([https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html]), which 
supports the full range of Unicode characters. It is strongly advised to use 
utf8mb4 directly as the default encoding.
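
To make the 3-byte limit described above concrete, here is a minimal Python 
sketch that flags text needing utf8mb4. The helper names and the sample 
description are hypothetical and are not part of Airflow. It relies only on 
the fact that code points above U+FFFF take 4 bytes in UTF-8.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character of `text` takes 4 bytes in UTF-8."""
    # Code points above U+FFFF (outside the Basic Multilingual Plane)
    # are the ones encoded in 4 bytes.
    return any(ord(ch) > 0xFFFF for ch in text)


def four_byte_chars(text: str) -> list:
    """List the characters of `text` whose UTF-8 encoding is 4 bytes long."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


# Hypothetical DAG description that ends with an emoji (U+1F680).
description = "Daily sales report \U0001F680"

print(needs_utf8mb4(description))            # True
print(needs_utf8mb4("Daily sales report"))   # False

# U+00E9 is 2 bytes, U+4E2D is 3 bytes, U+20000 and U+1F680 are 4 bytes.
print(len(four_byte_chars("caf\u00e9 \u4e2d \U00020000 \U0001F680")))  # 2
```

The first byte of any 4-byte UTF-8 sequence is in the range 0xF0 to 0xF4, 
which is consistent with the '\xF0' shown in the error message.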

I decided to see how it works with utf8mb4 encoding and, unfortunately, it 
turns out that if we switch to it, the migration scripts for Airflow fail 
because the key size for at least one of the indexes exceeds the maximum key 
length:

"Specified key was too long; max key length is 3072 bytes" when the XCOM 
primary key is created:

ALTER TABLE xcom ADD CONSTRAINT pk_xcom PRIMARY KEY (dag_id, task_id, `key`, 
execution_date)

In Airflow 1.10 the primary key for an xcom was an integer, and in 2.0 it is a 
compound index on dag_id, task_id, execution_date and key. Together they make 
the key too big for utf8mb4 (in utf8mb4 encoding each character of a text field 
takes up to 4 bytes).
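
The arithmetic behind that limit can be sketched as follows. The column 
lengths (250, 250 and 512 characters) and the 8 bytes for execution_date are 
assumptions made for illustration and are not stated in this issue. The 
sketch also ignores any per-column overhead MySQL may add. Only the 3072-byte 
limit comes from the error message.

```python
MAX_KEY_BYTES = 3072  # limit quoted in the MySQL error message

# Assumed column lengths in characters (hypothetical, for illustration only).
ASSUMED_COLUMN_CHARS = {"dag_id": 250, "task_id": 250, "key": 512}
# Assumed storage size of the execution_date column, in bytes.
ASSUMED_DATETIME_BYTES = 8


def index_key_bytes(bytes_per_char: int) -> int:
    """Worst-case size of the compound key for a given character width."""
    text_chars = sum(ASSUMED_COLUMN_CHARS.values())
    return text_chars * bytes_per_char + ASSUMED_DATETIME_BYTES


for charset, width in (("utf8mb3", 3), ("utf8mb4", 4)):
    size = index_key_bytes(width)
    verdict = "fits" if size <= MAX_KEY_BYTES else "too long"
    print(f"{charset}: {size} bytes ({verdict})")
# utf8mb3: 3044 bytes (fits)
# utf8mb4: 4056 bytes (too long)
```

Under these assumptions the same four columns fit within the limit at 3 bytes 
per character but exceed it at 4 bytes per character.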

So far our CI has used the default MySQL encoding (which, for the uninitiated, 
is latin1_swedish_ci (!)). I switched it to utf8mb4 so that you can see the 
behaviour. The PR is here: [https://github.com/apache/airflow/pull/7570] 
and the failed test is here:

[https://travis-ci.org/apache/airflow/jobs/655733996?utm_medium=notification_source=github_status]
 

Note that a similar problem occurs in 1.10 with MySQL 5.6: if I change the 
charset to utf8mb4 and choose MySQL 5.6, it fails because there the max key 
length was half the size (1536 bytes).

There is even an issue for it in our JIRA, 
https://issues.apache.org/jira/browse/AIRFLOW-3786, for a different index. The 
workaround was to use utf8 (utf8mb3) or to switch to MySQL 5.7.


