This is an automated email from the ASF dual-hosted git repository.
potiuk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git
The following commit(s) were added to refs/heads/main by this push:
new b658a42 Change id collation for MySQL to case-sensitive (#18072)
b658a42 is described below
commit b658a4243fb5b22b81677ac0f30c767dfc3a1b4b
Author: Jarek Potiuk <[email protected]>
AuthorDate: Wed Sep 8 01:17:22 2021 +0200
Change id collation for MySQL to case-sensitive (#18072)
For quite some time we recommended MySQL collation to be
utf8mb3_general_ci in order to avoid too-large-index size. Turns
out that this collation is .... case-insensitive (that's where
ci stands for) this causes problems in case of renaming
tags (!) where only the case differs (Test -> test) as those
tags are considered equal (!). It would also cause problems if
there were several DAGs with ids differing by case only.
Moreoever ... there is no "cs" (case sensitive) collation for
utf8 for MySQL as this is apparently a hard problem:
https://stackoverflow.com/questions/4558707/case-sensitive-collation-in-mysql
The solution in this PR is to change collation to utf8mb3_bin -
it messes up with ORDER BY, but this is not a big problem for ID
kind of values.
Fixes: #17897
---
airflow/config_templates/config.yml | 2 +-
airflow/config_templates/default_airflow.cfg | 2 +-
airflow/models/base.py | 4 ++--
docs/apache-airflow/howto/set-up-database.rst | 4 ++--
tests/models/test_base.py | 4 ++--
5 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/airflow/config_templates/config.yml
b/airflow/config_templates/config.yml
index da35382..239ffd7 100644
--- a/airflow/config_templates/config.yml
+++ b/airflow/config_templates/config.yml
@@ -82,7 +82,7 @@
description: |
Collation for ``dag_id``, ``task_id``, ``key`` columns in case they
have different encoding.
By default this collation is the same as the database collation,
however for ``mysql`` and ``mariadb``
- the default is ``utf8mb3_general_ci`` so that the index sizes of our
index keys will not exceed
+ the default is ``utf8mb3_bin`` so that the index sizes of our index
keys will not exceed
the maximum size of allowed index when collation is set to ``utf8mb4``
variant
(see
https://github.com/apache/airflow/pull/17603#issuecomment-901121618).
version_added: 2.0.0
diff --git a/airflow/config_templates/default_airflow.cfg
b/airflow/config_templates/default_airflow.cfg
index e36a6eb..b004627 100644
--- a/airflow/config_templates/default_airflow.cfg
+++ b/airflow/config_templates/default_airflow.cfg
@@ -64,7 +64,7 @@ sql_engine_encoding = utf-8
# Collation for ``dag_id``, ``task_id``, ``key`` columns in case they have
different encoding.
# By default this collation is the same as the database collation, however for
``mysql`` and ``mariadb``
-# the default is ``utf8mb3_general_ci`` so that the index sizes of our index
keys will not exceed
+# the default is ``utf8mb3_bin`` so that the index sizes of our index keys
will not exceed
# the maximum size of allowed index when collation is set to ``utf8mb4``
variant
# (see https://github.com/apache/airflow/pull/17603#issuecomment-901121618).
# sql_engine_collation_for_ids =
diff --git a/airflow/models/base.py b/airflow/models/base.py
index 0322124..02d230d 100644
--- a/airflow/models/base.py
+++ b/airflow/models/base.py
@@ -44,7 +44,7 @@ def get_id_collation_args():
if collation:
return {'collation': collation}
else:
- # Automatically use utf8mb3_general_ci collation for mysql
+ # Automatically use utf8mb3_bin collation for mysql
# This is backwards-compatible. All our IDS are ASCII anyway so even if
# we migrate from previously installed database with different
collation and we end up mixture of
# COLLATIONS, it's not a problem whatsoever (and we keep it small
enough so that our indexes
@@ -56,7 +56,7 @@ def get_id_collation_args():
# parameters, so we use the connection
conn = conf.get('core', 'sql_alchemy_conn', fallback='')
if conn.startswith('mysql') or conn.startswith("mariadb"):
- return {'collation': 'utf8mb3_general_ci'}
+ return {'collation': 'utf8mb3_bin'}
return {}
diff --git a/docs/apache-airflow/howto/set-up-database.rst
b/docs/apache-airflow/howto/set-up-database.rst
index 02395d5..518cb53 100644
--- a/docs/apache-airflow/howto/set-up-database.rst
+++ b/docs/apache-airflow/howto/set-up-database.rst
@@ -163,7 +163,7 @@ In the example below, a database ``airflow_db`` and user
with username ``airflo
The database must use a UTF-8 character set. A small caveat that you must
be aware of is that utf8 in newer versions of MySQL is really utf8mb4 which
causes Airflow indexes to grow too large (see
https://github.com/apache/airflow/pull/17603#issuecomment-901121618). Therefore
as of Airflow 2.2
- all MySQL databases have ``sql_engine_collation_for_ids`` set automatically
to ``utf8mb3_general_ci`` (unless you override it). This might
+ all MySQL databases have ``sql_engine_collation_for_ids`` set automatically
to ``utf8mb3_bin`` (unless you override it). This might
lead to a mixture of collation ids for id fields in Airflow Database, but
it has no negative consequences since all relevant IDs in Airflow use
ASCII characters only.
@@ -187,7 +187,7 @@ without any cert options provided.
However if you want to use other drivers visit the `MySQL Dialect
<https://docs.sqlalchemy.org/en/13/dialects/mysql.html>`__ in SQLAlchemy
documentation for more information regarding download
and setup of the SqlAlchemy connection.
-In addition, you also should pay particular attention to MySQL's encoding.
Although the ``utf8mb4`` character set is more and more popular for MySQL
(actually, ``utf8mb4`` becomes default character set in MySQL8.0), using the
``utf8mb4`` encoding requires additional setting in Airflow 2+ (See more
details in `#7570 <https://github.com/apache/airflow/pull/7570>`__.). If you
use ``utf8mb4`` as character set, you should also set
``sql_engine_collation_for_ids=utf8mb3_general_ci``.
+In addition, you also should pay particular attention to MySQL's encoding.
Although the ``utf8mb4`` character set is more and more popular for MySQL
(actually, ``utf8mb4`` becomes default character set in MySQL8.0), using the
``utf8mb4`` encoding requires additional setting in Airflow 2+ (See more
details in `#7570 <https://github.com/apache/airflow/pull/7570>`__.). If you
use ``utf8mb4`` as character set, you should also set
``sql_engine_collation_for_ids=utf8mb3_bin``.
Setting up a PostgreSQL Database
--------------------------------
diff --git a/tests/models/test_base.py b/tests/models/test_base.py
index fa25313..5579986 100644
--- a/tests/models/test_base.py
+++ b/tests/models/test_base.py
@@ -26,8 +26,8 @@ from tests.test_utils.config import conf_vars
("dsn", "expected", "extra"),
[
param("postgres://host/the_database", {}, {}, id="postgres"),
- param("mysql://host/the_database", {"collation":
"utf8mb3_general_ci"}, {}, id="mysql"),
- param("mysql+pymsql://host/the_database", {"collation":
"utf8mb3_general_ci"}, {}, id="mysql+pymsql"),
+ param("mysql://host/the_database", {"collation": "utf8mb3_bin"}, {},
id="mysql"),
+ param("mysql+pymsql://host/the_database", {"collation":
"utf8mb3_bin"}, {}, id="mysql+pymsql"),
param(
"mysql://host/the_database",
{"collation": "ascii"},