This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/main by this push:
     new b658a42  Change id collation for MySQL to case-sensitive (#18072)
b658a42 is described below

commit b658a4243fb5b22b81677ac0f30c767dfc3a1b4b
Author: Jarek Potiuk <[email protected]>
AuthorDate: Wed Sep 8 01:17:22 2021 +0200

    Change id collation for MySQL to case-sensitive (#18072)
    
    For quite some time we recommended MySQL collation to be
    utf8mb3_general_ci in order to avoid too-large-index size. Turns
    out that this collation is .... case-insensitive (that's where
    ci stands for) this causes problems in case of renaming
    tags (!) where only the case differs (Test -> test) as those
    tags are considered equal (!). It would also cause problems if
    there were several DAGs with ids differing by case only.
    
    Moreoever ... there is no "cs" (case sensitive) collation for
    utf8 for MySQL as this is apparently a hard problem:
    
    
https://stackoverflow.com/questions/4558707/case-sensitive-collation-in-mysql
    
    The solution in this PR is to change collation to utf8mb3_bin -
    it messes up with ORDER BY, but this is not a big problem for ID
    kind of values.
    
    Fixes: #17897
---
 airflow/config_templates/config.yml           | 2 +-
 airflow/config_templates/default_airflow.cfg  | 2 +-
 airflow/models/base.py                        | 4 ++--
 docs/apache-airflow/howto/set-up-database.rst | 4 ++--
 tests/models/test_base.py                     | 4 ++--
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/airflow/config_templates/config.yml 
b/airflow/config_templates/config.yml
index da35382..239ffd7 100644
--- a/airflow/config_templates/config.yml
+++ b/airflow/config_templates/config.yml
@@ -82,7 +82,7 @@
       description: |
         Collation for ``dag_id``, ``task_id``, ``key`` columns in case they 
have different encoding.
         By default this collation is the same as the database collation, 
however for ``mysql`` and ``mariadb``
-        the default is ``utf8mb3_general_ci`` so that the index sizes of our 
index keys will not exceed
+        the default is ``utf8mb3_bin`` so that the index sizes of our index 
keys will not exceed
         the maximum size of allowed index when collation is set to ``utf8mb4`` 
variant
         (see 
https://github.com/apache/airflow/pull/17603#issuecomment-901121618).
       version_added: 2.0.0
diff --git a/airflow/config_templates/default_airflow.cfg 
b/airflow/config_templates/default_airflow.cfg
index e36a6eb..b004627 100644
--- a/airflow/config_templates/default_airflow.cfg
+++ b/airflow/config_templates/default_airflow.cfg
@@ -64,7 +64,7 @@ sql_engine_encoding = utf-8
 
 # Collation for ``dag_id``, ``task_id``, ``key`` columns in case they have 
different encoding.
 # By default this collation is the same as the database collation, however for 
``mysql`` and ``mariadb``
-# the default is ``utf8mb3_general_ci`` so that the index sizes of our index 
keys will not exceed
+# the default is ``utf8mb3_bin`` so that the index sizes of our index keys 
will not exceed
 # the maximum size of allowed index when collation is set to ``utf8mb4`` 
variant
 # (see https://github.com/apache/airflow/pull/17603#issuecomment-901121618).
 # sql_engine_collation_for_ids =
diff --git a/airflow/models/base.py b/airflow/models/base.py
index 0322124..02d230d 100644
--- a/airflow/models/base.py
+++ b/airflow/models/base.py
@@ -44,7 +44,7 @@ def get_id_collation_args():
     if collation:
         return {'collation': collation}
     else:
-        # Automatically use utf8mb3_general_ci collation for mysql
+        # Automatically use utf8mb3_bin collation for mysql
         # This is backwards-compatible. All our IDS are ASCII anyway so even if
         # we migrate from previously installed database with different 
collation and we end up mixture of
         # COLLATIONS, it's not a problem whatsoever (and we keep it small 
enough so that our indexes
@@ -56,7 +56,7 @@ def get_id_collation_args():
         # parameters, so we use the connection
         conn = conf.get('core', 'sql_alchemy_conn', fallback='')
         if conn.startswith('mysql') or conn.startswith("mariadb"):
-            return {'collation': 'utf8mb3_general_ci'}
+            return {'collation': 'utf8mb3_bin'}
         return {}
 
 
diff --git a/docs/apache-airflow/howto/set-up-database.rst 
b/docs/apache-airflow/howto/set-up-database.rst
index 02395d5..518cb53 100644
--- a/docs/apache-airflow/howto/set-up-database.rst
+++ b/docs/apache-airflow/howto/set-up-database.rst
@@ -163,7 +163,7 @@ In the example below, a database ``airflow_db`` and user  
with username ``airflo
 
    The database must use a UTF-8 character set. A small caveat that you must 
be aware of is that utf8 in newer versions of MySQL is really utf8mb4 which
    causes Airflow indexes to grow too large (see 
https://github.com/apache/airflow/pull/17603#issuecomment-901121618). Therefore 
as of Airflow 2.2
-   all MySQL databases have ``sql_engine_collation_for_ids`` set automatically 
to ``utf8mb3_general_ci`` (unless you override it). This might
+   all MySQL databases have ``sql_engine_collation_for_ids`` set automatically 
to ``utf8mb3_bin`` (unless you override it). This might
    lead to a mixture of collation ids for id fields in Airflow Database, but 
it has no negative consequences since all relevant IDs in Airflow use
    ASCII characters only.
 
@@ -187,7 +187,7 @@ without any cert options provided.
 However if you want to use other drivers visit the `MySQL Dialect 
<https://docs.sqlalchemy.org/en/13/dialects/mysql.html>`__  in SQLAlchemy 
documentation for more information regarding download
 and setup of the SqlAlchemy connection.
 
-In addition, you also should pay particular attention to MySQL's encoding. 
Although the ``utf8mb4`` character set is more and more popular for MySQL 
(actually, ``utf8mb4`` becomes default character set in MySQL8.0), using the 
``utf8mb4`` encoding requires additional setting in Airflow 2+ (See more 
details in `#7570 <https://github.com/apache/airflow/pull/7570>`__.). If you 
use ``utf8mb4`` as character set, you should also set 
``sql_engine_collation_for_ids=utf8mb3_general_ci``.
+In addition, you also should pay particular attention to MySQL's encoding. 
Although the ``utf8mb4`` character set is more and more popular for MySQL 
(actually, ``utf8mb4`` becomes default character set in MySQL8.0), using the 
``utf8mb4`` encoding requires additional setting in Airflow 2+ (See more 
details in `#7570 <https://github.com/apache/airflow/pull/7570>`__.). If you 
use ``utf8mb4`` as character set, you should also set 
``sql_engine_collation_for_ids=utf8mb3_bin``.
 
 Setting up a PostgreSQL Database
 --------------------------------
diff --git a/tests/models/test_base.py b/tests/models/test_base.py
index fa25313..5579986 100644
--- a/tests/models/test_base.py
+++ b/tests/models/test_base.py
@@ -26,8 +26,8 @@ from tests.test_utils.config import conf_vars
     ("dsn", "expected", "extra"),
     [
         param("postgres://host/the_database", {}, {}, id="postgres"),
-        param("mysql://host/the_database", {"collation": 
"utf8mb3_general_ci"}, {}, id="mysql"),
-        param("mysql+pymsql://host/the_database", {"collation": 
"utf8mb3_general_ci"}, {}, id="mysql+pymsql"),
+        param("mysql://host/the_database", {"collation": "utf8mb3_bin"}, {}, 
id="mysql"),
+        param("mysql+pymsql://host/the_database", {"collation": 
"utf8mb3_bin"}, {}, id="mysql+pymsql"),
         param(
             "mysql://host/the_database",
             {"collation": "ascii"},

Reply via email to