Re: MySQL version support for Airflow 5.6 ->? 5.7

Ash Berlin-Taylor Sun, 01 Mar 2020 12:06:20 -0800

On 1 March 2020 19:40:59 GMT, Jarek Potiuk <[email protected]> wrote:
>>
>>
>>
>> does c) really help? We'd still need a unique key on the existing PK
>> columns - i.e. does Mysql just complain about the length/size of the
>PK
>> index, but not other unique indexes?
>>
>
>Right indeed - so  the problem (and I understand the reason why we
>changed
>1.10 -> 2.0 ) is also the uniqueness of key at insert. I imagine it is
>an
>occasional problem in 1.10 where multiple tasks re-run can insert the
>same
>key twice (ouch!).

Shouldn't be a problem in practice as we delete XComs for a task before running 
out again. Might happen in very rare edge casesi guess.

> Now I see why the change was introduced.
>
>
>Unfortunately - It won't work if we try to index all 4 columns. Or even
>3
>(without execution date).  MySQL/InnoDB has index prefix limit
><https://dev.mysql.com/doc/refman/5.7/en/innodb-limits.html> for all
>indexes not only primary (and this is with default settings- 8KB page
>size
>+ large_prefix enabled).  In our XCom table, the Key is 512 characters
>which translates to 2048 bytes in case of utf8mb4. Each of the
>dag_id/task_id is 250 characters (luckily not 256 - you will see why!).
>When you combine dag_id, task_id, key and execution date in single
>index
>you get 4*250 + 4*250 + 4* 512 + 8 = *4056* bytes. And the max key
>length
>for MySQL is *3072* bytes.
>
>So we have to find another way or simply say we don't support mysql's
>full-UTF8-character set. :( . By the way - we are lucky that dag_id and
>task_id are 250 chars and not 256 - if we were 256, then even utf8mb3
>would
>not be supported as the index size would be *3080* bytes).
>
>
>> Irrespective of mysql's quirks, we would almost never look up rows by
>this
>> int-pk (we look up by dag+task+key, and often but not _awlays_ by
>> execution_date as well), so I don't think we'd get any other benefits
>from
>> it.
>>
>
>The benefit I spoke about is potentially much smaller index to look-up.
>In
>the current setup the primary key could be used by MSQL to do XCom
>searches
>and depending how well the optimiser works, it could actually be slower
>than using secondary index. I believe there are some good reasons why
>the
>size of index are limited (index is read in 16KB blocks at a time so
>you
>have max 5 entries for index loaded at a time with 3072 size index and
>max
>8 with 2100 one. That might affect performance of the search - I bet
>with
>such big indexes we often hit the limits of cache an a lot of the index
>is
>read from the disk rather than kept in memory which is a huge
>performance
>problem that you start to see only when you have a lot of rows (in this
>case - a lot of XCom values stored) - this is a big problem the maximum
>size of index in MySQL tries to avoid.

Experience has taught me that if you want the mysql optimiser to do something 
sensible: it won't. This may be better in newer versions, but was still a 
problem in 5.5.

Smaller index lookups from where/what? All queries from dags etc would not use 
`WHERE id = ?` as you detail below.

>I checked that we have those lookups for Xcom:
>
>1. ( dag_ids, task_id, key, execution_date ==  date)
>2. ( dag_ids, task_id, key, execution_date <=  date)
>3. ( dag_ids, key, execution_date ==  date) (no task_id  - for lineage)
>
>I could not find the "no execution-date" lookup for XCom. Did I miss it
>somehow ? Or maybe it's something that was in the past but it's not any
>more?

Ah no, it's probably the <= case I was thinking of (where you can select XCom 
for past dates too.)

>
>In 1.10 (and it is still in 2.0) we have this index :
>'idx_xcom_dag_task_date', 'xcom', ['dag_id', 'task_id',
>'execution_date'] -
>and it was used for most of the lookups (assumption was that key number
>was
>small). But in 2.0 the primary key as defined currently can (and
>probably
>will) be used for most of the lookups 1. (maybe partially for 3 as the
>primary key is (key, execution_date, task_id, dag_id)) and - depending
>how
>the optimiser is written - with this size of the index it might be
>slower
>than secondary index lookup.
>
>Surprisingly, it looks like the lineage lookup (3.) uses almost full
>table
>scan - it can at most filter the dag_ids on the index but then it has
>to
>full-scan the rest. This is something to optimise regardless.
>
>The current index won't work for sure with utf8mb4 - it fails at
>creation.
>So we have to come up with some alternative approach for both - lookups
>and
>unique check.

512 is probably longer than it needs to be, 250 could be long enough, or 
perhaps we have slightly different behaviour (column length) on mysql and other 
databases?
>
>If decreasing the size of the fields is not a good idea (maybe we
>should
>consider that!), then I think what we could do:
>
>   - hash the content of dag_id, task_id, key (or just hash the key) =
>using md5 or similar and make primary key out of that - this way we can
>   handle uniqueness check
>   - keep the old index (dag_id, task_id, execution_date)
>   - add new index for dag_id, key, execution_date - to support lineage
>   lookup (dag_id, key, execution_date) = *3056 bytes* - just below the
>   limit.
>
>Thoughts?

Another possibility would be to give a TI an integer PK, and then make XCom 
(ti_id, key) (fk not required).

I guess the hash approach has precedent in Airflow, but I'm not a huge fan of 
possible collisions, however rare (and would like to work towards removing it 
from the current tables.)

Ash
>
> J.
>
>
>-ash
>> On Feb 27 2020, at 1:25 pm, Jarek Potiuk <[email protected]>
>wrote:
>> > I think the main issue here is not performance but to make it works
>and
>> be
>> > compatible with 1.10. .
>> >
>> > And indeed if we go back for I'd index we would have to check the
>impact.
>> > Was insert performance the original reason why it has changed ?
>Anyone
>> > knows ?
>> >
>> > I am not sure what it is going to be but taking into account that
>those
>> > inserts in xcom table are always individual insert (not bulks) and
>there
>> is
>> > likely more reads than writes - it might be either way easily.
>> >
>> > J..
>> > czw., 27 lut 2020, 13:55 użytkownik Kamil Breguła <
>> [email protected]>
>> > napisał:
>> >
>> > > Hello,
>> > > Creating a sequence/AUTOINCREMENT will have a very large impact
>on the
>> > > speed of adding rows.
>> > > More information:
>> https://docs.sqlalchemy.org/en/13/faq/performance.html
>> > > In the near future I wanted to speed up the saving of items by
>using a
>> > > low-level API. This will not work if we different PK.
>> > >
>> > > Best regards,
>> > > Kamil
>> > >
>> > > On Thu, Feb 27, 2020 at 1:50 PM Jarek Potiuk
><[email protected]
>> >
>> > > wrote:
>> > > >
>> > > > I am testing now 5.7 in 1.10 branch with the PR here:
>> > > > https://github.com/apache/airflow/pull/7569 and it seems it's
>an
>> easy
>> > > > one one - seems it will "just work".
>> > > >
>> > > > However while testing it (I analysed a customer problem) I have
>found
>> > > > that we have an issue in 2.0 with MySQL (not only 5.7 - it's
>the same
>> > > > for 5.6 and even 8.0 (!). It boils down to an index (at least
>one) is
>> > > > to big for the recommended uf8mb4 encoding:
>> > > >
>> > > > ALTER TABLE xcom ADD CONSTRAINT pk_xcom PRIMARY KEY (dag_id,
>task_id,
>> > > > `key`, execution_date).
>> > > >
>> > > > The utf8mb4 encoding is needed to handle 4-byte UTF characters
>> (mostly
>> > > > emojis but also some Chinese characters).
>> > > >
>> > > > The question is what do we do with that. I think we have two
>options.
>> > > > Not sure what was the story behind increasing the lengths
>(apparently
>> > > > the problem does not
>> > > >
>> > > > a) decrease sizes of some of the fields (that would make
>backwards
>> > > > incompatibility)
>> > > > b) partial indexes of the fields (not really - this is primary
>key so
>> > > > I guess it's not a good idea, although MYSQL supports indexing
>only
>> > > > part of the string - we could cut key uniqueness at 200 chars
>for
>> > > > example - but again, backwards incompatibility is a problem)
>> > > > c) revert to id - generated primary kay and make secondary
>index
>> > > > without the key - there are usually not many xcoms per task, so
>I
>> > > > would not expect any problems - actually I would expect this
>will
>> > > > speed up searches because the index will be much, much smaller
>(50%
>> > > > more or less). I think this is the most viable option.
>> > > >
>> > > > Any other ideas?
>> > > > I described all details in this issue:
>> > > > https://issues.apache.org/jira/browse/AIRFLOW-6947
>> > > >
>> > > > Here is similar story of somone who had similar issue
>> > > > https://serversforhackers.com/c/mysql-utf8-and-indexing
>> > > >
>> > > > J.
>> > > >
>> > > > Jarek Potiuk
>> > > > Polidea | Principal Software Engineer
>> > > >
>> > > > M: +48 660 796 129
Re: MySQL version support for Airflow 5.6 ->? 5.7

Reply via email to