Re: MySQL version support for Airflow 5.6 ->? 5.7

Ash Berlin-Taylor Sun, 01 Mar 2020 12:36:20 -0800

Oh the ti_id idea I mentioned below won't work very well for the case when we 
query by non-exact execution date


On 1 March 2020 20:06:08 GMT, Ash Berlin-Taylor <[email protected]> wrote:
>
>
>On 1 March 2020 19:40:59 GMT, Jarek Potiuk <[email protected]>
>wrote:
>>>
>>>
>>>
>>> does c) really help? We'd still need a unique key on the existing PK
>>> columns - i.e. does Mysql just complain about the length/size of the
>>PK
>>> index, but not other unique indexes?
>>>
>>
>>Right indeed - so  the problem (and I understand the reason why we
>>changed
>>1.10 -> 2.0 ) is also the uniqueness of key at insert. I imagine it is
>>an
>>occasional problem in 1.10 where multiple tasks re-run can insert the
>>same
>>key twice (ouch!).
>
>Shouldn't be a problem in practice as we delete XComs for a task before
>running out again. Might happen in very rare edge casesi guess.
>
>> Now I see why the change was introduced.
>>
>>
>>Unfortunately - It won't work if we try to index all 4 columns. Or
>even
>>3
>>(without execution date).  MySQL/InnoDB has index prefix limit
>><https://dev.mysql.com/doc/refman/5.7/en/innodb-limits.html> for all
>>indexes not only primary (and this is with default settings- 8KB page
>>size
>>+ large_prefix enabled).  In our XCom table, the Key is 512 characters
>>which translates to 2048 bytes in case of utf8mb4. Each of the
>>dag_id/task_id is 250 characters (luckily not 256 - you will see
>why!).
>>When you combine dag_id, task_id, key and execution date in single
>>index
>>you get 4*250 + 4*250 + 4* 512 + 8 = *4056* bytes. And the max key
>>length
>>for MySQL is *3072* bytes.
>>
>>So we have to find another way or simply say we don't support mysql's
>>full-UTF8-character set. :( . By the way - we are lucky that dag_id
>and
>>task_id are 250 chars and not 256 - if we were 256, then even utf8mb3
>>would
>>not be supported as the index size would be *3080* bytes).
>>
>>
>>> Irrespective of mysql's quirks, we would almost never look up rows
>by
>>this
>>> int-pk (we look up by dag+task+key, and often but not _awlays_ by
>>> execution_date as well), so I don't think we'd get any other
>benefits
>>from
>>> it.
>>>
>>
>>The benefit I spoke about is potentially much smaller index to
>look-up.
>>In
>>the current setup the primary key could be used by MSQL to do XCom
>>searches
>>and depending how well the optimiser works, it could actually be
>slower
>>than using secondary index. I believe there are some good reasons why
>>the
>>size of index are limited (index is read in 16KB blocks at a time so
>>you
>>have max 5 entries for index loaded at a time with 3072 size index and
>>max
>>8 with 2100 one. That might affect performance of the search - I bet
>>with
>>such big indexes we often hit the limits of cache an a lot of the
>index
>>is
>>read from the disk rather than kept in memory which is a huge
>>performance
>>problem that you start to see only when you have a lot of rows (in
>this
>>case - a lot of XCom values stored) - this is a big problem the
>maximum
>>size of index in MySQL tries to avoid.
>
>Experience has taught me that if you want the mysql optimiser to do
>something sensible: it won't. This may be better in newer versions, but
>was still a problem in 5.5.
>
>Smaller index lookups from where/what? All queries from dags etc would
>not use `WHERE id = ?` as you detail below.
>
>>I checked that we have those lookups for Xcom:
>>
>>1. ( dag_ids, task_id, key, execution_date ==  date)
>>2. ( dag_ids, task_id, key, execution_date <=  date)
>>3. ( dag_ids, key, execution_date ==  date) (no task_id  - for
>lineage)
>>
>>I could not find the "no execution-date" lookup for XCom. Did I miss
>it
>>somehow ? Or maybe it's something that was in the past but it's not
>any
>>more?
>
>Ah no, it's probably the <= case I was thinking of (where you can
>select XCom for past dates too.)
>
>>
>>In 1.10 (and it is still in 2.0) we have this index :
>>'idx_xcom_dag_task_date', 'xcom', ['dag_id', 'task_id',
>>'execution_date'] -
>>and it was used for most of the lookups (assumption was that key
>number
>>was
>>small). But in 2.0 the primary key as defined currently can (and
>>probably
>>will) be used for most of the lookups 1. (maybe partially for 3 as the
>>primary key is (key, execution_date, task_id, dag_id)) and - depending
>>how
>>the optimiser is written - with this size of the index it might be
>>slower
>>than secondary index lookup.
>>
>>Surprisingly, it looks like the lineage lookup (3.) uses almost full
>>table
>>scan - it can at most filter the dag_ids on the index but then it has
>>to
>>full-scan the rest. This is something to optimise regardless.
>>
>>The current index won't work for sure with utf8mb4 - it fails at
>>creation.
>>So we have to come up with some alternative approach for both -
>lookups
>>and
>>unique check.
>
>512 is probably longer than it needs to be, 250 could be long enough,
>or perhaps we have slightly different behaviour (column length) on
>mysql and other databases?
>>
>>If decreasing the size of the fields is not a good idea (maybe we
>>should
>>consider that!), then I think what we could do:
>>
>>   - hash the content of dag_id, task_id, key (or just hash the key) =
>>using md5 or similar and make primary key out of that - this way we
>can
>>   handle uniqueness check
>>   - keep the old index (dag_id, task_id, execution_date)
>>   - add new index for dag_id, key, execution_date - to support
>lineage
>>   lookup (dag_id, key, execution_date) = *3056 bytes* - just below
>the
>>   limit.
>>
>>Thoughts?
>
>Another possibility would be to give a TI an integer PK, and then make
>XCom (ti_id, key) (fk not required).
>
>I guess the hash approach has precedent in Airflow, but I'm not a huge
>fan of possible collisions, however rare (and would like to work
>towards removing it from the current tables.)
>
>Ash
>>
>> J.
>>
>>
>>-ash
>>> On Feb 27 2020, at 1:25 pm, Jarek Potiuk <[email protected]>
>>wrote:
>>> > I think the main issue here is not performance but to make it
>works
>>and
>>> be
>>> > compatible with 1.10. .
>>> >
>>> > And indeed if we go back for I'd index we would have to check the
>>impact.
>>> > Was insert performance the original reason why it has changed ?
>>Anyone
>>> > knows ?
>>> >
>>> > I am not sure what it is going to be but taking into account that
>>those
>>> > inserts in xcom table are always individual insert (not bulks) and
>>there
>>> is
>>> > likely more reads than writes - it might be either way easily.
>>> >
>>> > J..
>>> > czw., 27 lut 2020, 13:55 użytkownik Kamil Breguła <
>>> [email protected]>
>>> > napisał:
>>> >
>>> > > Hello,
>>> > > Creating a sequence/AUTOINCREMENT will have a very large impact
>>on the
>>> > > speed of adding rows.
>>> > > More information:
>>> https://docs.sqlalchemy.org/en/13/faq/performance.html
>>> > > In the near future I wanted to speed up the saving of items by
>>using a
>>> > > low-level API. This will not work if we different PK.
>>> > >
>>> > > Best regards,
>>> > > Kamil
>>> > >
>>> > > On Thu, Feb 27, 2020 at 1:50 PM Jarek Potiuk
>><[email protected]
>>> >
>>> > > wrote:
>>> > > >
>>> > > > I am testing now 5.7 in 1.10 branch with the PR here:
>>> > > > https://github.com/apache/airflow/pull/7569 and it seems it's
>>an
>>> easy
>>> > > > one one - seems it will "just work".
>>> > > >
>>> > > > However while testing it (I analysed a customer problem) I
>have
>>found
>>> > > > that we have an issue in 2.0 with MySQL (not only 5.7 - it's
>>the same
>>> > > > for 5.6 and even 8.0 (!). It boils down to an index (at least
>>one) is
>>> > > > to big for the recommended uf8mb4 encoding:
>>> > > >
>>> > > > ALTER TABLE xcom ADD CONSTRAINT pk_xcom PRIMARY KEY (dag_id,
>>task_id,
>>> > > > `key`, execution_date).
>>> > > >
>>> > > > The utf8mb4 encoding is needed to handle 4-byte UTF characters
>>> (mostly
>>> > > > emojis but also some Chinese characters).
>>> > > >
>>> > > > The question is what do we do with that. I think we have two
>>options.
>>> > > > Not sure what was the story behind increasing the lengths
>>(apparently
>>> > > > the problem does not
>>> > > >
>>> > > > a) decrease sizes of some of the fields (that would make
>>backwards
>>> > > > incompatibility)
>>> > > > b) partial indexes of the fields (not really - this is primary
>>key so
>>> > > > I guess it's not a good idea, although MYSQL supports indexing
>>only
>>> > > > part of the string - we could cut key uniqueness at 200 chars
>>for
>>> > > > example - but again, backwards incompatibility is a problem)
>>> > > > c) revert to id - generated primary kay and make secondary
>>index
>>> > > > without the key - there are usually not many xcoms per task,
>so
>>I
>>> > > > would not expect any problems - actually I would expect this
>>will
>>> > > > speed up searches because the index will be much, much smaller
>>(50%
>>> > > > more or less). I think this is the most viable option.
>>> > > >
>>> > > > Any other ideas?
>>> > > > I described all details in this issue:
>>> > > > https://issues.apache.org/jira/browse/AIRFLOW-6947
>>> > > >
>>> > > > Here is similar story of somone who had similar issue
>>> > > > https://serversforhackers.com/c/mysql-utf8-and-indexing
>>> > > >
>>> > > > J.
>>> > > >
>>> > > > Jarek Potiuk
>>> > > > Polidea | Principal Software Engineer
>>> > > >
>>> > > > M: +48 660 796 129

Re: MySQL version support for Airflow 5.6 ->? 5.7

Reply via email to