Oh the ti_id idea I mentioned below won't work very well for the case when we query by non-exact execution date
On 1 March 2020 20:06:08 GMT, Ash Berlin-Taylor <[email protected]> wrote: > > >On 1 March 2020 19:40:59 GMT, Jarek Potiuk <[email protected]> >wrote: >>> >>> >>> >>> does c) really help? We'd still need a unique key on the existing PK >>> columns - i.e. does Mysql just complain about the length/size of the >>PK >>> index, but not other unique indexes? >>> >> >>Right indeed - so the problem (and I understand the reason why we >>changed >>1.10 -> 2.0 ) is also the uniqueness of key at insert. I imagine it is >>an >>occasional problem in 1.10 where multiple tasks re-run can insert the >>same >>key twice (ouch!). > >Shouldn't be a problem in practice as we delete XComs for a task before >running out again. Might happen in very rare edge casesi guess. > >> Now I see why the change was introduced. >> >> >>Unfortunately - It won't work if we try to index all 4 columns. Or >even >>3 >>(without execution date). MySQL/InnoDB has index prefix limit >><https://dev.mysql.com/doc/refman/5.7/en/innodb-limits.html> for all >>indexes not only primary (and this is with default settings- 8KB page >>size >>+ large_prefix enabled). In our XCom table, the Key is 512 characters >>which translates to 2048 bytes in case of utf8mb4. Each of the >>dag_id/task_id is 250 characters (luckily not 256 - you will see >why!). >>When you combine dag_id, task_id, key and execution date in single >>index >>you get 4*250 + 4*250 + 4* 512 + 8 = *4056* bytes. And the max key >>length >>for MySQL is *3072* bytes. >> >>So we have to find another way or simply say we don't support mysql's >>full-UTF8-character set. :( . By the way - we are lucky that dag_id >and >>task_id are 250 chars and not 256 - if we were 256, then even utf8mb3 >>would >>not be supported as the index size would be *3080* bytes). >> >> >>> Irrespective of mysql's quirks, we would almost never look up rows >by >>this >>> int-pk (we look up by dag+task+key, and often but not _awlays_ by >>> execution_date as well), so I don't think we'd get any other >benefits >>from >>> it. >>> >> >>The benefit I spoke about is potentially much smaller index to >look-up. >>In >>the current setup the primary key could be used by MSQL to do XCom >>searches >>and depending how well the optimiser works, it could actually be >slower >>than using secondary index. I believe there are some good reasons why >>the >>size of index are limited (index is read in 16KB blocks at a time so >>you >>have max 5 entries for index loaded at a time with 3072 size index and >>max >>8 with 2100 one. That might affect performance of the search - I bet >>with >>such big indexes we often hit the limits of cache an a lot of the >index >>is >>read from the disk rather than kept in memory which is a huge >>performance >>problem that you start to see only when you have a lot of rows (in >this >>case - a lot of XCom values stored) - this is a big problem the >maximum >>size of index in MySQL tries to avoid. > >Experience has taught me that if you want the mysql optimiser to do >something sensible: it won't. This may be better in newer versions, but >was still a problem in 5.5. > >Smaller index lookups from where/what? All queries from dags etc would >not use `WHERE id = ?` as you detail below. > >>I checked that we have those lookups for Xcom: >> >>1. ( dag_ids, task_id, key, execution_date == date) >>2. ( dag_ids, task_id, key, execution_date <= date) >>3. ( dag_ids, key, execution_date == date) (no task_id - for >lineage) >> >>I could not find the "no execution-date" lookup for XCom. Did I miss >it >>somehow ? Or maybe it's something that was in the past but it's not >any >>more? > >Ah no, it's probably the <= case I was thinking of (where you can >select XCom for past dates too.) > >> >>In 1.10 (and it is still in 2.0) we have this index : >>'idx_xcom_dag_task_date', 'xcom', ['dag_id', 'task_id', >>'execution_date'] - >>and it was used for most of the lookups (assumption was that key >number >>was >>small). But in 2.0 the primary key as defined currently can (and >>probably >>will) be used for most of the lookups 1. (maybe partially for 3 as the >>primary key is (key, execution_date, task_id, dag_id)) and - depending >>how >>the optimiser is written - with this size of the index it might be >>slower >>than secondary index lookup. >> >>Surprisingly, it looks like the lineage lookup (3.) uses almost full >>table >>scan - it can at most filter the dag_ids on the index but then it has >>to >>full-scan the rest. This is something to optimise regardless. >> >>The current index won't work for sure with utf8mb4 - it fails at >>creation. >>So we have to come up with some alternative approach for both - >lookups >>and >>unique check. > >512 is probably longer than it needs to be, 250 could be long enough, >or perhaps we have slightly different behaviour (column length) on >mysql and other databases? >> >>If decreasing the size of the fields is not a good idea (maybe we >>should >>consider that!), then I think what we could do: >> >> - hash the content of dag_id, task_id, key (or just hash the key) = >>using md5 or similar and make primary key out of that - this way we >can >> handle uniqueness check >> - keep the old index (dag_id, task_id, execution_date) >> - add new index for dag_id, key, execution_date - to support >lineage >> lookup (dag_id, key, execution_date) = *3056 bytes* - just below >the >> limit. >> >>Thoughts? > >Another possibility would be to give a TI an integer PK, and then make >XCom (ti_id, key) (fk not required). > >I guess the hash approach has precedent in Airflow, but I'm not a huge >fan of possible collisions, however rare (and would like to work >towards removing it from the current tables.) > >Ash >> >> J. >> >> >>-ash >>> On Feb 27 2020, at 1:25 pm, Jarek Potiuk <[email protected]> >>wrote: >>> > I think the main issue here is not performance but to make it >works >>and >>> be >>> > compatible with 1.10. . >>> > >>> > And indeed if we go back for I'd index we would have to check the >>impact. >>> > Was insert performance the original reason why it has changed ? >>Anyone >>> > knows ? >>> > >>> > I am not sure what it is going to be but taking into account that >>those >>> > inserts in xcom table are always individual insert (not bulks) and >>there >>> is >>> > likely more reads than writes - it might be either way easily. >>> > >>> > J.. >>> > czw., 27 lut 2020, 13:55 użytkownik Kamil Breguła < >>> [email protected]> >>> > napisał: >>> > >>> > > Hello, >>> > > Creating a sequence/AUTOINCREMENT will have a very large impact >>on the >>> > > speed of adding rows. >>> > > More information: >>> https://docs.sqlalchemy.org/en/13/faq/performance.html >>> > > In the near future I wanted to speed up the saving of items by >>using a >>> > > low-level API. This will not work if we different PK. >>> > > >>> > > Best regards, >>> > > Kamil >>> > > >>> > > On Thu, Feb 27, 2020 at 1:50 PM Jarek Potiuk >><[email protected] >>> > >>> > > wrote: >>> > > > >>> > > > I am testing now 5.7 in 1.10 branch with the PR here: >>> > > > https://github.com/apache/airflow/pull/7569 and it seems it's >>an >>> easy >>> > > > one one - seems it will "just work". >>> > > > >>> > > > However while testing it (I analysed a customer problem) I >have >>found >>> > > > that we have an issue in 2.0 with MySQL (not only 5.7 - it's >>the same >>> > > > for 5.6 and even 8.0 (!). It boils down to an index (at least >>one) is >>> > > > to big for the recommended uf8mb4 encoding: >>> > > > >>> > > > ALTER TABLE xcom ADD CONSTRAINT pk_xcom PRIMARY KEY (dag_id, >>task_id, >>> > > > `key`, execution_date). >>> > > > >>> > > > The utf8mb4 encoding is needed to handle 4-byte UTF characters >>> (mostly >>> > > > emojis but also some Chinese characters). >>> > > > >>> > > > The question is what do we do with that. I think we have two >>options. >>> > > > Not sure what was the story behind increasing the lengths >>(apparently >>> > > > the problem does not >>> > > > >>> > > > a) decrease sizes of some of the fields (that would make >>backwards >>> > > > incompatibility) >>> > > > b) partial indexes of the fields (not really - this is primary >>key so >>> > > > I guess it's not a good idea, although MYSQL supports indexing >>only >>> > > > part of the string - we could cut key uniqueness at 200 chars >>for >>> > > > example - but again, backwards incompatibility is a problem) >>> > > > c) revert to id - generated primary kay and make secondary >>index >>> > > > without the key - there are usually not many xcoms per task, >so >>I >>> > > > would not expect any problems - actually I would expect this >>will >>> > > > speed up searches because the index will be much, much smaller >>(50% >>> > > > more or less). I think this is the most viable option. >>> > > > >>> > > > Any other ideas? >>> > > > I described all details in this issue: >>> > > > https://issues.apache.org/jira/browse/AIRFLOW-6947 >>> > > > >>> > > > Here is similar story of somone who had similar issue >>> > > > https://serversforhackers.com/c/mysql-utf8-and-indexing >>> > > > >>> > > > J. >>> > > > >>> > > > Jarek Potiuk >>> > > > Polidea | Principal Software Engineer >>> > > > >>> > > > M: +48 660 796 129
