Re: AIP-42: Dynamic Task Mapping

Jarek Potiuk Wed, 10 Nov 2021 03:32:16 -0800

> So yes, you can use a generator literal, but it will be evaluated to 
> completion at DAG parse time.


Makes perfect sense.

> 2) For the UI part - I think we should consider "filtering" from day one and 
> first-class citizens. Filtering for "failed" tasks seems like a super-useful 
> feature for operations people.
>
>
> Yes, Brent and I have thought about that -- it's somewhat orthogonal to this 
> AIP in that the AIP doesn't depend filtering in the UI, and there are cases 
> where filtering would already be useful. But we will look at doing it as part 
> of this project.

Cool!

> Roughly what I'm proposing is the PK on the task_instance table becomes 
> something like (dag_run_id, task_id, mapping_index) where mapping_index is 
> now an array index into a JSON value in a row in the new task_mapping table. 
> And since primary key columns can't be nullable, and arrays are zero indexed, 
> we'd have to use -1 as "not mapped" value.

> Does that make any sense?

Absolutely. That was very much what I also had in mind. I think
indexing with auto-incremented id and keeping a table with the index
array makes way more sense to be honest.

It also makes it possible to implement a (potentially interesting)
case where mapping key values will actually be multiplicated without
adding any artificial indexing column. I am not sure how practical the
case is (though I believe it is quite a common case), but there is a
range of tasks  that might have different output even if they have the
same input.  Any kind of randomisation might cause for example that
the same learning task with exactly the same parameters will lead to a
different output. And you might want to run N of such identical tasks
and average, or somehow differently aggregate the result of those N
tasks with identical input.

It was also possible with the original design by adding an extra index
field to JSON which would make it unique, but it was a bit "not clean"
in the sense that it made the input "identic-ish". With the design
where we keep TaskMapping and index it with the index in the table, we
have a much more "clean" solution for this. You can clearly see which
of those tasks had identical input by simply comparing the JSONS.
Overall I think having a unique index to handle this case is generally
better design - even if we could index JSONB columns in mysql.

> There's still a bit more info to work out, such as how we find the right row 
> in task_mapping table, as it is associated with the _parent_ task, not the 
> mapped task itself.
Yep.

> Oh, it gets a bit more complicated actually! It is possible to do ` 
> task.map(x=list1, y=list2)` to map over two lists at once (making a cartesian 
> product) and in this case the mapping would be unique per _consuming_ tasks.

Yeah - as usual - the more digging, the more discovered. I think
simply each task in the group should have a unique, incremental ID
assigned. For cartesians that would mean that we have to agree on the
sequence of the inputs to take into account to calculate the index
(but they simply could be alphabetically sorted). And it also means
that we have to "fix it" at particular dag run "mapping evaluation"
time (but that's precisely what TaskMapping table will do I
understand).

J.

Re: AIP-42: Dynamic Task Mapping

Reply via email to