Hi Nadav,

It does make sense. In fact, we actually have action-level resources already; however, they are limited to the configuration files for the container. I also think we need to revisit the way we set those up. Currently we use YARN/Mesos to copy dependencies to the containers. With YARN 3.0, I think it makes sense to move to Docker as the way to manage resources in the containers. This should also have performance benefits + will make life easier (I hope) when we start working on K8s.
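Just to make the idea concrete, here is a rough sketch of what baking an action's dependencies into a Docker image could look like. All names and paths below are hypothetical illustrations, not anything from the actual codebase:

```python
# Illustrative sketch only: generate a Dockerfile that bakes an action's
# dependencies into the image at build time, instead of having YARN/Mesos
# copy them into the container at launch. All names here are hypothetical.

def render_dockerfile(base_image, jar_urls, pip_packages):
    """Build Dockerfile text that pre-installs an action's dependencies."""
    lines = ["FROM " + base_image]
    # JVM dependencies: fetch JARs into a known location in the image
    for url in jar_urls:
        lines.append("ADD {} /opt/amaterasu/jars/".format(url))
    # Python dependencies: installed once at build time, not at container start
    if pip_packages:
        lines.append("RUN pip install " + " ".join(pip_packages))
    return "\n".join(lines)

print(render_dockerfile(
    "openjdk:8-jre",
    ["https://repo1.maven.org/maven2/org/example/example/1.0/example-1.0.jar"],
    ["pandas==0.23.4"],
))
```

The point being that the container comes up with everything already in place, which is where the startup-time win (and the smoother K8s story) would come from.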
To do this, I think we need to add the following tasks:
1. Add to the JobManager the functionality to read action-level dependencies
2. Move from Mesos/YARN containers to Docker (probably at least two tasks)

I'll add them to JIRA asap, for version 0.2.1-incubating, if everyone is OK with it.

On Sat, Oct 20, 2018 at 6:43 PM Nadav Har Tzvi <nadavhart...@gmail.com> wrote:

> Hey everyone,
>
> Yaniv and I were just discussing how to resolve dependencies in the new
> frameworks architecture and integrate the dependencies with the concrete
> cluster resource manager (Mesos/YARN).
> We rolled with the idea of each runner (or base runner) performing the
> dependency resolution on its own.
> So for example, the Spark Scala runner would resolve the required JARs and
> do whatever it needs to do with them (e.g. spark-submit --jars --packages
> --repositories, etc.).
> The base Python provider will resolve dependencies and dynamically generate
> a requirements.txt file that will be deployed to the executor.
> The handling of the requirements.txt file differs between different
> concrete Python runners. For example, a regular Python runner would simply
> run pip install, while the PySpark runner would need to rearrange the
> dependencies in a way that would be acceptable to spark-submit
> (https://bytes.grubhub.com/managing-dependencies-and-artifacts-in-pyspark-7641aa89ddb7
> sounds like a decent idea; please comment if you have a better one).
>
> So far, I hope it makes sense.
>
> The next item I want to discuss is as follows:
> In the new architecture, we do hierarchical runtime environment resolution,
> starting at the top job level and drilling down to the action level,
> outputting one unified environment configuration file that is deployed to
> the executor.
> I suggest doing the same with dependencies.
> Currently, we only have job-level dependencies.
> I suggest that we provide action-level dependencies and resolve them in
> exactly the same manner as we resolve the environment.
> There should be quite a few benefits to this approach:
>
> 1. It will give the option to have different versions of the same
>    package in different actions. This is especially important if you have
>    2+ pipeline developers working independently; it would reduce the
>    integration costs by letting each action be more self-contained.
> 2. It should lower the startup time per action. The more dependencies
>    you have, the longer it takes to resolve and install them. Actions will
>    no longer get any unnecessary dependencies.
>
> What do you think? Does it make sense?
>
> Cheers,
> Nadav

--
Yaniv Rodenski
+61 477 778 405
ya...@shinto.io
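For what it's worth, the hierarchical resolution Nadav describes in the quoted message could be sketched roughly like this. This is a minimal illustration only, assuming package->version maps for each level; none of these names come from the actual codebase:

```python
# Minimal sketch of hierarchical dependency resolution: job-level
# dependencies are the base, and action-level entries override them
# (e.g. a different version of the same package), mirroring how the
# environment config is resolved top-down. All names are hypothetical.

def resolve_dependencies(job_deps, action_deps):
    """Merge package->version maps; action-level wins on conflicts."""
    merged = dict(job_deps)
    merged.update(action_deps)
    return merged

def to_requirements_txt(deps):
    """Render the merged map as requirements.txt content for the executor."""
    return "\n".join("{}=={}".format(pkg, ver)
                     for pkg, ver in sorted(deps.items()))

job_level = {"requests": "2.19.1", "numpy": "1.15.0"}
action_level = {"numpy": "1.14.5"}  # this action pins an older numpy

print(to_requirements_txt(resolve_dependencies(job_level, action_level)))
# numpy==1.14.5
# requests==2.19.1
```

An action with no overrides would just inherit the job-level set, and an action that pins its own version would win, which is exactly the self-containment benefit described above.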