Hi Yaniv,

I am happy to pick up the following task:
1. Add to the JobManager the functionality to read action-level dependencies

Regards,
Kirupa

On Tue., 23 Oct. 2018, 11:04 am Yaniv Rodenski, <ya...@shinto.io> wrote:

> Hi Nadav,
>
> It does make sense; in fact, we already have action-level resources,
> however they are limited to the configuration files for the container.
> I also think we need to revisit the way we set those up. Currently we
> use YARN/Mesos to copy dependencies to the containers. With YARN 3.0 I
> think it makes sense to move to using Docker to manage resources in the
> containers.
> This should also have performance benefits and will make life easier (I
> hope) when we start working on K8s.
>
> To do this, I think we need to add the following tasks:
> 1. Add to the JobManager the functionality to read action-level
> dependencies
> 2. Move from Mesos/YARN containers to Docker (probably at least two tasks)
>
> I'll add them to JIRA asap, for version 0.2.1-incubating, if everyone is
> OK with it.
>
> On Sat, Oct 20, 2018 at 6:43 PM Nadav Har Tzvi <nadavhart...@gmail.com>
> wrote:
>
> > Hey everyone,
> >
> > Yaniv and I were just discussing how to resolve dependencies in the new
> > frameworks architecture and how to integrate the dependencies with the
> > concrete cluster resource manager (Mesos/YARN).
> > We rolled with the idea of each runner (or base runner) performing
> > dependency resolution on its own.
> > So, for example, the Spark Scala runner would resolve the required JARs
> > and do whatever it needs to do with them (e.g. spark-submit --jars
> > --packages --repositories, etc.).
> > The base Python provider will resolve dependencies and dynamically
> > generate a requirements.txt file that will be deployed to the executor.
> > The handling of the requirements.txt file differs between the concrete
> > Python runners. For example, a regular Python runner would simply run
> > pip install, while the pyspark runner would need to rearrange the
> > dependencies in a way that is acceptable to spark-submit (
> > https://bytes.grubhub.com/managing-dependencies-and-artifacts-in-pyspark-7641aa89ddb7
> > sounds like a decent approach; please comment if you have a better idea).
> >
> > So far, I hope it makes sense.
> >
> > The next item I want to discuss is as follows:
> > In the new architecture, we do hierarchical runtime environment
> > resolution, starting at the top job level and drilling down to the
> > action level, outputting one unified environment configuration file
> > that is deployed to the executor.
> > I suggest doing the same with dependencies.
> > Currently, we only have job-level dependencies. I suggest that we
> > provide action-level dependencies and resolve them in exactly the same
> > manner as we resolve the environment.
> > There should be quite a few benefits to this approach:
> >
> >    1. It will give us the option to have different versions of the same
> >    package in different actions. This is especially important if you
> >    have 2+ pipeline developers working independently; it would reduce
> >    integration costs by letting each action be more self-contained.
> >    2. It should lower the startup time per action. The more dependencies
> >    you have, the longer it takes to resolve and install them. Actions
> >    will no longer get any unnecessary dependencies.
> >
> > What do you think? Does it make sense?
> >
> > Cheers,
> > Nadav
> >
>
> --
> Yaniv Rodenski
>
> +61 477 778 405
> ya...@shinto.io
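
To make sure I understand the per-runner resolution described above, here is a
rough Scala sketch. All of the names (Artifact, DependencyResolver,
SparkScalaResolver, PythonResolver) are made up for illustration and are not
existing Amaterasu APIs; the only point is that every runner receives the same
resolved dependency list and turns it into whatever its launcher understands,
e.g. spark-submit flags for the Spark Scala runner versus requirements.txt
lines for the Python provider.

// Hypothetical sketch: each runner owns the translation of resolved
// dependencies into the form its launcher understands.
final case class Artifact(groupId: String, artifactId: String, version: String)

trait DependencyResolver {
  // Each concrete runner decides what "resolving" means for it.
  def resolve(deps: Seq[Artifact]): Seq[String]
}

// Spark Scala runner: hand the artifacts to spark-submit as --packages.
object SparkScalaResolver extends DependencyResolver {
  def resolve(deps: Seq[Artifact]): Seq[String] =
    if (deps.isEmpty) Seq.empty
    else Seq("--packages",
      deps.map(a => s"${a.groupId}:${a.artifactId}:${a.version}").mkString(","))
}

// Base Python provider: emit requirements.txt lines; the concrete Python
// runner decides whether to pip-install them or repackage them for pyspark.
object PythonResolver extends DependencyResolver {
  def resolve(deps: Seq[Artifact]): Seq[String] =
    deps.map(a => s"${a.artifactId}==${a.version}")
}

object ResolverDemo extends App {
  val deps = Seq(Artifact("org.apache.avro", "avro", "1.8.2"))
  println(SparkScalaResolver.resolve(deps).mkString(" ")) // --packages org.apache.avro:avro:1.8.2
  println(PythonResolver.resolve(deps).mkString("\n"))    // avro==1.8.2
}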
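
And for the hierarchical (job-level plus action-level) dependency resolution
Nadav suggests, a minimal sketch of one possible merge rule, mirroring the way
the environment is resolved. The names and the "action-level wins" policy are
assumptions for discussion, not current JobManager behaviour.

// Hypothetical sketch of hierarchical dependency resolution: start from the
// job-level dependency set, then let each action override or add entries,
// exactly like the environment merge.
final case class Dep(coordinate: String, version: String)

object DependencyMerger {
  // Key by coordinate so an action-level entry replaces the job-level
  // version of the same package, and new packages are simply added.
  def effectiveDeps(jobLevel: Seq[Dep], actionLevel: Seq[Dep]): Seq[Dep] = {
    val merged = (jobLevel ++ actionLevel)
      .map(d => d.coordinate -> d)
      .toMap // later (action-level) entries win
    merged.values.toSeq.sortBy(_.coordinate)
  }
}

object MergeDemo extends App {
  val job    = Seq(Dep("org.apache.avro:avro", "1.8.2"), Dep("joda-time:joda-time", "2.9.9"))
  val action = Seq(Dep("org.apache.avro:avro", "1.9.0")) // this action needs a newer avro
  DependencyMerger.effectiveDeps(job, action).foreach(println)
  // Dep(joda-time:joda-time,2.9.9)
  // Dep(org.apache.avro:avro,1.9.0)
}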