Hi Nadav,

It does make sense. In fact, we already have action-level resources,
however they are limited to the configuration files for the container.
I also think that we need to revisit the way we set those up. Currently we
use YARN/Mesos to copy dependencies to the containers. With YARN 3.0 I
think it makes sense to move to Docker as the way to manage resources
in the containers.
This should also have performance benefits, and will make life easier (I
hope) when we start working on K8s.
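To make the idea concrete, here is a rough sketch (not an actual Amaterasu
implementation, all names are hypothetical) of generating a per-action
Dockerfile, so an action's dependencies are baked into the image rather than
copied into YARN/Mesos containers:

```python
def action_dockerfile(base_image, pip_deps):
    """Render a Dockerfile that bakes an action's Python deps into its image.

    base_image and pip_deps are assumptions for illustration; the real
    inputs would come from the action's dependency configuration.
    """
    lines = [f"FROM {base_image}"]
    if pip_deps:
        # Installing deps in the image means the container starts ready to run,
        # instead of resolving/copying dependencies at container startup.
        lines.append("RUN pip install " + " ".join(pip_deps))
    lines.append("COPY . /app")
    return "\n".join(lines)

print(action_dockerfile("python:3.10-slim", ["pandas==1.5.3", "pyarrow"]))
```

The same image would then be usable unchanged on YARN 3.x, Mesos, or K8s,
which is where I'd expect the "make life easier" part to come from.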

To do this, I think we need to add the following tasks:
1. Add to the JobManager the functionality to read action-level dependencies
2. Move from Mesos/YARN containers to Docker (probably at least two tasks)
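For task 1, the resolution could mirror the hierarchical environment
resolution Nadav describes: start from the job level and let action-level
entries override. A minimal sketch, assuming hypothetical data shapes (the
real JobManager structures will differ):

```python
def resolve_dependencies(job_deps, action_deps):
    """Merge job-level and action-level dependency pins.

    Action-level pins override job-level pins for the same package, which is
    what allows two actions to use different versions of one package.
    """
    resolved = dict(job_deps)      # job level is the baseline
    resolved.update(action_deps)   # action level wins on conflicts
    return resolved

# Example: one action pins a newer requests while the job default stays put.
job = {"requests": "2.19.0", "numpy": "1.15.0"}
print(resolve_dependencies(job, {"requests": "2.20.0"}))
```

An action with no overrides simply inherits the job-level set, so existing
job-only configurations keep working unchanged.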

I'll add them to JIRA asap, for version 0.2.1-incubating if everyone is OK
with it.

On Sat, Oct 20, 2018 at 6:43 PM Nadav Har Tzvi <nadavhart...@gmail.com>
wrote:

> Hey everyone,
>
> Yaniv and I were just discussing how to resolve dependencies in the new
> frameworks architecture and integrate the dependencies with the concrete
> cluster resource manager (Mesos/YARN).
> We rolled with the idea of each runner (or base runner) performing the
> dependencies resolution on its own.
> So for example, the Spark Scala runner would resolve the required JARs and
> do whatever it needs to do with them (e.g. spark-submit --jars --packages
> --repositories, etc).
> The base Python provider will resolve dependencies and dynamically generate
> a requirements.txt file that will be deployed to the executor.
> The handling of the requirements.txt file differs between different
> concrete Python runners. For example, a regular Python runner would simply
> run pip install, while the pyspark runner would need to rearrange the
> dependencies in a way that is acceptable to spark-submit (
>
> https://bytes.grubhub.com/managing-dependencies-and-artifacts-in-pyspark-7641aa89ddb7
> sounds like a decent idea; comment if you have a better idea, please)
>
> So far I hope it makes sense.
>
> The next item I want to discuss is as follows:
> In the new architecture, we do hierarchical runtime environment resolution,
> starting at the top job level and drilling down to the action level,
> outputting one unified environment configuration file that is deployed to
> the executor.
> I suggest doing the same with dependencies.
> Currently, we only have job-level dependencies. I suggest that we provide
> action-level dependencies and resolve them in exactly the same manner as
> we resolve the environment.
> There should be quite a few benefits to this approach:
>
>    1. It will give the option to have different versions of the same
>    package in different actions. This is especially important if you
>    have 2+ pipeline developers working independently; this would reduce
>    the integration costs by letting each action be more self-contained.
>    2. It should lower the startup time per action. The more dependencies
>    you have, the longer it takes to resolve and install them. Actions
>    will no longer get any unnecessary dependencies.
>
>
> What do you think? Does it make sense?
>
> Cheers,
> Nadav
>


-- 
Yaniv Rodenski

+61 477 778 405
ya...@shinto.io
