Hey everyone,

Yaniv and I were just discussing how to resolve dependencies in the new frameworks architecture and integrate them with the concrete cluster resource manager (Mesos/YARN). We went with the idea of each runner (or base runner) performing dependency resolution on its own.

So, for example, the Spark Scala runner would resolve the required JARs and do whatever it needs to do with them (e.g. spark-submit --jars --packages --repositories, etc.). The base Python provider will resolve dependencies and dynamically generate a requirements.txt file that will be deployed to the executor. How the requirements.txt file is handled differs between the concrete Python runners. For example, a regular Python runner would simply run pip install, while the PySpark runner would need to rearrange the dependencies into a form that spark-submit accepts (https://bytes.grubhub.com/managing-dependencies-and-artifacts-in-pyspark-7641aa89ddb7 sounds like a decent idea; please comment if you have a better one).
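To make the Python side a bit more concrete, here is a rough sketch of what I have in mind. Everything named here (PyDependency, render_requirements, plain_python_command, pyspark_command, the package versions, deps.zip, job.py) is made up for illustration and is not existing code. It only builds the commands and does not run them, and the step that installs the dependencies into a directory and zips them for PySpark is assumed rather than shown.

```python
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PyDependency:
    """Hypothetical shape of one resolved Python dependency."""
    package: str
    version: Optional[str] = None  # None means "any version"


def render_requirements(deps: List[PyDependency]) -> str:
    """Render the dependencies as requirements.txt content, one per line."""
    lines = [
        f"{d.package}=={d.version}" if d.version else d.package
        for d in deps
    ]
    return "\n".join(lines) + "\n"


def plain_python_command(requirements_path: str) -> List[str]:
    """Command a regular Python runner could run on the executor."""
    return [sys.executable, "-m", "pip", "install", "-r", requirements_path]


def pyspark_command(deps_zip: str, script: str) -> List[str]:
    """Command a PySpark runner could run, assuming the dependencies were
    already installed into a directory and zipped (packaging not shown)."""
    return ["spark-submit", "--py-files", deps_zip, script]


# Illustrative input: one pinned and one unpinned dependency.
deps = [PyDependency("numpy", "1.15.4"), PyDependency("requests")]
requirements_txt = render_requirements(deps)
```

The point is that the base provider only owns render_requirements, and each concrete runner decides what to do with the resulting file.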
I hope this makes sense so far. The next item I want to discuss is as follows: in the new architecture, we do hierarchical runtime environment resolution, starting at the top job level and drilling down to the action level, outputting one unified environment configuration file that is deployed to the executor. I suggest doing the same with dependencies. Currently, we only have job-level dependencies. I suggest that we provide action-level dependencies and resolve them in exactly the same manner as we resolve the environment.

There should be quite a few benefits to this approach:

1. It will give the option to have different versions of the same package in different actions. This is especially important if you have 2+ pipeline developers working independently; it would reduce integration costs by letting each action be more self-contained.
2. It should lower the startup time per action. The more dependencies you have, the longer it takes to resolve and install them. Actions will no longer get any unnecessary dependencies.

What do you think? Does it make sense?

Cheers,
Nadav
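
P.S. Here is a rough sketch of how the job-to-action resolution could look. It assumes dependencies are simple package-to-version mappings and that a more specific level overrides a more general one for the same package, which is my reading of "the same manner as the environment". The names (resolve_dependencies, job_deps, action_a_deps, action_b_deps) and the versions are made up for illustration.

```python
from typing import Dict

Deps = Dict[str, str]  # package name -> pinned version (assumed shape)


def resolve_dependencies(*levels: Deps) -> Deps:
    """Merge dependency levels from most general to most specific.

    Later levels override earlier ones for the same package name.
    The input mappings are not modified.
    """
    resolved: Deps = {}
    for level in levels:
        resolved.update(level)
    return resolved


# Illustrative job-level and action-level dependencies.
job_deps = {"numpy": "1.15.4", "requests": "2.21.0"}
action_a_deps = {"numpy": "1.16.0"}   # action A needs a newer numpy
action_b_deps = {"pandas": "0.23.4"}  # action B adds pandas

resolved_a = resolve_dependencies(job_deps, action_a_deps)
resolved_b = resolve_dependencies(job_deps, action_b_deps)
```

With this, action A ends up with numpy 1.16.0 while action B keeps the job-level 1.15.4, and pandas is only installed for action B.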