I am kinda inclined towards a solution that could be performed at setup
time. Let's explore in that direction. If we manage to nail this at setup
time, the runtime will be lightning fast (compared to what it is now :))
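
To make this concrete, here is a minimal sketch of what a setup-time step
could look like: a one-time, per-node bootstrap that installs Miniconda
into a well-known shared path, so that executors and containers only have
to activate it at runtime. The path, URL and function name below are
assumptions for the sake of illustration, not existing Amaterasu code:

    import os
    import subprocess

    # Assumed node-local install location, shared by all jobs (illustrative).
    MINICONDA_HOME = "/opt/amaterasu/miniconda"
    INSTALLER = "Miniconda2-latest-Linux-x86_64.sh"
    INSTALLER_URL = "https://repo.continuum.io/miniconda/" + INSTALLER

    def bootstrap_node():
        """One-time, per-node setup: install Miniconda only if missing."""
        if os.path.isdir(MINICONDA_HOME):
            return  # already provisioned; nothing left to pay at runtime
        subprocess.check_call(["curl", "-L", "-O", INSTALLER_URL])
        # -b: batch mode (no prompts), -p: install prefix
        subprocess.check_call(["bash", INSTALLER, "-b", "-p", MINICONDA_HOME])

Run once at cluster setup (or lazily on first use per node), something like
this would let Mesos executors and YARN containers on the same machine
share a single installation.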

Cheers,
Nadav


On 28 May 2018 at 00:17, Nadav Har Tzvi <nadavhart...@gmail.com> wrote:

> Hey everyone,
>
> So we have this issue, Anaconda takes forever to deploy on the executors,
> whether it is YARN or Mesos.
>
> Let's first discuss why is it like this right now.
>
> First, let's look, for each platform, at how Apache Amaterasu interacts
> with the underlying platform, with regard to the smallest independent unit
> that is awarded its own isolated execution environment.
>
> *Apache Mesos:*
> In Apache Mesos, we get our own nifty set of instances and executors. An
> instance can obviously host multiple executors, depending on its capacity.
> Thus the smallest independent unit here is the executor itself.
>
> *Apache Hadoop YARN*:
> On YARN, we have a similar set of resources: we have nodes, and each node
> is a host to containers.
>
> Great, so far it sounds similar, right? Here is where Apache Amaterasu
> handles things a bit differently on each platform.
>
> In Apache Mesos, everything runs on the same executor, regardless of how
> many actions the job has. So if the job has 20 actions, they will run
> sequentially on the same executor, making the smallest independent unit
> the job itself, as only the job is awarded its own running environment.
>
> On Hadoop, things are very different.
> To start, each action is treated by YARN as a different application, with
> its own set of containers. This means that on YARN, the action is the
> smallest independent unit.
>
> So what's actually the problem? In general, the problem is that we cannot
> rely on the existence of 3rd party utilities, libraries, you name it, on
> the target execution environment. This forces us to bundle everything we
> need along with the job execution process.
> Anaconda is exactly such a 3rd party utility, and one we desperately need
> in order to run PySpark code that depends on more than PySpark itself and
> pure Python (pandas, numpy, sklearn; there are more than enough examples
> out there).
> We need to install Anaconda once for each execution environment. In Apache
> Mesos our smallest reliable execution environment is the executor itself,
> thus we need to install Anaconda once per job. In YARN, our smallest
> execution environment is the container, hence we need to install Anaconda
> over and over, once for each action.
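>
> (To give a feel for the cost, the bootstrap that every fresh environment
> pays today amounts to roughly the following. The exact steps in our
> launcher differ; the names here are assumed for illustration:)
>
>     import subprocess
>
>     def install_anaconda_and_deps(prefix, packages):
>         """What every new execution environment pays today (sketch)."""
>         # 1. Pull the installer (tens of MB) over the network, every time.
>         subprocess.check_call(["curl", "-L", "-O",
>             "https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh"])
>         # 2. Run the silent install from scratch.
>         subprocess.check_call(["bash", "Miniconda2-latest-Linux-x86_64.sh",
>                                "-b", "-p", prefix])
>         # 3. Resolve and download every dependency again (no cache reuse).
>         subprocess.check_call([prefix + "/bin/conda", "install", "-y"]
>                               + packages)
>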
> This obviously poses a problem, for numerous reasons:
> 1. While we can excuse the first action's install as setup time, it is
> obvious that for the second action we are wasting a lot of time. To
> compare Mesos and YARN: starting the second action on Mesos is a matter of
> seconds; on YARN it is measured in minutes.
> 2. We do the same thing over and over again, even if we run on the same
> machine. This makes no sense whatsoever! We lose the ability to cache
> things. For example, if I need numpy and it takes about 20-30 seconds to
> download and install, why do I need to install it from scratch over and
> over again?
> 3. It causes code reliability issues. If Miniconda isn't there and I need
> to roll a PySpark job, I now have to set up guards, fallbacks and whatnot.
> Even worse, I have to find weird tricks just to get access to the
> Miniconda environment, and those differ between Mesos and YARN, so now I
> have a jungle in the code! (See the sketch after this list.)
> 4. On YARN, PySpark runs in yet another container! Guess what?! That
> container has no access to Miniconda! We currently use --py-files to send
> a list of a gazillion packages. This is different on Mesos, where PySpark
> itself runs in the same executor as the main Amaterasu process.
> So guess what? I now have a jungle in my PySpark invocation code too!
> (Also illustrated below.)
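>
> (As a caricature of the platform-conditional code that points 3 and 4
> force on us; the paths and names here are invented for illustration:)
>
>     import glob
>     import os
>
>     def python_exec_for(platform):
>         """Locate the interpreter differently per platform: the 'jungle'."""
>         if platform == "mesos":
>             # Same executor as the main Amaterasu process; the env is local.
>             return os.path.join(os.getcwd(), "miniconda", "bin", "python")
>         # YARN: a different container; hope the localized files are around.
>         candidates = glob.glob("./miniconda*/bin/python")
>         if not candidates:
>             raise RuntimeError("Miniconda missing: guard? fallback? retry?")
>         return candidates[0]
>
>     def py_files_args(dist_dir):
>         """On YARN only: ship a gazillion packages to the PySpark container."""
>         pkgs = (glob.glob(os.path.join(dist_dir, "*.egg"))
>                 + glob.glob(os.path.join(dist_dir, "*.zip")))
>         return ["--py-files", ",".join(pkgs)]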
>
> Also note that the current implementation for resolving Python 3rd party
> dependencies is Anaconda. This gives us an isolated environment that
> doesn't rely on the existing Python (because maybe, for some reason, you
> have Python 2.5 on your cluster, which is not supported by new versions of
> data libraries such as pandas, numpy and so forth), and in addition it
> gives us the nifty Conda package manager.
> However, that doesn't mean it has to stay this way. If the need or reason
> arises, we may also want to support pip and using the native Python
> version (instead of the one supplied by Anaconda).
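>
> (If we ever go that way, a pip-based variant could be as small as the
> sketch below, assuming the native Python on the cluster is recent enough
> to ship the venv module:)
>
>     import subprocess
>
>     def setup_pip_env(prefix, packages):
>         """Alternative: isolate with venv + pip on the native Python."""
>         # Create an isolated environment on top of the system interpreter.
>         subprocess.check_call(["python3", "-m", "venv", prefix])
>         # Install the job's dependencies into it with pip.
>         subprocess.check_call([prefix + "/bin/pip", "install"] + packages)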
>
> I want to discuss the possible solutions to this. Please feel free to
> bring up your ideas.
>
> Cheers,
> Nadav
>
>
