Hi Gino,

Thanks for explaining the scope of Toree.

What I was looking for is a solution where Toree plays the role of a
facade between the client application (in this case the notebook) and the
underlying Spark cluster. So if the client application submits a command,
Toree can accept it, execute it on the underlying Spark infrastructure
(whether standalone, on Mesos, or on YARN), and return the result.

I also like option 2, as I think it is along the lines of my requirement.
However, I'm not sure I have fully understood it.

Essentially, I'm looking for a solution where Jupyter runs on each data
scientist's laptop. Jupyter would issue the command from the laptop, the
Toree client would accept it and send it to the Toree server running on the
Spark cluster, and the Toree server would run it on Spark and return the
results.
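
For context, my understanding is that today the notebook server drives the
kernel through jupyter_client and launches it as a local subprocess, which
is exactly what breaks when the laptop sits outside the cluster network. A
minimal sketch of that default behaviour (the kernel name
"apache_toree_scala" is just the usual Toree install name and is an
assumption on my part):

from jupyter_client.manager import KernelManager

# Default Jupyter behaviour: the kernel is launched as a *local* subprocess
# of the machine running the notebook server, so Toree (the Spark driver)
# would end up on the laptop rather than inside the cluster network.
km = KernelManager(kernel_name="apache_toree_scala")  # kernel name is an assumption
km.start_kernel()

kc = km.client()
kc.start_channels()
kc.execute('sc.parallelize(1 to 100).sum()')  # code goes to Toree over ZeroMQ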

To achieve this with option 2, could one potentially change Jupyter (or add
an extension) so that it sends the request to Toree running behind the
provisioning layer over ZeroMQ (or another protocol such as REST)? Something
along the lines of the sketch below is what I have in mind.
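
To make the question concrete, here is a very rough sketch of what I imagine
such an extension could look like. The provisioner URL, its payload, and the
RemoteToreeKernelManager class are all hypothetical; the only real pieces are
jupyter_client's KernelManager and its connection-info handling:

import requests  # assuming the provisioning layer exposes a REST endpoint
from jupyter_client.manager import KernelManager

PROVISIONER_URL = "https://provisioner.example.com/kernels"  # hypothetical

class RemoteToreeKernelManager(KernelManager):
    """Sketch: instead of spawning a local process, ask a provisioning
    service inside the cluster network to start a Toree kernel, then talk
    to it over the standard Jupyter ZeroMQ channels."""

    def start_kernel(self, **kwargs):
        # Ask the provisioner to launch Toree next to the Spark cluster.
        resp = requests.post(PROVISIONER_URL, json={"kernel": "apache_toree_scala"})
        resp.raise_for_status()
        # Assume the provisioner returns standard Jupyter connection info
        # (ip, ZeroMQ ports, HMAC key) for the remote kernel.
        self.load_connection_info(resp.json())
        # Liveness checks, interrupt and shutdown would also have to go
        # through the provisioner; this sketch only covers startup.

Jupyter would then be configured to use this manager, and the provisioning
layer would still have to expose or proxy the kernel's ZeroMQ ports through
the firewall, which I guess is the "proxy the Kernels communication
channels" part you mentioned. Does that match what you had in mind for
option 2?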

Regards,
Sourav

On Thu, May 5, 2016 at 6:47 AM, Gino Bustelo <[email protected]> wrote:

> >>>>>>>>>>>>>>>>>>>
> Hi Gino,
>
> It does not solve the problem of running a Spark job  (on Yarn) remotely
> from a Jupyter notebook which is running on say in a laptop/some machine.
>
> The issue is in yarn-client mode the laptop needs to get access to all the
> slave nodes where the executors would be running. In a typical security
> scenario of an organization the slave nodes are behind firewall and cannot
> be accessed from any random machine outside.
>
> Regards,
> Sourav
> >>>>>>>>>>>>>>>>>>>
>
>
> Sourav, I'm very much aware of the network implications of Spark (not
> exclusive to YARN). The typical way that I've seen this problem solved is:
>
> 1. You manage/host Jupyter in a privileged network space that has access
> to the Spark cluster. This involves no code changes to either Jupyter or
> Toree, but it has the added cost for the service provider of managing this
> frontend tool.
>
> 2. You create a provisioner layer in a privileged network space to manage
> kernels (Toree) and modify Jupyter through extensions so that it knows how
> to communicate with that provisioner layer. The pro of this is that you
> don't have to manage the notebooks, but the service provider still needs
> to build that provisioning layer and proxy the kernels' communication
> channels.
>
> My preference is for #2. I think that frontend tools do not need to live
> close to Spark, but processes like Toree should be as close to the compute
> cluster as possible.
>
> Toree's scope is to be a Spark Driver program that allows "interactive
> computing". It is not its scope to provide a full-fledged
> provisioning/hosting solution for accessing Spark. That is left to the
> implementers of Spark offerings, who select the best way to manage Toree
> kernels (e.g. YARN, Mesos, Docker, etc.).
>
> Thanks,
> Gino
>
> On Sat, Apr 30, 2016 at 9:53 PM, Gino Bustelo <[email protected]> wrote:
>
> > This is not possible without extending Jupyter. By default, Jupyter
> > starts kernels as local processes. To be able to launch remote kernels,
> > you need to provide an extension to the KernelManager and have some sort
> > of kernel provisioner to then manage the remote kernels. It is not hard
> > to do, but there is really nothing out there that I know of that you can
> > use out of the box.
> >
> > Gino B.
> >
> > > On Apr 30, 2016, at 6:25 PM, Sourav Mazumder <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > >
> > > Is there any documentation that can be used to configure a local
> > > Jupyter process to talk to a remote Apache Toree server?
> > >
> > > Regards,
> > > Sourav
> >
>
