Kyle,

Many thanks for organizing this call, and I apologize for the delay in responding to this thread. I agree with the summary and key action items. I'm particularly interested in features that all frontends can benefit from, e.g. well-supported kernels and associated display protocols. There are a number of things we can do to improve the Spark experience, but with the right composable pieces this can be done entirely in user space.
I look forward to contributing to some of these initiatives. As some may have seen, Cloudera just announced an upcoming Data Science Workbench product (https://www.cloudera.com/products/data-science-and-engineering/data-science-workbench.html). It leverages Jupyter kernels at its core. Some things Kyle mentions, like the lack of clean HTML isolation, do make things more difficult than they should be. But I think nteract, JupyterLab, and Data Science Workbench show how flexible Jupyter is when building on top of the core primitives.

By the way, if anybody is interested in working on these sorts of things full time, we just posted a related software engineering position at Cloudera <https://jobs.jobvite.com/cloudera/job/omEV4fwJ>. Feel free to email me directly.

Tristan

On Monday, March 6, 2017 at 4:38:06 PM UTC-8, rgbkrk wrote:
>
> Alejandro,
>
> Thanks for responding. I did a poor job of maintaining the list of emails of everyone I was reaching out to; hopefully everyone interested is on the jupyter mailing list now and we can hold regular meetings. I'm really happy to see all of your feedback.
>
> This does blend into
>
> Responses inline.
>
> On Mon, Mar 6, 2017 at 2:25 PM, Alejandro Guerrero <[email protected]> wrote:
>
>> Thank you all for your thoughts, and to Kyle for organizing!
>>
>> Sorry I didn't attend the call; I didn't receive an invite. I'd be happy to join further calls.
>>
>> I am the co-creator of sparkmagic <https://github.com/jupyter-incubator/sparkmagic>, which relies on Livy <https://github.com/cloudera/livy> as the connection layer to Spark clusters. Sparkmagic provides Jupyter users with Python, Scala, and R kernels. All kernels have the same features:
>>
>> - SparkSQL magic
>> - Automatic visualizations
>> - Ability to capture Spark dataframes into Pandas dataframes to be visualized with any of Python's visualization libraries
>>
>> I agree that it's important for all of us to try to build a consistent experience for all Jupyter users. We started sparkmagic because we wanted a platform that would:
>>
>> - Provide multiple language support for Spark at the same level
>
> What difficulties did you run into where support needed to be in Apache Spark itself? There are at least two committers on the list and involved now who are interested in improving the support of the libraries themselves.
>
>> - Provide a standardized visualization framework across kernels
>> - Allow users to change the Spark cluster being targeted from the same Jupyter installation, without complicated network setups
>> - Have the installation be as straightforward as possible
>
> <3
>
>> - Add a layer that could handle different authentication methods to clusters (Joy's work on Kerberos authentication <https://www.slideshare.net/SparkSummit/secured-kerberosbased-spark-notebook-for-data-science-spark-summit-east-talk-by-joy-chakraborty> is an example of this)
>>
>> We are happy with what we've achieved so far, but we would like to see the following things happen:
>>
>> - Improvements to the auto-visualization framework. Today, we are using ipywidgets and plotly to do the visualization, and this has led to visualizations not being preserved in documents. We would like to move away from ipywidgets and go with a mimetype-based approach, where everyone can converge.
>
> +1
>
> I'll respond to this a bit more in the section below on the new table/data resource mimetype.
>> - Progress bars/Spark application status/cancel buttons. We see these features as ways for users to monitor cell progress and act on it. Today, users get a "fire code and hope everything is going well" experience; looking at job status requires several clicks, a different tab, and for you to correlate what your cell is doing with what the Spark UI says.
>
> Ryan Blue can chime in on this one more; he ended up writing some custom reprs for the Spark context and jobs for our use at Netflix.
>
> On the Jupyter side, we have a little bit of this outlined in the Spark roadmap for Jupyter: https://github.com/jupyter/roadmap/blob/master/spark.md; it would be great to have more outlined there for us to iterate on.
>
>> - Cluster information. We've seen plenty of errors when clusters run out of resources, and users do not know that the cluster was out of resources, who's using them, or whether they can clean up. We would love to have a cluster status pane that allows users to understand the resource utilization of a cluster (or another cluster, if its status/characteristics are better) and probably do some admin tasks on their clusters.
>
> This one is so important.
>
>> Our team is concerned with Big Data support in Jupyter, so we have few opinions on a "Small Data" Scala kernel. I agree that it would be nice to separate languages from backends from an architectural standpoint. Having Jupyter libraries for JVM-based kernels would be a step in the right direction. Adding Spark and other backends as add-ons to kernels could also be a nice idea, provided we are mindful of what these add-ons' installation and configuration experience ends up looking like for end users. Spark, and I imagine other backends, requires network access to all worker nodes from the driver.
>
> For some organizations (mine included), we provide the necessary network access and the Spark binaries. We will likely never support Livy in our environment. There should be plenty of room for people to use Livy, though, and we can have focused efforts on deployment-agnostic components supporting Spark.
>
>> I'm wary of the experience we'll create if we make kernels the driver and require kernels to be in the cluster. Livy solves a lot of that by making Livy the driver, which is co-located in the cluster, and having Jupyter simply manage connection strings via sparkmagic. In the add-ons-to-kernels way of the world, how would a data scientist target different clusters or backends? What kind of setup work does she have to do?
>>
>> On the visualizations front, I saw an effort to create a mimetype-based visualization library here: https://github.com/gnestor/jupyterlab_table If all kernels, regardless of language (e.g. Python, R, Scala), were to output that mimetype, users would get a standard visualization library to use, and we devs could converge on it.
>
> Over the weekend https://github.com/pandas-dev/pandas/pull/14904 was merged, which exports application/vnd.dataresource+json <http://www.iana.org/assignments/media-types/application/vnd.dataresource+json>.
>
> It's exactly what jupyterlab_table relies on, as well as https://github.com/nteract/nteract/pull/1534. I'm greatly looking forward to this!
>
> I'd like to see some visualization built into the component that uses that mimetype, possibly something Polestar- or Lyra-like (from the Vega folks), as well as some autovisualization.
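For anyone who hasn't followed that PR: the pandas side boils down to a display option, and any kernel can publish the same mimetype directly. A minimal sketch, assuming pandas >= 0.20 (where pandas-dev/pandas#14904 landed) and abbreviating the Table Schema payload:

    # Opt in to Table Schema output: DataFrame reprs then include
    # application/vnd.dataresource+json alongside text/html.
    import pandas as pd
    from IPython.display import display

    pd.set_option("display.html.table_schema", True)

    df = pd.DataFrame({"language": ["Python", "Scala", "R"],
                       "kernels": [3, 4, 2]})
    df  # as the last expression of a cell, renders via the new mimetype

    # Any kernel (JVM or otherwise) can emit the mimetype itself in a
    # display_data message; the dict below is an abbreviated payload.
    display({"application/vnd.dataresource+json": {
        "schema": {"fields": [{"name": "language", "type": "string"},
                              {"name": "kernels", "type": "integer"}]},
        "data": df.to_dict(orient="records"),
    }}, raw=True)

The point being: the convergence happens at the mimetype, not in any one library, so Scala kernels can participate without touching pandas at all.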
:D

>> Best,
>> Alejandro
>>
>> On Monday, March 6, 2017 at 8:58:40 AM UTC-8, Min RK wrote:
>>>
>>> This is awesome, thanks Kyle (and everyone)!
>>>
>>> On Fri, Mar 3, 2017 at 5:14 PM, Kyle Kelley <[email protected]> wrote:
>>>
>>>> On February 27, 2017 a group of us met to talk about Scala kernels and pave a path forward for Scala users. A YouTube video of the discussion is available here:
>>>>
>>>> https://www.youtube.com/watch?v=0NRONVuct0E
>>>>
>>>> What follows is a summary from the call, mostly in linear order from the video itself.
>>>>
>>>> Attendees
>>>>
>>>> - Alexander Archambault - Jupyter Scala, Ammonium
>>>> - Ryan Blue (Netflix) - Toree
>>>> - Gino Bustelo (IBM) - Toree
>>>> - Joy Chakraborty (Bloomberg) - Spark Magic with Livy
>>>> - Kyle Kelley (Netflix) - Jupyter
>>>> - Haley Most (Cloudera) - Toree
>>>> - Marius van Niekerk (Maxpoint) - Toree, Spylon
>>>> - Peter Parente (Maxpoint) - Jupyter
>>>> - Corey Stubbs (IBM) - Toree
>>>> - Jamie Whitacre (Berkeley) - Jupyter
>>>> - Tristan Zajonc (Cloudera) - Toree, Livy
>>>>
>>>> Each of the people on the call has a preferred kernel, a way of building it, and a way of integrating it. We have a significant user experience problem in terms of users installing and using Scala kernels, beyond just Spark usage. The overarching goal is to create a cohesive experience for Scala users when they use Jupyter.
>>>>
>>>> When a Scala user comes to the Jupyter ecosystem (or even a familiar Python developer does), they face many options for kernels. Being faced with so much choice when trying to get things done creates new friction points for users. As examples, see https://twitter.com/chrisalbon/status/833156959150841856 and https://twitter.com/sarah_guido/status/833165030296322049.
>>>>
>>>> What are our foundations for REPL libraries in Scala?
>>>>
>>>> Toree was built on top of the Spark REPL, and its developers tried to use as much code as possible from Spark. For Alex's jupyter-scala, he recognized that the Spark REPL was changing a lot from version to version. At the same time, Ammonite <https://github.com/lihaoyi/Ammonite> was created to assist in Scala scripting. In order to make big data frameworks such as Spark, Flink, and Scio work well in this environment, a fork called Ammonium <https://github.com/alexarchambault/ammonium> was created. There is some amount of trepidation in the kernel community about using a separate fork. We should make sure to unify with the originating Ammonite and contribute back as part of a larger Scala community that can maintain these together.
>>>>
>>>> Action Items:
>>>>
>>>> - Renew focus on Scala within Toree; improve outward messaging about how Toree provides a Scala kernel
>>>> - Unify Ammonite and Ammonium ([email protected])
>>>> - To be used in jupyter-scala, potentially for spylon
>>>>
>>>> There is more than one implementation of the Jupyter protocol in the Java stack.
>>>>
>>>> Toree has one, jupyter-scala has one, and Clojure kernels have their own. People would like to see a stable Jupyter library for the JVM. Some think it's better to have one per language. Regardless of choice, we should have a well-supported Jupyter library.
>>>>
>>>> Action Items:
>>>>
>>>> - Create an idiomatic Java library for the Jupyter messaging protocol
>>>> - Propose this as an incubation project within Jupyter
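To give a flavor of what such a library has to cover, the core of the wire format is small. A sketch in Python for brevity, following the Jupyter messaging spec (a JVM library would do the equivalent over JeroMQ or similar):

    # Jupyter wire-protocol framing: four JSON frames signed with
    # HMAC-SHA256, sent after the <IDS|MSG> delimiter. ZeroMQ routing
    # identities (omitted here) precede the delimiter.
    import hashlib
    import hmac
    import json
    import uuid

    DELIMITER = b"<IDS|MSG>"

    def serialize(msg, key):
        frames = [json.dumps(msg.get(part, {})).encode("utf-8")
                  for part in ("header", "parent_header",
                               "metadata", "content")]
        mac = hmac.new(key, digestmod=hashlib.sha256)
        for frame in frames:
            mac.update(frame)
        return [DELIMITER, mac.hexdigest().encode("ascii")] + frames

    request = {
        "header": {"msg_id": uuid.uuid4().hex, "session": "s1",
                   "username": "user", "msg_type": "kernel_info_request",
                   "version": "5.0"},
        "content": {},
    }
    print(serialize(request, key=b"key-from-the-connection-file"))

Everything beyond this framing is JSON plus socket/session management, which is why a single well-maintained JVM library seems very achievable.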
>>>> Decouple Spark from Scala in kernels
>>>>
>>>> Decouple the language-specific parts from the computing framework to allow for using other computing frameworks. This is paramount for R and Python. When we inevitably want to connect to a GPU cluster, we want to be able to use the same foundations of a kernel. The reason these end up being coupled is that Spark does "slightly weird things" with how it wants its classes compiled. It's thought that there is some amount of specialization and that we can work around it. At the very least, we can bake it into the core and leave room for other frameworks to have solid built-in support if necessary.
>>>>
>>>> An approach being worked on in Toree right now is lazy loading of Spark. One difference between jupyter-scala and Toree is that jupyter-scala can dynamically load Spark versions, whereas Toree is bound to a version of Spark at deployment. For end users that have operators/admins, kernels can be configured per version of Spark they will use (common for Python and R). Spark drives a lot of the interest in Scala kernels, and many kernels conflate the two. This results in poor messaging and experiences for users getting started.
>>>>
>>>> Action Items:
>>>>
>>>> - Lazy-load Spark within Toree
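The lazy-loading idea is easy to see in Python terms. This is a sketch of the pattern only, not Toree's actual implementation: defer creating the session until user code first touches it, so the kernel starts fast and non-Spark sessions never pay the cost.

    # Lazy Spark initialization: the kernel exposes a `spark` name whose
    # session is only created, via getOrCreate(), on first attribute access.
    class LazySparkSession:
        def __init__(self):
            self._session = None

        def _materialize(self):
            if self._session is None:
                # Import inside the method: the kernel works even when
                # pyspark is absent, as long as `spark` is never touched.
                from pyspark.sql import SparkSession
                self._session = SparkSession.builder.getOrCreate()
            return self._session

        def __getattr__(self, name):
            # Only called for attributes not on the wrapper itself, i.e.
            # anything Spark-related; triggers startup on first use.
            return getattr(self._materialize(), name)

    spark = LazySparkSession()
    # No JVM/Spark startup cost yet; this line would trigger it:
    # spark.sql("SELECT 1").show()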
>>>> Focus efforts within kernel communities
>>>>
>>>> Larger in scope than just the Scala kernel: we need Jupyter to acknowledge fully supported kernels. In contrast, the whole community in Zeppelin collaborates in one repository around their interpreters.
>>>>
>>>> "Fragmentation of kernels makes it harder for large enterprises to adopt them."
>>>> - Tristan Zajonc (Cloudera)
>>>>
>>>> Beyond the technical implementation of what a supported kernel is, we also need the messaging to end users to be simple and clear. There are several things we need to do to improve our messaging, organization, and technical underpinnings.
>>>>
>>>> Action Items:
>>>>
>>>> - On the Jupyter site, provide blurbs and links to kernels for R, Python, and Scala
>>>> - Create an organized effort around the Scala kernel, possibly by unifying in an organization while isolating projects in separate repositories
>>>> - Align on a specification of what it takes to be acknowledged as a supported kernel
>>>>
>>>> Visualization
>>>>
>>>> We would like to push on the idea of mimetypes that take a hunk of JSON and draw beautiful visualizations. Having these adopted in core Jupyter by default would go a long way towards providing simple, just-works visualization. The current landscape of visualization with the Scala kernels includes:
>>>>
>>>> - Vegas <https://github.com/vegas-viz/Vegas>
>>>> - Plotly Scala <https://github.com/alexarchambault/plotly-scala>
>>>> - Brunel <https://github.com/Brunel-Visualization/Brunel>
>>>> - Data Resource / Table Schema (see https://github.com/pandas-dev/pandas/pull/14904)
>>>>
>>>> There is a bit of worry about standardization around the HTML outputs. Some libraries try to use frontend libraries that may not exist on the frontend or that mismatch in version - jquery, requirejs, ipywidgets, jupyter, ipython. In some frontends, at times dictated by the operating environment, the HTML outputs must be in null-origin iframes.
>>>>
>>>> Action Items:
>>>>
>>>> - Continue involvement in Jupyter frontends to provide rich visualization out of the box with less configuration and less friction
>>>>
>>>> Standardizing display and reprs for Scala
>>>>
>>>> Since it's likely that there will still be multiple kernels available for the JVM, not just within Scala, we want to standardize the way in which you inspect objects in the JVM. IPython provides a way for libraries to integrate with IPython automatically for users. We want library developers to be able to follow a common scheme and be well represented regardless of the kernel.
>>>>
>>>> Action Items:
>>>>
>>>> - Create a specification for object representation for JVM languages as part of the Jupyter project
>>>>
>>>> --
>>>> Kyle Kelley (@rgbkrk <https://twitter.com/rgbkrk>; lambdaops.com)
>
> --
> Kyle Kelley (@rgbkrk <https://twitter.com/rgbkrk>; lambdaops.com)
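P.S. On that last action item: the pattern a JVM repr specification would mirror already exists in IPython, where a library opts in by defining per-mimetype repr methods that the kernel collects into a display_data message. A minimal Python illustration (the class is hypothetical; the hook names are IPython's real ones):

    # The IPython display protocol: an object advertises rich
    # representations by mimetype, and the frontend picks the richest
    # one it understands.
    class SparkJobStatus:  # hypothetical stand-in for a JVM-side object
        def __init__(self, job_id, progress):
            self.job_id = job_id
            self.progress = progress

        def _repr_html_(self):
            # Picked up by HTML-capable frontends (notebook, nteract, Lab).
            return "<b>Job {}</b>: {:.0%} complete".format(
                self.job_id, self.progress)

        def _repr_json_(self):
            # Structured data for frontends that prefer it.
            return {"job_id": self.job_id, "progress": self.progress}

    # Evaluating SparkJobStatus(42, 0.75) as the last expression of a cell
    # triggers these hooks via IPython's display machinery.

A JVM equivalent would presumably be an interface or type class playing the same role, so that libraries like Vegas and Brunel render well in any compliant kernel rather than targeting one.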
