Thank you all for your thoughts and to Kyle for organizing!

Sorry I didn't attend the call, but I didn't receive an invite. I'd be 
happy to join further calls.

I am the co-creator of sparkmagic 
<https://github.com/jupyter-incubator/sparkmagic>, which relies on Livy 
<https://github.com/cloudera/livy> as the connection layer to Spark 
clusters. Sparkmagic provides Jupyter users with Python, Scala, and R 
kernels. All kernels have the same features:

   - SparkSQL magic
   - Automatic visualizations
   - Ability to capture Spark dataframes into Pandas dataframes to be 
   visualized with any of Python's visualization libraries
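For concreteness, here is a rough sketch of the kind of JSON sparkmagic exchanges with Livy to run code remotely, following Livy's REST API (POST /sessions to start a driver, POST /sessions/{id}/statements to run code). The helper names below are mine, not sparkmagic's:

```python
import json

# Sketch of the payloads a Livy client sends. The endpoint shapes follow
# Livy's REST API; the helper names are illustrative only.

def session_payload(kind):
    """Body for POST /sessions: ask Livy to start a Spark driver of the
    given language kind ('spark', 'pyspark', or 'sparkr')."""
    return {"kind": kind}

def statement_payload(code):
    """Body for POST /sessions/{id}/statements: code to run remotely."""
    return {"code": code}

# What a statement submission would look like on the wire:
body = json.dumps(statement_payload("df = spark.range(10).toPandas()"))
print(body)
```

The nice property is that the kernel never talks to the cluster directly; it only needs HTTP access to the Livy endpoint.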

I agree that it's important for all of us to try to build a consistent 
experience for all Jupyter users. We started sparkmagic because we wanted a 
platform that would:

   - Provide multiple language support for Spark at the same level
   - Provide a standardized visualization framework across kernels
   - Allow for users to change the Spark cluster that is being targeted 
   from the same Jupyter installation, without complicated network setups
   - Have the installation be as straightforward as possible
   - Add a layer that could handle different authentication methods to 
   clusters (Joy's work on Kerberos authentication 
   <https://www.slideshare.net/SparkSummit/secured-kerberosbased-spark-notebook-for-data-science-spark-summit-east-talk-by-joy-chakraborty> 
   is an example of this)

We are happy with what we've achieved so far, but we would like to see the 
following things happen:

   - Improvements to the auto-visualization framework. Today, we use 
   ipywidgets and plotly to do the visualization, which has led to 
   visualizations not being preserved in documents. We would like to move away 
   from ipywidgets and go with a mimetype-based approach on which everyone can 
   converge.
   - Progress bars/Spark application status/cancel buttons. We see these 
   features as ways for users to monitor cell progress and act on it. Today, 
   users get a "fire code and hope everything goes well" experience; looking 
   at job status requires several clicks, a different tab, and correlating 
   what your cell is doing with what the Spark UI says.
   - Cluster information. We've seen plenty of errors when clusters run out 
   of resources, where users did not know that the cluster was out of 
   resources, who was using them, or whether they could clean up. We would 
   love to have a cluster status pane that lets users understand the resource 
   utilization of a cluster (or of another cluster, if its 
   status/characteristics are better) and perhaps do some admin tasks on 
   their clusters.
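To sketch the progress-bar idea: a hook could poll Livy's statement endpoint and render its state. Assuming a statement record shaped like {'id', 'state', 'progress'} (the progress field only exists in newer Livy versions), an illustrative formatter might look like:

```python
def describe_statement(stmt):
    """Turn a Livy statement record into a one-line progress message.

    Assumes the record shape {'id': int, 'state': str, 'progress': float
    between 0 and 1}; since 'progress' is not present in older Livy
    versions, we fall back to just the state.
    """
    state = stmt.get("state", "unknown")
    progress = stmt.get("progress")
    if progress is None:
        return f"statement {stmt.get('id')}: {state}"
    return f"statement {stmt.get('id')}: {state} ({progress:.0%})"

print(describe_statement({"id": 3, "state": "running", "progress": 0.42}))
```

A cancel button would be the same idea in reverse: one DELETE-style call against the statement, instead of a trip to the Spark UI.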

Our team is concerned with Big Data support in Jupyter, so we have few 
opinions on a "Small Data" Scala kernel. I agree that it would be nice to 
separate languages from backends from an architectural standpoint. Having 
Jupyter libraries for JVM-based kernels would be a step in the right 
direction. Adding Spark and other backends as add-ons to kernels could 
also be a nice idea, provided we are wary of what the installation and 
configuration experience for these add-ons ends up looking like for end 
users. Spark, and I imagine other backends, requires network access to all 
worker nodes from the driver. I'm wary of the experience we'll create if we 
make kernels the driver and require kernels to be in the cluster. Livy 
solves a lot of that by making Livy the driver, co-located with the 
cluster, and having Jupyter simply manage connection strings via 
sparkmagic. In the add-ons-to-kernels way of the world, how would a data 
scientist target different clusters or backends? What kind of setup work 
would she have to do?
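To make the connection-string model concrete, here is an illustrative registry of Livy endpoints (not sparkmagic's actual API); switching clusters is just switching which URL the kernel posts to, with no network re-plumbing:

```python
# Illustrative sketch: the kernel holds per-cluster connection strings and
# swaps the active one. Livy stays the driver inside each cluster; the
# kernel only needs HTTP reachability to the chosen endpoint.
class EndpointRegistry:
    def __init__(self):
        self._endpoints = {}
        self._active = None

    def add(self, name, url):
        """Register a named cluster endpoint."""
        self._endpoints[name] = url

    def use(self, name):
        """Make a registered cluster the target for subsequent calls."""
        if name not in self._endpoints:
            raise KeyError(f"unknown cluster: {name}")
        self._active = name

    @property
    def active_url(self):
        """URL the client would send Livy requests to right now."""
        return self._endpoints[self._active]

# Hypothetical endpoints, for illustration only:
reg = EndpointRegistry()
reg.add("prod", "http://livy.prod.example.com:8998")
reg.add("dev", "http://livy.dev.example.com:8998")
reg.use("dev")
print(reg.active_url)
```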

On the visualizations front, I saw an effort to create a mimetype-based 
visualization library here: https://github.com/gnestor/jupyterlab_table
If all kernels, regardless of language (e.g. Python, R, Scala), were to 
output that mimetype, users would get a standard visualization library to 
use, and we devs could converge on it.
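As a sketch of what converging on that mimetype could look like from any kernel: build a mime bundle keyed by application/vnd.dataresource+json (the mimetype the jupyterlab_table experiment renders, if I read it right) plus a text/plain fallback, and hand it to the frontend (in IPython, via display(bundle, raw=True)). The helper below is illustrative, not any library's API:

```python
def table_bundle(columns, rows):
    """Build a mime bundle describing tabular data.

    The 'application/vnd.dataresource+json' payload follows the
    Data Resource / Table Schema idea: a schema listing the fields,
    plus the data as row records. A plain-text fallback keeps the
    output readable in frontends that don't know the mimetype.
    """
    records = [dict(zip(columns, row)) for row in rows]
    return {
        "application/vnd.dataresource+json": {
            "schema": {"fields": [{"name": c} for c in columns]},
            "data": records,
        },
        "text/plain": "\n".join(str(r) for r in records),
    }

bundle = table_bundle(["lang", "kernel"], [["Python", "ipykernel"], ["Scala", "Toree"]])
print(sorted(bundle.keys()))
```

Any kernel that can serialize JSON can emit this, which is exactly the point: the visualization lives in the frontend, not in per-kernel widget code.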

Best,
Alejandro

On Monday, March 6, 2017 at 8:58:40 AM UTC-8, Min RK wrote:
>
> This is awesome, thanks Kyle (and everyone)!
>
> On Fri, Mar 3, 2017 at 5:14 PM, Kyle Kelley <[email protected]> wrote:
>
>> On February 27, 2017 a group of us met to talk about Scala kernels and 
>> pave a path forward for Scala users. There is a youtube video available of 
>> the discussion available here:
>>
>> https://www.youtube.com/watch?v=0NRONVuct0E
>>
>> What follows is a summary from the call, mostly in linear order from the 
>> video itself.
>> Attendees
>>
>>    - Alexander Archambault - Jupyter Scala, Ammonium
>>    - Ryan Blue (Netflix) - Toree
>>    - Gino Bustelo (IBM) - Toree
>>    - Joy Chakraborty (Bloomberg) - Spark Magic with Livy
>>    - Kyle Kelley (Netflix) - Jupyter
>>    - Haley Most (Cloudera) - Toree
>>    - Marius van Niekerk (Maxpoint) - Toree, Spylon
>>    - Peter Parente (Maxpoint) - Jupyter
>>    - Corey Stubbs (IBM) - Toree
>>    - Jamie Whitacre (Berkeley) - Jupyter
>>    - Tristan Zajonc (Cloudera) - Toree, Livy
>>
>> Each of the people on the call has a preferred kernel, way of building 
>> it, and integrating it. We have a significant user experience problem in 
>> terms of users installing and using Scala kernels, beyond just Spark usage. 
>> The overarching goal is to create a cohesive experience for Scala users 
>> when they use Jupyter.
>>
>> When a Scala user tries to come to the Jupyter ecosystem (or even a 
>> familiar Python developer), they face many options for kernels. Being faced 
>> with choice when trying to get things done is creating new friction points 
>> for users. As examples see 
>> https://twitter.com/chrisalbon/status/833156959150841856 and 
>> https://twitter.com/sarah_guido/status/833165030296322049.
>> What are our foundations for REPL libraries in Scala?
>>
>> Toree was built on top of the Spark REPL, and its developers tried to use as 
>> much code as possible from Spark. For Alex's jupyter-scala, he recognized 
>> that the Spark REPL was changing a lot from version to version. At the same 
>> time, Ammonite <https://github.com/lihaoyi/Ammonite> was created to 
>> assist in Scala scripting. To make big data frameworks such as 
>> Spark, Flink, and Scio work well in this environment, a fork called 
>> Ammonium <https://github.com/alexarchambault/ammonium> was created. 
>> There is some amount of trepidation in the kernel community about using a 
>> separate fork. We should make sure to unify with the upstream Ammonite and 
>> contribute back as part of a larger Scala community that can maintain these 
>> together.
>> Action Items:
>>
>>    - Renew focus on Scala within Toree; improve outward messaging about 
>>    how Toree provides a Scala kernel
>>    - Unify Ammonite and Ammonium ([email protected])
>>       - To be used in jupyter-scala, potentially for spylon
>>
>> There is more than one implementation of the Jupyter protocol in the Java 
>> Stack.
>>
>> Toree has one, jupyter-scala has one, and Clojure kernels have their own. 
>> People would like to see a stable Jupyter library for the JVM. Some think 
>> it’s better to have one per language. Regardless of choice, we should have 
>> a well-supported Jupyter library.
>> Action Items:
>>
>>    - Create an idiomatic Java Library for the Jupyter messaging protocol - 
>>    propose this as an incubation project within Jupyter
>>
>> Decouple Spark from Scala in kernels
>>
>> Decouple the language-specific parts from the computing framework to allow 
>> for using other computing frameworks. This is paramount for R and Python. 
>> When we inevitably want to connect to a GPU cluster, we want to be able to 
>> use the same foundations of a kernel. The reason these end up being 
>> coupled is that Spark does “slightly weird things” for how it wants its 
>> classes compiled. It’s thought that there is some amount of specialization 
>> and that we can work around it. At the very least, we can bake it into the 
>> core and leave room for other frameworks to have solid built-in support if 
>> necessary.
>>
>> An approach being worked on in Toree right now is lazy loading of Spark. 
>> One difference between jupyter-scala and Toree is that jupyter-scala can 
>> dynamically load Spark versions, whereas Toree is bound to a version of 
>> Spark at deployment. For end users who have operators/admins, kernels can 
>> be configured per version of Spark they will use (common for Python and R). 
>> Spark drives lots of interest in Scala kernels, and many kernels conflate 
>> the two. This results in poor messaging and experiences for users getting 
>> started.
>> Action Items:
>>
>>    - Lazy load Spark within Toree
>>
>> Focus efforts within kernel communities
>>
>> Larger in scope than just the Scala kernel, we need Jupyter to 
>> acknowledge fully supported kernels. In contrast, the whole Zeppelin 
>> community collaborates in one repository around their interpreters.
>>
>> “Fragmentation of kernels makes it harder for large enterprises to adopt 
>> them.”
>>
>> - Tristan Zajonc (Cloudera)
>>
>> Beyond the technical implementation of what counts as a supported kernel, we 
>> also need the messaging to end users to be simple and clear. There are 
>> several things we need to do to improve our messaging, organization, 
>> and technical underpinnings.
>> Action Items
>>
>>    - On the Jupyter site, provide blurbs and links to kernels for R, 
>>    Python, and Scala
>>    - Create an organized effort around the Scala kernel, possibly by 
>>    unifying in an organization while isolating projects in separate 
>>    repositories
>>    - Align on a specification of what it takes to be acknowledged as a 
>>    supported kernel
>>
>> Visualization
>>
>> We would like to push on the idea of mimetypes that carry a chunk of 
>> JSON from which frontends can draw beautiful visualizations. Having these 
>> adopted in core Jupyter by default would go a long way toward providing 
>> simple, just-works visualizations. The current landscape of visualization 
>> with the Scala kernels includes:
>>
>>    - Vegas <https://github.com/vegas-viz/Vegas>
>>    - Plotly Scala <https://github.com/alexarchambault/plotly-scala>
>>    - Brunel <https://github.com/Brunel-Visualization/Brunel>
>>    - Data Resource / Table Schema (see 
>>    https://github.com/pandas-dev/pandas/pull/14904)
>>
>> There is a bit of worry about standardization around the HTML outputs. 
>> Some libraries depend on frontend libraries that may not exist on the 
>> frontend or may mismatch in version - jquery, requirejs, ipywidgets, 
>> jupyter, ipython. In some frontends, at times dictated by the operating 
>> environment, the HTML outputs must be rendered in null-origin iframes.
>> Action Items
>>
>>    - Continue involvement in Jupyter frontends to provide rich 
>>    visualization out of the box with less configuration and less friction
>>
>> Standardizing display and reprs for Scala
>>
>> Since it’s likely that there will still be multiple kernels available 
>> for the JVM, not just within Scala, we want to standardize the way in which 
>> you inspect objects in the JVM. IPython provides a way for libraries to 
>> integrate with IPython automatically for users. We want library developers 
>> to be able to follow a common scheme and be well represented regardless of 
>> the kernel.
>> Action Items:
>>    
>>    - Create a specification for object representation for JVM languages 
>>    as part of the Jupyter project
>>
>>
>> -- 
>> Kyle Kelley (@rgbkrk <https://twitter.com/rgbkrk>; lambdaops.com)
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Project Jupyter" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/jupyter/CA%2BtbMaUQzt4tb9HVtEKaxrpmGib%3DbENhoYk%3D910vc01oid%3DNhA%40mail.gmail.com
>>  
>> <https://groups.google.com/d/msgid/jupyter/CA%2BtbMaUQzt4tb9HVtEKaxrpmGib%3DbENhoYk%3D910vc01oid%3DNhA%40mail.gmail.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

