GitHub user doanduyhai opened a pull request:
https://github.com/apache/incubator-zeppelin/pull/80
Add Scala utility functions for display
Until now, there are two ways to display data as a table:
1. Either use a **Spark DataFrame** and Zeppelin's built-in support
2. Or build the `%table ...` string manually and `println` it. For example, to
display an `RDD[(String,String,Int)]` representing a collection of users:
```scala
// Collect the RDD to the driver first: appending to a driver-side StringBuilder
// inside rdd.foreach would run on the executors and the data would be lost.
val data = new java.lang.StringBuilder("%table Login\tName\tAge\n")
rdd.collect().foreach {
  case (login, name, age) => data.append(s"$login\t$name\t$age\n")
}
println(data.toString())
```
My proposal is to add a new utility function that makes creating tables easier
than in the code example above. Of course one can always use a **Spark DataFrame**,
but I find it quite restrictive: people using Spark versions earlier than 1.3
cannot rely on DataFrame, and sometimes one does not want to transform an RDD
into a DataFrame just for display.
How are the utility functions implemented?
1. I added a new module, **spark-utils**, which provides the Scala code for the
display utility functions. This module uses the **maven-scala-plugin** to
compile all the classes in the package `org.apache.zeppelin.spark.utils`.
2. Right now the package `org.apache.zeppelin.spark.utils` contains a single
object, `DisplayUtils`, which augments RDDs of tuples or RDDs of Scala case
classes (all of them subclasses of trait `Product`) with the new method
`displayAsTable(columnLabels: String*)` (see the sketch at the end of this
description).
3. The `DisplayUtils` object is imported automatically into the
`SparkInterpreter` with `intp.interpret("import
org.apache.zeppelin.spark.utils.DisplayUtils._");`
4. The Maven module **interpreter** now has a **runtime** dependency on the
module **spark-utils**, so that the utility class is loaded at runtime.
5. Usage of the new display utility function is as follows:
**Paragraph1**
```scala
import org.apache.spark.rdd.RDD

case class Person(login: String, name: String, age: Int)

val rddTuples: RDD[(String, String, Int)] =
  sc.parallelize(List(("jdoe", "John DOE", 32), ("hsue", "Helen SUE", 27)))

val rddCaseClass: RDD[Person] =
  sc.parallelize(List(Person("jdoe", "John DOE", 32), Person("hsue", "Helen SUE", 27)))
```
**Paragraph2**
```scala
rddTuples.displayAsTable("Login","Name","Age")
```
**Paragraph3**
```scala
rddCaseClass.displayAsTable("Login","Name","Age")
```
6. The `displayAsTable()` method is error-proof: if the user provides **more**
column labels than the number of elements in the tuple/case class, the extra
labels are ignored. If the user provides **fewer** column labels than expected,
the method pads the missing column headers with **Column2**, **Column3**, etc.
(see the sketch at the end of this description).
7. In addition to the `displayAsTable` method, I added some other utility
methods to make it easier to handle custom HTML and images (a sketch follows
this list):
a. calling `html()` will generate the string `"%html "`
b. calling `html("<p> This is a test</p>")` will generate the string
`"%html <p> This is a test</p>"`
c. calling `img("http://www.google.com")` will generate the string
`"<img src='http://www.google.com' />"`
d. calling `img64()` will generate the string `"%img "`
e. calling `img64("ABCDE123")` will generate the string `"%img ABCDE123"`
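As a rough sketch, the helpers from point 7 could look like the following. The
method names and the produced strings come from the list above; the default
parameters and exact signatures are assumptions of this sketch, not necessarily
the committed code:
```scala
object DisplayUtils {

  // Sketch of the helpers from point 7: the output strings follow the
  // examples above; the default "" parameters are an assumption.
  def html(htmlContent: String = ""): String = s"%html $htmlContent"

  def img(url: String): String = s"<img src='$url' />"

  def img64(base64Content: String = ""): String = s"%img $base64Content"
}
```
With the automatic import from point 3, a paragraph could then simply call, for
example, `println(html("<h3>Users</h3>"))`.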
Of course, the `DisplayUtils` object can be extended with other functions to
support more advanced display features in the future.
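For reference, here is a minimal sketch of how the `displayAsTable` enrichment
from points 2 and 6 could be written. Only the `displayAsTable(columnLabels: String*)`
signature and the padding behaviour come from the description above; the wrapper
class name and the internals are illustrative assumptions:
```scala
import org.apache.spark.rdd.RDD

object DisplayUtils {

  // Illustrative enrichment: any RDD of Product (tuples or case classes)
  // gains a displayAsTable method. The class name is hypothetical.
  implicit class ProductRDDDisplay[T <: Product](rdd: RDD[T]) {

    def displayAsTable(columnLabels: String*): Unit = {
      val rows = rdd.collect()
      // Column count is taken from the first row of the RDD
      val arity = rows.headOption.map(_.productArity).getOrElse(0)
      // Extra labels are dropped; missing ones are padded with Column2, Column3, ...
      val headers = (0 until arity).map { i =>
        if (i < columnLabels.size) columnLabels(i) else s"Column${i + 1}"
      }
      val builder = new StringBuilder("%table ")
      builder.append(headers.mkString("\t")).append('\n')
      rows.foreach(row => builder.append(row.productIterator.mkString("\t")).append('\n'))
      println(builder.toString())
    }
  }
}
```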
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/doanduyhai/incubator-zeppelin DisplayUtils
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-zeppelin/pull/80.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #80
----
commit 5edab7130e70cf9f765dc268648d8ef294251b37
Author: DuyHai DOAN <[email protected]>
Date: 2015-05-23T19:49:33Z
Add new module spark-utils to expose utility functions for display
----