AV created ZEPPELIN-3927:
----------------------------

             Summary: Unstable State running Code
                 Key: ZEPPELIN-3927
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3927
             Project: Zeppelin
          Issue Type: Bug
          Components: zeppelin-interpreter
    Affects Versions: 0.9.0
            Reporter: AV


Executing the tutorial notebook code produces inconsistent results with Spark 2.4.0:

> import org.apache.commons.io.IOUtils
> import java.net.URL
> import java.nio.charset.Charset
>
>
> // Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SqlContext),
> // so you don't need to create them manually
>
> // Remote Address
> val csvURL = "https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"
>
> // Parallel processing
> val bankText = sc.parallelize( IOUtils.toString( new URL(csvURL), Charset.forName("UTF-8") ).split("\n") )
>
> case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
>
> val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
>    s => Bank(s(0).toInt, 
>            s(1).replaceAll("\"", ""),
>            s(2).replaceAll("\"", ""),
>            s(3).replaceAll("\"", ""),
>            s(5).replaceAll("\"", "").toInt
>        )
> ).toDF()
>
> bank.registerTempTable("bank")
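For reference, the parsing step can be exercised without Spark at all. A minimal sketch on a hard-coded header plus one data row in the bank.csv layout (the sample values are illustrative) shows the filter/map logic itself is sound:

```scala
// Standalone check of the CSV-parsing pipeline above -- no SparkContext needed.
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

object ParseCheck {
  // Same filter/map pipeline as the notebook paragraph, minus Spark.
  def parse(lines: Seq[String]): Seq[Bank] =
    lines
      .map(_.split(";"))
      .filter(s => s(0) != "\"age\"")                 // drop the quoted header row
      .map(s => Bank(
        s(0).toInt,
        s(1).replaceAll("\"", ""),
        s(2).replaceAll("\"", ""),
        s(3).replaceAll("\"", ""),
        s(5).replaceAll("\"", "").toInt))

  def main(args: Array[String]): Unit = {
    // Header plus one record in the bank.csv column layout (illustrative values).
    val sample = Seq(
      "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"",
      "30;\"unemployed\";\"married\";\"primary\";\"no\";1787")
    println(parse(sample))
  }
}
```

This isolates the row-mapping logic from the RDD machinery, which points the failure at the data-fetching side rather than the parsing side.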

 

On the first run (after a Spark interpreter restart) everything works fine; the output is:

> warning: there was one deprecation warning; re-run with -deprecation for details
> import sqlContext.implicits._
> import org.apache.commons.io.IOUtils
> import java.net.URL
> import java.nio.charset.Charset
> csvURL: String = https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv
> bankText: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:28
> defined class Bank
> bank: org.apache.spark.sql.DataFrame = [age: int, job: string ... 3 more fields]

 

After the code has been executed once, any re-run fails:

> warning: there was one deprecation warning; re-run with -deprecation for details
> java.lang.IllegalArgumentException: URI is not absolute
>   at java.net.URI.toURL(URI.java:1088)
>   at org.apache.hadoop.fs.http.AbstractHttpFileSystem.open(AbstractHttpFileSystem.java:60)
>   at org.apache.hadoop.fs.http.HttpsFileSystem.open(HttpsFileSystem.java:23)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at org.apache.hadoop.fs.FsUrlConnection.connect(FsUrlConnection.java:50)
>   at org.apache.hadoop.fs.FsUrlConnection.getInputStream(FsUrlConnection.java:59)
>   at java.net.URL.openStream(URL.java:1045)
>   at org.apache.commons.io.IOUtils.toString(IOUtils.java:894)
>   ... 39 elided
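Note that the trace shows the https URL being opened through Hadoop's FsUrlConnection rather than the JDK's built-in handler. One plausible explanation (an assumption, not confirmed) is that Hadoop's FsUrlStreamHandlerFactory was installed via URL.setURLStreamHandlerFactory at some point between the runs; that hook is JVM-wide and one-shot, so once set, every subsequent new URL(...) in the interpreter is routed through Hadoop. A minimal sketch of the one-shot behavior, using a hypothetical no-op factory instead of Hadoop:

```scala
import java.net.{URL, URLStreamHandler, URLStreamHandlerFactory}

object FactoryDemo {
  // Hypothetical factory standing in for Hadoop's FsUrlStreamHandlerFactory.
  // Returning null makes the JDK fall back to its default handlers.
  val noop = new URLStreamHandlerFactory {
    def createURLStreamHandler(protocol: String): URLStreamHandler = null
  }

  // Attempts to install the factory twice and reports what happens.
  def trySetTwice(): String = {
    URL.setURLStreamHandlerFactory(noop)   // first call: succeeds
    try {
      URL.setURLStreamHandlerFactory(noop) // second call: the hook is one-shot
      "second call succeeded"
    } catch {
      case _: Error => "factory already defined"
    }
  }

  def main(args: Array[String]): Unit =
    println(trySetTwice())                 // prints "factory already defined"
}
```

Because the factory cannot be unset, this would also explain why only a full interpreter restart recovers the first-run behavior.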

 

In addition, the following compile error is reported (despite the message, this is an error, not a deprecation warning):

> <console>:36: error: value toDF is not a member of org.apache.spark.rdd.RDD[Bank]
> possible cause: maybe a semicolon is missing before `value toDF'?
>        ).toDF()
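On the toDF error itself: toDF is not defined on RDD; it is added by an implicit conversion that import sqlContext.implicits._ brings into scope, so the message is consistent with that import no longer being in effect when the paragraph is re-compiled. A minimal sketch of how an extension method appears and disappears with its import (hypothetical names, no Spark involved):

```scala
object ImplicitsDemo {
  final case class Record(value: Int)

  // Stand-in for sqlContext.implicits._: the extension method lives here.
  object implicits {
    implicit class ToDfOps(val rs: Seq[Record]) {
      // Hypothetical stand-in for the real toDF provided by Spark's implicits.
      def toDF(): String = s"DataFrame(${rs.size} rows)"
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Record(1), Record(2))
    // Without the import below, rows.toDF() fails to compile with
    // "value toDF is not a member of Seq[Record]" -- the same shape of error.
    import implicits._
    println(rows.toDF())
  }
}
```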

 

Any ideas?

 

P.S.: I'm a little curious that there are no other reports of this problem. Compiling from source against the latest stable Spark and Hadoop releases seems like a natural thing to do.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
