Pasha Finkeshteyn created ZEPPELIN-5222:
-------------------------------------------

Summary: Zeppelin hangs on simple query with datasets
Key: ZEPPELIN-5222
URL: https://issues.apache.org/jira/browse/ZEPPELIN-5222
Project: Zeppelin
Issue Type: Bug
Components: spark
Affects Versions: 0.9.0, 0.8.2
Environment:
    OS: Linux (tried Manjaro and Ubuntu)
    Zeppelin version: 0.9 release and 0.8.2
    Java version: 8 and 11
Reporter: Pasha Finkeshteyn

Query:

{code:scala}
case class Movie(movieId: Long, title: String, genres: String)
case class MovieWithGenresAndYear(movieId: Long, title: String, genres: List[String], year: Integer)
case class MovieExploded(movieId: Long, title: String, genres: List[String])
case class MovieAggregate(year: Int, count: Long)

import spark.implicits._

val df = spark
  .read
  .option("header", true)
  .option("inferSchema", true)
  .option("mode", "DROPMALFORMED")
  .csv("/home/finkel/Downloads/ml-latest/movies.csv")
  .as[Movie]
  .map(it => MovieExploded(it.movieId, it.title, it.genres.split('|').map(_.trim).toList))
  .map { case MovieExploded(movieId, title, genres) =>
    if (!title.matches("\"?.*\\(\\d{4}\\)\\s*\"?")) MovieWithGenresAndYear(movieId, title, genres, null)
    else {
      val lastOpen = title.lastIndexOf('(')
      val year = title.substring(lastOpen + 1).replace(")", "").replace("\"", "").trim.toInt
      MovieWithGenresAndYear(movieId, title.substring(0, lastOpen), genres, year)
    }
  }
  .filter(_.year != null)
  .groupByKey(_.year)
  .mapGroups((k, v) => (k, v.size))
  .show(300, false)
{code}

The query hangs forever, even on very simple data:

{code:csv}
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
{code}

The very same query completes almost instantly in Spark Shell.

Can't reproduce on Mac.
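
One way to narrow this down is to confirm that the title-parsing logic terminates on its own, outside both Spark and Zeppelin. The sketch below is not part of the original report. It reuses the regular expression and the substring logic from the query, but the names `YearParsingCheck`, `extractYear`, `sampleTitles`, and `countsByYear` are hypothetical. It also returns `Option[Int]` instead of a nullable `Integer`, and it does not reproduce the truncation of the title.

```scala
// Spark-free sketch: the same year extraction as in the reported query,
// applied to the three sample titles. All names here are hypothetical.
object YearParsingCheck {
  // Same pattern as in the report: optional quote, any text, "(dddd)",
  // optional trailing whitespace and quote.
  private val TitleWithYear = "\"?.*\\(\\d{4}\\)\\s*\"?"

  /** Trailing four-digit year of a title, or None if the title has no such suffix. */
  def extractYear(title: String): Option[Int] =
    if (!title.matches(TitleWithYear)) None
    else {
      val lastOpen = title.lastIndexOf('(')
      Some(title.substring(lastOpen + 1).replace(")", "").replace("\"", "").trim.toInt)
    }

  // Titles taken from the sample CSV in the report.
  val sampleTitles: List[String] =
    List("Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)")

  /** Number of titles per extracted year, mirroring groupByKey + mapGroups. */
  def countsByYear(titles: List[String]): Map[Int, Int] =
    titles.flatMap(extractYear).groupBy(identity).map { case (year, ts) => year -> ts.size }

  def main(args: Array[String]): Unit =
    countsByYear(sampleTitles).foreach { case (year, count) => println(s"$year\t$count") }
}
```

If this returns immediately with a single group for 1995, the hang is less likely to come from the lambdas themselves and more likely from how the Zeppelin Spark interpreter handles the typed Dataset operations. That is only a hint, not a diagnosis.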