Mitch Barnett has posted comments on this change. ( http://gerrit.cloudera.org:8080/11788 )
Change subject: [examples] Add basic Spark example (scala)
......................................................................

Patch Set 10:

(41 comments)

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@9
PS8, Line 9: Kudu-Spark
> You mean it adds a basic kudu-spark example?
I think I copied this from the Kudu-Java example patch notes. Fixed.

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@11
PS8, Line 11: It will allow users to pull down the pom.xml and scala source,
> nit: No paragraph break.
Pushing to gerrit throws a warning about line breaks, which is why I added
this in. Is it safe to ignore and leave it as a single line without width
considerations, or is there a better way to meet the width limitation?

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@11
PS8, Line 11: users to
> Say "users" instead of "customers". Nobody pays Apache to use Apache Kudu.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc
File examples/scala/spark-example/README.adoc:

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@22
PS8, Line 22: the Kudu-Spark i
> This is misleading. It doesn't use the KuduClient directly. Instead it uses
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@27
PS8, Line 27: Scan some rows
> and then split this list item into two:
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@37
PS8, Line 37: $ mvn package
> The above runs against a local cluster? What does it do?
Not quite - it will run as a local execution, and build the 'spark'
architecture locally when executed. There's no cluster to run against here;
it's effectively a standalone application running in its own JVM.
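To make the standalone-vs-cluster point above concrete, here is a rough sketch of the two ways the example can be launched. The property names ('kuduMasters', 'SparkMaster') come from this thread; the jar filename, master addresses, and table name are purely illustrative placeholders, not the patch's actual values:

```shell
# Build the example jar (hypothetical artifact name; the real name comes
# from the pom.xml in the patch).
mvn package

# Standalone run in its own JVM: configuration is passed as Java system
# properties, as discussed in the README comments below.
java -DkuduMasters=kudu-master-1:7051,kudu-master-2:7051 \
     -DSparkMaster='local[*]' \
     -cp target/kudu-spark-example.jar \
     org.apache.kudu.examples.SparkExample

# spark-submit run: Spark's own '--master' flag controls where Spark runs,
# so the same value must also be echoed into the system property for the
# example to pick it up (see the Line 60 discussion below).
spark-submit --master 'local[*]' \
  --class org.apache.kudu.examples.SparkExample \
  --driver-java-options "-DkuduMasters=kudu-master-1:7051 -DSparkMaster=local[*]" \
  target/kudu-spark-example.jar
```

This is only a sketch under those naming assumptions; the authoritative invocation is the one in the patch's README.adoc.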
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@38
PS8, Line 38: et/kudu-spark-example-1
> More precisely, these are Java system properties. Say that, since many user
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40:
> Remove
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40:
> Remove this first part of the sentence. Parameters from the command line al
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40:
> nit: master
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@41
PS8, Line 41: ere are a few Java system properties defined in SparkExample.scala:
> I think this is obvious and should be omitted.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@41
PS8, Line 41:
> Remove, so it's just "Defaults to..."
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@43
PS8, Line 43: A comma-separated list of
> Remove.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@43
PS8, Line 43: master
> You mean the address, right?
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@46
PS8, Line 46: park2 'spark-submit' job,
> Remove.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@47
PS8, Line 47: the '--master' Spark para
> Replace with "Defaults to".
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@49
PS8, Line 49: Defaults to 'spark_test'.
> Omit this sentence since it's a generality about Java system properties, an
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@51
PS8, Line 51: different set of masters for the Kudu cluster from the
           : command line and use a custom table name, use a c
> Replace with just "use a command like the following:"
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@59
PS8, Line 59:
> Do we have to care whether Spark is running on YARN vs anything else? Seems
Yep, this was corrected in a later revision. I misunderstood some of the
previous comments and made this a bit too CDH-centric. This should now just
read: "To run this as a spark2 'spark-submit' job"

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@60
PS8, Line 60: ubmit command
> What does this mean?
With spark-submit, you supply the Spark master address as part of the
command-line options. In order to make this work for both spark-submit and
a standalone Java execution, the '--master' value from the Spark options
also needs to be supplied to the Kudu-Spark example via '-DSparkMaster='.
I can explain in more detail in chat/call if needed.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala
File examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala:

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@19
PS8, Line 19: import collection.JavaConverters._
           :
           : import org.slf4j.LoggerFactory
           :
           : import org.apache.kudu.client._
           : import org.apache.kudu.spark.kudu._
           :
           : import org.apache.spark.sql.Spa
> Double-check that you are complying with the import rules at https://github
Looks correct to me, but please call out any misuse.
Placed the JavaConverters first, rearranged LoggerFactory to come before
org.apache.kudu, and put org.apache.spark last. Only importing the required
definitions unless there are multiple required.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: kuduMasters
> As I commented elsewhere, can we name these parameters camelCase instead of
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: kuduMasters
> I know Grant said these should be capitalized like this if they are constan
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: // Kud
> Space between the // and the first word
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@40
PS8, Line 40: // Defining a class that we'll use to insert data into the table
> It seems like there's benefits to using a case class-- the RDD toDF() calls
Changed this to the following:
// Defining a class that we'll use to insert data into the table.
// Because we're defining a case class here, we circumvent the need to
// explicitly define a schema later in the code, e.g. during our
// RDD -> toDF() calls.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@45
PS8, Line 45:
> I think this comment is redundant and can be omitted, unless you are going
Added the following - let me know if this is insufficient:
// Define our session and context variables for use throughout the program.
// The kuduContext is a serializable container for Kudu client connections,
// while the SparkSession is the entry point to SparkSQL and the
// Dataset/DataFrame API.
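As an aside, the case-class point can be sketched in a few lines. This is only an illustrative fragment, not the patch's code: the 'User' class name and field names are hypothetical, and it assumes Spark 2.x on the classpath (the Kudu pieces are elided):

```scala
import org.apache.spark.sql.SparkSession

// A case class: Spark derives the DataFrame schema (column names and
// types) from its fields via reflection, so no explicit StructType is
// needed when converting an RDD of these to a DataFrame.
case class User(name: String, id: Int)

object CaseClassSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CaseClassSketch")
      .master("local[*]")
      .getOrCreate()

    // This import hangs off the 'spark' value just created, which is why
    // it cannot simply be hoisted to the top of the file.
    import spark.implicits._

    val data = Seq(User("alice", 1), User("bob", 2))
    // toDF() needs no schema argument because User is a case class.
    val userDF = spark.sparkContext.parallelize(data).toDF()
    userDF.printSchema()  // columns: name (string), id (int)

    spark.stop()
  }
}
```

The same mechanism is what lets the example skip an explicit schema in its RDD -> toDF() calls.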
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@47
PS8, Line 47:
> Rename to kuduContext for maximum readability.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@49
PS8, Line 49: // The kuduContext is a serializable container for Kudu client connections,
           : // while the SparkSessio
> Can you import this at the top? I feel like this is an antipattern.
It is an anti-pattern, and it's ugly - but there's no way to move it that
I've been able to figure out. It's even in Spark's own documentation as such:
https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#hive-tables

Moving that import fails the build:

[ERROR] /Users/mbarnett/src/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala:14: error: not found: object spark
[ERROR] import spark.implicits._
[ERROR]        ^
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE

because the import uses the 'spark' instance we declare above it, which is
what actually contains the lib. This link explains it a bit better:
https://stackoverflow.com/questions/39968707/spark-2-0-missing-spark-implicits

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@52
PS8, Line 52: l kuduContext = new KuduContext(kuduMasters, spark.sq
> nit: I'd just say "The schema of the table we're going to use."
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@53
PS8, Line 53:
> This is obvious from the code and doesn't need to be comment.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@62
PS8, Line 62: StructField(nameCol, String
> Remove, so it's just "Create the..."
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@64
PS8, Line 64:
> Please split into lines of 80 characters or less. Occasional overflows up t
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@68
PS8, Line 68: if (!kuduContext.tableExists(tableName)) {
> This restates the code. Remove.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@76
PS8, Line 76: val userRDD = spark.sparkContext.parallelize(data)
           : val userDF = userRDD.toDF()
> Remove. Don't just restate the code in comments-- explain it.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@78
PS8, Line 78: kuduContext.insertRows(userDF, tableName)
> Add a log message here: "Reading back the rows written" or something.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@81
PS8, Line 81: fo("Reading back the rows just
> Does .show() not work?
I don't think there's a .show() call for RDDs, is there? readRDD.map returns
an RDD - as far as I know, .show() only exists on a DataFrame or is called
directly through sqlContext (as done below).

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@84
PS8, Line 84: val userTuple = readRDD.map { case Row(name: String, id: Int) => (name, id) }
> Remove.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@93
PS8, Line 93:
> Remove.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@96
PS8, Line 96: > kuduMasters, "kudu.table" -> tableName)).kudu
> This would be nicer if you used https://docs.scala-lang.org/overviews/core/
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@97
PS8, Line 97: sqlDF.createOrReplaceTempView(tableName)
> Extra newline.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@101
PS8, Line 101: nally {
> Remove, or replace with "Clean up."
Done

--
To view, visit http://gerrit.cloudera.org:8080/11788
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ba09f0118c054a07b951e241c31d66245c57d3f
Gerrit-Change-Number: 11788
Gerrit-PatchSet: 10
Gerrit-Owner: Mitch Barnett <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Greg Solovyev <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mitch Barnett <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Mon, 29 Oct 2018 22:03:22 +0000
Gerrit-HasComments: Yes
