Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/11788 )
Change subject: [examples] Add basic Spark example (scala)
......................................................................

Patch Set 9:

(44 comments)

Still need to try running it. Will respond again after I have.

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@9
PS8, Line 9: Kudu client
You mean it adds a basic kudu-spark example?

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@11
PS8, Line 11: It will allow customers to pull down the pom.xml and scala source,
nit: No paragraph break.

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@11
PS8, Line 11: customers
Say "users" instead of "customers". Nobody pays Apache to use Apache Kudu. Or, if they do, I'm not getting my take and I'm pretty mad -.-

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc
File examples/scala/spark-example/README.adoc:

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@22
PS8, Line 22: Kudu client APIs
This is misleading. It doesn't use the KuduClient directly. Instead it uses the methods on KuduContext, which is the recommended way to do things. I would say instead that the program "uses the Kudu-Spark integration to:"

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@27
PS8, Line 27: Scan some rows
and then split this list item into two:

- Scan rows using RDD / DataFrame methods
- Scan rows using SparkSQL

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@37
PS8, Line 37:
The above runs against a local cluster? What does it do?

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@38
PS8, Line 38: configurable parameters
More precisely, these are Java system properties. Say that, since many users will understand what this means.
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40: Host
Remove

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40: Master
nit: master

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40: A String value consisting of
Remove this first part of the sentence. Parameters from the command line always start as strings.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@41
PS8, Line 41: This
Remove, so it's just "Defaults to..."

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@41
PS8, Line 41: This will need to be pointed to the Kudu cluster you wish to target.
I think this is obvious and should be omitted.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@43
PS8, Line 43: location
You mean the address, right?

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@43
PS8, Line 43: A String value conatining
Remove.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@46
PS8, Line 46: A String value containing
Remove.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@47
PS8, Line 47: The default table name is
Replace with "Defaults to".

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@49
PS8, Line 49: To specify a value at execution time, you'll specify the parameter name in the '-D<name>' format.
Omit this sentence since it's a generality about Java system properties, and the example demonstrates the syntax anyway.
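As a sketch of what reading these as Java system properties could look like (using the camelCase names suggested elsewhere in this review; the default values here are placeholders, not the patch's actual defaults):

```scala
// Hypothetical parameter names and defaults -- the patch defines its own.
val kuduMasters = System.getProperty("kuduMasters", "localhost:7051")
val tableName = System.getProperty("tableName", "spark_test")

println(s"kuduMasters=$kuduMasters tableName=$tableName")
```

Run with `-DkuduMasters=...` to override; `System.getProperty`'s second argument supplies the default otherwise.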
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@51
PS8, Line 51: set the property `KuduMasters` to a CSV of the master addresses in the form
           : `host:port` and add a table name value, as shown:
Replace with just "use a command like the following:"

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@56
PS8, Line 56: KuduMasters
I don't like PascalCase for these parameters. We have the same kuduMaster parameter for the java example except it's named camelCase.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@56
PS8, Line 56: TableName
tableName

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@56
PS8, Line 56: target/kudu-spark-example-1.0-SNAPSHOT.jar
Split the command so each line is 80 characters or fewer. Remember to use \ to break bash commands into multiple lines properly.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@59
PS8, Line 59: rk2 'spark-sub
Do we have to care whether Spark is running on YARN vs anything else? Seems like the answer should be no. What do you do if you are running on Mesos? Is that still a thing for Spark 2?

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@60
PS8, Line 60: your '--master' parameter
What does this mean?
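For the line-splitting comment above, something along these lines (the master addresses and table name are placeholders, and `--driver-java-options` is one way to pass `-D` properties to the driver; the patch's README may do it differently):

```shell
spark-submit \
  --class org.apache.kudu.examples.SparkExample \
  --master local \
  --driver-java-options "-DkuduMasters=master1:7051,master2:7051" \
  target/kudu-spark-example-1.0-SNAPSHOT.jar
```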
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala
File examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala:

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@19
PS8, Line 19: import collection.JavaConverters._
           :
           : import org.apache.kudu.client._
           : import org.apache.kudu.spark.kudu._
           :
           : import org.apache.spark.sql.SparkSession
           : import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
           : import org.apache.spark.sql.Row
Double-check that you are complying with the import rules at https://github.com/databricks/scala-style-guide#imports. We don't have a chosen Scala style guide, so I'm choosing the Databricks one for this file because it's what's used for Spark itself.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: //Kudu
Space between the // and the first word:

// Kudu master address list.

Here and in most every other comment in the file.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: KuduMasters
As I commented elsewhere, can we name these parameters camelCase instead of PascalCase?

kuduMasters
tableName
sparkMaster

This will be consistent with the Java example.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: KuduMasters
I know Grant said these should be capitalized like this if they are constants, but they aren't. They are immutable values retrieved from the environment, so I think they should also be camelCase.
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@40
PS8, Line 40: //Defining a class that we'll use to insert data into the table.
It seems like there are benefits to using a case class -- the RDD toDF() calls below don't need an explicit schema, for example. Maybe this comment should explain that we're using a case class for this reason? This will help newer people make good decisions about data modeling with Spark.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@45
PS8, Line 45: //Define our session and context variables for use throughout the program.
I think this comment is redundant and can be omitted, unless you are going to explain more about what a KuduContext is. Maybe you should say explicitly that all interaction with Kudu should be through the KuduContext, or in extremis use the KuduClient obtainable from KuduContext#kuduClient?

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@47
PS8, Line 47: kc
Rename to kuduContext for maximum readability.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@49
PS8, Line 49: //Importing a class from the SparkSession we instantiated above, to allow for easier RDD -> DF conversions.
           : import spark.implicits._
Can you import this at the top? I feel like this is an antipattern.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@52
PS8, Line 52: Define the schema of the table we're going to create.
nit: I'd just say "The schema of the table we're going to use."
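To make the case-class point concrete, a minimal sketch (class and field names are just illustrative, and the Spark toDF() inference itself isn't shown since it needs a SparkSession):

```scala
// A case class carries named, typed fields, which is what lets Spark's
// toDF() infer column names and types without an explicit schema.
case class User(name: String, id: Int)

val users = Seq(User("a", 1), User("b", 2))
println(users.map(_.name).mkString(","))
```

With `import spark.implicits._` in scope, `sc.parallelize(users).toDF()` would pick up the columns `name` and `id` from the case class itself.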
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@53
PS8, Line 53: //This will be a basic schema, with only two columns: 'name' and 'id'
This is obvious from the code and doesn't need to be a comment.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@62
PS8, Line 62: xecute the command from the KuduContext to
Remove, so it's just "Create the..."

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@64
PS8, Line 64: .setNumReplicas(1).addHashPartitions(List(IdCol).asJava, 3))
Please split into lines of 80 characters or less. Occasional overflows up to 100 characters are ok if splitting the line makes it harder to read.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@68
PS8, Line 68: //Create an array of User objects, put them into an RDD, convert the RDD to a DF and then insert the rows.
This restates the code. Remove.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@76
PS8, Line 76: //Specify the columns of the table we want to read, push those into a kuduRDD object for the current SparkContext and the table.
           : //Map the values in the RDD to give us a tuple format, and then print the results.
Remove. Don't just restate the code in comments -- explain it.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@78
PS8, Line 78: val readCols = Seq(NameCol, IdCol)
Add a log message here: "Reading back the rows written" or something.
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@81
PS8, Line 81: .collect().foreach(println(_))
Does .show() not work?

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@84
PS8, Line 84: //Create another Array of User objects to upsert, pass them into an RDD, convert it to a DF and upsert it to the table specified.
Remove.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@93
PS8, Line 93: //Create a DF and map the values of the KuduMaster and TableName values.
Remove.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@96
PS8, Line 96: "select * from " + TableName + " where " + IdCol + " > 1000"
This would be nicer if you used https://docs.scala-lang.org/overviews/core/string-interpolation.html.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@97
PS8, Line 97:
Extra newline.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@101
PS8, Line 101: //Delete the table and shutdown the spark session.
Remove, or replace with "Clean up."
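For the interpolation suggestion, a quick sketch (the values below are placeholders standing in for the patch's TableName and IdCol):

```scala
// Placeholder values standing in for the patch's parameters.
val TableName = "test_table"
val IdCol = "id"

// s-interpolation instead of "+" concatenation:
val query = s"select * from $TableName where $IdCol > 1000"
println(query)  // prints: select * from test_table where id > 1000
```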
--
To view, visit http://gerrit.cloudera.org:8080/11788
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ba09f0118c054a07b951e241c31d66245c57d3f
Gerrit-Change-Number: 11788
Gerrit-PatchSet: 9
Gerrit-Owner: Mitch Barnett <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Greg Solovyev <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mitch Barnett <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Mon, 29 Oct 2018 20:02:51 +0000
Gerrit-HasComments: Yes
