Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/11788 )
Change subject: [examples] Add basic Spark example (scala)
......................................................................

Patch Set 9:

(44 comments)

Still need to try running it. Will respond again after I have.

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@9
PS8, Line 9: Kudu client
You mean it adds a basic kudu-spark example?

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@11
PS8, Line 11: It will allow customers to pull down the pom.xml and scala source,
nit: No paragraph break.

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@11
PS8, Line 11: customers
Say "users" instead of "customers". Nobody pays Apache to use Apache Kudu. Or, if they do, I'm not getting my take and I'm pretty mad -.-

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc
File examples/scala/spark-example/README.adoc:

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@22
PS8, Line 22: Kudu client APIs
This is misleading. It doesn't use the KuduClient directly. Instead it uses the methods on KuduContext, which is the recommended way to do things. I would say instead that the program "uses the Kudu-Spark integration to:"

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@27
PS8, Line 27: Scan some rows
and then split this list item into two:

- Scan rows using RDD / DataFrame methods
- Scan rows using SparkSQL

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@37
PS8, Line 37:
The above runs against a local cluster? What does it do?

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@38
PS8, Line 38: configurable parameters
More precisely, these are Java system properties. Say that, since many users will understand what this means.
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40: Host
Remove

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40: Master
nit: master

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40: A String value consisting of
Remove this first part of the sentence. Parameters from the command line always start as strings.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@41
PS8, Line 41: This
Remove, so it's just "Defaults to..."

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@41
PS8, Line 41: This will need to be pointed to the Kudu cluster you wish to target.
I think this is obvious and should be omitted.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@43
PS8, Line 43: location
You mean the address, right?

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@43
PS8, Line 43: A String value conatining
Remove.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@46
PS8, Line 46: A String value containing
Remove.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@47
PS8, Line 47: The default table name is
Replace with "Defaults to".

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@49
PS8, Line 49: To specify a value at execution time, you'll specify the parameter name in the '-D<name>' format.
Omit this sentence since it's a generality about Java system properties, and the example demonstrates the syntax anyway.
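As a sketch of what reading these as Java system properties could look like (using the camelCase names suggested elsewhere in this review; the default values here are placeholders, not the patch's actual defaults):

```scala
// Hypothetical parameter names and defaults -- the patch defines its own.
val kuduMasters = System.getProperty("kuduMasters", "localhost:7051")
val tableName = System.getProperty("tableName", "spark_test")

println(s"kuduMasters=$kuduMasters tableName=$tableName")
```

Run with `-DkuduMasters=...` to override; `System.getProperty`'s second argument supplies the default otherwise.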
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@51
PS8, Line 51: set the property `KuduMasters` to a CSV of the master addresses in the form
           : `host:port` and add a table name value, as shown:
Replace with just "use a command like the following:"

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@56
PS8, Line 56: KuduMasters
I don't like PascalCase for these parameters. We have the same kuduMaster parameter for the java example except it's named camelCase.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@56
PS8, Line 56: TableName
tableName

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@56
PS8, Line 56: target/kudu-spark-example-1.0-SNAPSHOT.jar
Split the command so each line is 80 characters or fewer. Remember to use \ to break bash commands into multiple lines properly.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@59
PS8, Line 59: rk2 'spark-sub
Do we have to care whether Spark is running on YARN vs anything else? Seems like the answer should be no. What do you do if you are running on Mesos? Is that still a thing for Spark 2?

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@60
PS8, Line 60: your '--master' parameter
What does this mean?
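For the line-splitting comment above, something along these lines (the master addresses and table name are placeholders, and `--driver-java-options` is one way to pass `-D` properties to the driver; the patch's README may do it differently):

```shell
spark-submit \
  --class org.apache.kudu.examples.SparkExample \
  --master local \
  --driver-java-options "-DkuduMasters=master1:7051,master2:7051" \
  target/kudu-spark-example-1.0-SNAPSHOT.jar
```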
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala
File examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala:

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@19
PS8, Line 19: import collection.JavaConverters._
           :
           : import org.apache.kudu.client._
           : import org.apache.kudu.spark.kudu._
           :
           : import org.apache.spark.sql.SparkSession
           : import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
           : import org.apache.spark.sql.Row
Double-check that you are complying with the import rules at https://github.com/databricks/scala-style-guide#imports. We don't have a chosen Scala style guide, so I'm choosing the Databricks one for this file because it's what's used for Spark itself.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: //Kudu
Space between the // and the first word:

// Kudu master address list.

Here and in most every other comment in the file.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: KuduMasters
As I commented elsewhere, can we name these parameters camelCase instead of PascalCase?

kuduMasters
tableName
sparkMaster

This will be consistent with the Java example.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: KuduMasters
I know Grant said these should be capitalized like this if they are constants, but they aren't. They are immutable values retrieved from the environment, so I think they should also be camelCase.
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@40
PS8, Line 40: //Defining a class that we'll use to insert data into the table.
It seems like there are benefits to using a case class -- the RDD toDF() calls below don't need an explicit schema, for example. Maybe this comment should explain that we're using a case class for this reason? This will help newer people make good decisions about data modeling with Spark.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@45
PS8, Line 45: //Define our session and context variables for use throughout the program.
I think this comment is redundant and can be omitted, unless you are going to explain more about what a KuduContext is. Maybe you should say explicitly that all interaction with Kudu should be through the KuduContext, or in extremis use the KuduClient obtainable from KuduContext#kuduClient?

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@47
PS8, Line 47: kc
Rename to kuduContext for maximum readability.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@49
PS8, Line 49: //Importing a class from the SparkSession we instantiated above, to allow for easier RDD -> DF conversions.
           : import spark.implicits._
Can you import this at the top? I feel like this is an antipattern.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@52
PS8, Line 52: Define the schema of the table we're going to create.
nit: I'd just say "The schema of the table we're going to use."
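To make the case-class point concrete, a minimal sketch (class and field names are just illustrative, and the Spark toDF() inference itself isn't shown since it needs a SparkSession):

```scala
// A case class carries named, typed fields, which is what lets Spark's
// toDF() infer column names and types without an explicit schema.
case class User(name: String, id: Int)

val users = Seq(User("a", 1), User("b", 2))
println(users.map(_.name).mkString(","))
```

With `import spark.implicits._` in scope, `sc.parallelize(users).toDF()` would pick up the columns `name` and `id` from the case class itself.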
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@53
PS8, Line 53: //This will be a basic schema, with only two columns: 'name' and 'id'
This is obvious from the code and doesn't need to be a comment.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@62
PS8, Line 62: xecute the command from the KuduContext to
Remove, so it's just "Create the..."

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@64
PS8, Line 64: .setNumReplicas(1).addHashPartitions(List(IdCol).asJava, 3))
Please split into lines of 80 characters or less. Occasional overflows up to 100 characters are ok if splitting the line makes it harder to read.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@68
PS8, Line 68: //Create an array of User objects, put them into an RDD, convert the RDD to a DF and then insert the rows.
This restates the code. Remove.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@76
PS8, Line 76: //Specify the columns of the table we want to read, push those into a kuduRDD object for the current SparkContext and the table.
           : //Map the values in the RDD to give us a tuple format, and then print the results.
Remove. Don't just restate the code in comments -- explain it.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@78
PS8, Line 78: val readCols = Seq(NameCol, IdCol)
Add a log message here: "Reading back the rows written" or something.
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@81
PS8, Line 81: .collect().foreach(println(_))
Does .show() not work?

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@84
PS8, Line 84: //Create another Array of User objects to upsert, pass them into an RDD, convert it to a DF and upsert it to the table specified.
Remove.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@93
PS8, Line 93: //Create a DF and map the values of the KuduMaster and TableName values.
Remove.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@96
PS8, Line 96: "select * from " + TableName + " where " + IdCol + " > 1000"
This would be nicer if you used https://docs.scala-lang.org/overviews/core/string-interpolation.html.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@97
PS8, Line 97:
Extra newline.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@101
PS8, Line 101: //Delete the table and shutdown the spark session.
Remove, or replace with "Clean up."
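For the interpolation suggestion, a quick sketch (the values below are placeholders standing in for the patch's TableName and IdCol):

```scala
// Placeholder values standing in for the patch's parameters.
val TableName = "test_table"
val IdCol = "id"

// s-interpolation instead of "+" concatenation:
val query = s"select * from $TableName where $IdCol > 1000"
println(query)  // prints: select * from test_table where id > 1000
```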
--
To view, visit http://gerrit.cloudera.org:8080/11788
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ba09f0118c054a07b951e241c31d66245c57d3f
Gerrit-Change-Number: 11788
Gerrit-PatchSet: 9
Gerrit-Owner: Mitch Barnett <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Greg Solovyev <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mitch Barnett <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Mon, 29 Oct 2018 20:02:51 +0000
Gerrit-HasComments: Yes
