Mitch Barnett has posted comments on this change. ( http://gerrit.cloudera.org:8080/11788 )
Change subject: [examples] Add basic Spark example (scala)
......................................................................

Patch Set 10:

(41 comments)

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@9
PS8, Line 9: Kudu-Spark
> You mean it adds a basic kudu-spark example?
I think I copied this from the Kudu-Java example patch notes. Fixed.

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@11
PS8, Line 11: It will allow users to pull down the pom.xml and scala source,
> nit: No paragraph break.
Pushing to gerrit throws a warning about line breaks, which is why I added
this in. Is it safe to ignore and leave it as a single line without width
considerations, or is there a better way to meet the width limitation?

http://gerrit.cloudera.org:8080/#/c/11788/8//COMMIT_MSG@11
PS8, Line 11: users to
> Say "users" instead of "customers". Nobody pays Apache to use Apache Kudu.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc
File examples/scala/spark-example/README.adoc:

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@22
PS8, Line 22: the Kudu-Spark i
> This is misleading. It doesn't use the KuduClient directly. Instead it uses
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@27
PS8, Line 27: Scan some rows
> and then split this list item into two:
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@37
PS8, Line 37: $ mvn package
> The above runs against a local cluster? What does it do?
Not quite - it will run as a local execution, and build the 'spark'
architecture locally when executed. There's no cluster to run against here;
it's effectively a standalone application running in its own JVM.
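To make the standalone-vs-cluster point above concrete, here is a rough sketch of the two ways the example can be launched. The property names ('kuduMasters', 'SparkMaster') come from this thread; the jar filename, master addresses, and table name are purely illustrative placeholders, not the patch's actual values:

```shell
# Build the example jar (hypothetical artifact name; the real name comes
# from the pom.xml in the patch).
mvn package

# Standalone run in its own JVM: configuration is passed as Java system
# properties, as discussed in the README comments below.
java -DkuduMasters=kudu-master-1:7051,kudu-master-2:7051 \
     -DSparkMaster='local[*]' \
     -cp target/kudu-spark-example.jar \
     org.apache.kudu.examples.SparkExample

# spark-submit run: Spark's own '--master' flag controls where Spark runs,
# so the same value must also be echoed into the system property for the
# example to pick it up (see the Line 60 discussion below).
spark-submit --master 'local[*]' \
  --class org.apache.kudu.examples.SparkExample \
  --driver-java-options "-DkuduMasters=kudu-master-1:7051 -DSparkMaster=local[*]" \
  target/kudu-spark-example.jar
```

This is only a sketch under those naming assumptions; the authoritative invocation is the one in the patch's README.adoc.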
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@38
PS8, Line 38: et/kudu-spark-example-1
> More precisely, these are Java system properties. Say that, since many user
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40:
> Remove
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40:
> Remove this first part of the sentence. Parameters from the command line al
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@40
PS8, Line 40:
> nit: master
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@41
PS8, Line 41: ere are a few Java system properties defined in SparkExample.scala:
> I think this is obvious and should be omitted.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@41
PS8, Line 41:
> Remove, so it's just "Defaults to..."
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@43
PS8, Line 43: A comma-separated list of
> Remove.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@43
PS8, Line 43: master
> You mean the address, right?
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@46
PS8, Line 46: park2 'spark-submit' job,
> Remove.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@47
PS8, Line 47: the '--master' Spark para
> Replace with "Defaults to".
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@49
PS8, Line 49: Defaults to 'spark_test'.
> Omit this sentence since it's a generality about Java system properties, an
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@51
PS8, Line 51: different set of masters for the Kudu cluster from the
           : command line and use a custom table name, use a c
> Replace with just "use a command like the following:"
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@59
PS8, Line 59:
> Do we have to care whether Spark is running on YARN vs anything else? Seems
Yep, this was corrected in a later revision. I misunderstood some of the
previous comments and made this a bit too CDH-centric. This should now just
read: "To run this as a spark2 'spark-submit' job"

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/README.adoc@60
PS8, Line 60: ubmit command
> What does this mean?
With spark-submit, you supply the Spark master address as part of the
command-line options. In order to make this work for both spark-submit and
a standalone Java execution, the '--master' value from the Spark options
also needs to be supplied to the Kudu-Spark example via '-DSparkMaster='.
I can explain in more detail in chat/call if needed.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala
File examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala:

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@19
PS8, Line 19: import collection.JavaConverters._
           :
           : import org.slf4j.LoggerFactory
           :
           : import org.apache.kudu.client._
           : import org.apache.kudu.spark.kudu._
           :
           : import org.apache.spark.sql.Spa
> Double-check that you are complying with the import rules at https://github
Looks correct to me, but please call out any misuse.
Placed the JavaConverters first, rearranged LoggerFactory to come before
org.apache.kudu, and put org.apache.spark last. Only importing the required
definitions unless there are multiple required.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: kuduMasters
> As I commented elsewhere, can we name these parameters camelCase instead of
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: kuduMasters
> I know Grant said these should be capitalized like this if they are constan
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@32
PS8, Line 32: // Kud
> Space between the // and the first word
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@40
PS8, Line 40: // Defining a class that we'll use to insert data into the table
> It seems like there's benefits to using a case class-- the RDD toDF() calls
Changed this to the following:
// Defining a class that we'll use to insert data into the table.
// Because we're defining a case class here, we circumvent the need to
// explicitly define a schema later in the code, e.g. during our
// RDD -> toDF() calls.

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@45
PS8, Line 45:
> I think this comment is redundant and can be omitted, unless you are going
Added the following - let me know if this is insufficient:
// Define our session and context variables for use throughout the program.
// The kuduContext is a serializable container for Kudu client connections,
// while the SparkSession is the entry point to SparkSQL and the
// Dataset/DataFrame API.
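As an aside, the case-class point can be sketched in a few lines. This is only an illustrative fragment, not the patch's code: the 'User' class name and field names are hypothetical, and it assumes Spark 2.x on the classpath (the Kudu pieces are elided):

```scala
import org.apache.spark.sql.SparkSession

// A case class: Spark derives the DataFrame schema (column names and
// types) from its fields via reflection, so no explicit StructType is
// needed when converting an RDD of these to a DataFrame.
case class User(name: String, id: Int)

object CaseClassSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CaseClassSketch")
      .master("local[*]")
      .getOrCreate()

    // This import hangs off the 'spark' value just created, which is why
    // it cannot simply be hoisted to the top of the file.
    import spark.implicits._

    val data = Seq(User("alice", 1), User("bob", 2))
    // toDF() needs no schema argument because User is a case class.
    val userDF = spark.sparkContext.parallelize(data).toDF()
    userDF.printSchema()  // columns: name (string), id (int)

    spark.stop()
  }
}
```

The same mechanism is what lets the example skip an explicit schema in its RDD -> toDF() calls.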
http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@47
PS8, Line 47:
> Rename to kuduContext for maximum readability.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@49
PS8, Line 49: // The kuduContext is a serializable container for Kudu client connections,
           : // while the SparkSessio
> Can you import this at the top? I feel like this is an antipattern.
It is an anti-pattern, and it's ugly - but there's no way to move it that
I've been able to figure out. It's even in Spark's own documentation as such:
https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#hive-tables

Moving that import fails the build:

[ERROR] /Users/mbarnett/src/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala:14: error: not found: object spark
[ERROR] import spark.implicits._
[ERROR]        ^
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE

because the import uses the 'spark' instance we declare above it, which is
what actually contains the lib. This link explains it a bit better:
https://stackoverflow.com/questions/39968707/spark-2-0-missing-spark-implicits

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@52
PS8, Line 52: l kuduContext = new KuduContext(kuduMasters, spark.sq
> nit: I'd just say "The schema of the table we're going to use."
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@53
PS8, Line 53:
> This is obvious from the code and doesn't need to be comment.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@62
PS8, Line 62: StructField(nameCol, String
> Remove, so it's just "Create the..."
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@64
PS8, Line 64:
> Please split into lines of 80 characters or less. Occasional overflows up t
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@68
PS8, Line 68: if (!kuduContext.tableExists(tableName)) {
> This restates the code. Remove.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@76
PS8, Line 76: val userRDD = spark.sparkContext.parallelize(data)
           : val userDF = userRDD.toDF()
> Remove. Don't just restate the code in comments-- explain it.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@78
PS8, Line 78: kuduContext.insertRows(userDF, tableName)
> Add a log message here: "Reading back the rows written" or something.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@81
PS8, Line 81: fo("Reading back the rows just
> Does .show() not work?
I don't think there's a .show() call for RDDs, is there? readRDD.map returns
an RDD - as far as I know, .show() only exists on a DataFrame or is called
directly through sqlContext (as done below).

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@84
PS8, Line 84: val userTuple = readRDD.map { case Row(name: String, id: Int) => (name, id) }
> Remove.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@93
PS8, Line 93:
> Remove.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@96
PS8, Line 96: > kuduMasters, "kudu.table" -> tableName)).kudu
> This would be nicer if you used https://docs.scala-lang.org/overviews/core/
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@97
PS8, Line 97: sqlDF.createOrReplaceTempView(tableName)
> Extra newline.
Done

http://gerrit.cloudera.org:8080/#/c/11788/8/examples/scala/spark-example/src/main/scala/org/apache/kudu/examples/SparkExample.scala@101
PS8, Line 101: nally {
> Remove, or replace with "Clean up."
Done

--
To view, visit http://gerrit.cloudera.org:8080/11788
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ba09f0118c054a07b951e241c31d66245c57d3f
Gerrit-Change-Number: 11788
Gerrit-PatchSet: 10
Gerrit-Owner: Mitch Barnett <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Greg Solovyev <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mitch Barnett <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Mon, 29 Oct 2018 22:03:22 +0000
Gerrit-HasComments: Yes
