Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/10375 )
Change subject: WIP: Kudu Backup/Restore Spark Jobs ...................................................................... Patch Set 13: (34 comments) http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/build.gradle File java/kudu-backup/build.gradle: http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/build.gradle@26 PS12, Line 26: compile (libs.protobufJavaUtil) { : // Make sure wrong Guava version is not pulled in. : exclude group: "com.google.guava", module: "guava" : } : compile (libs.scopt) { : > Doc why we need these exclusions? Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/pom.xml File java/kudu-backup/pom.xml: http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/pom.xml@211 PS12, Line 211: <plugin> > Need to specify ${maven-surefire-plugin.version} for the version? We define that in plugin-management on the root pom. Not sure why we use plugin-management sometimes but not others. Also not sure why we don't use dependency-management. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/pom.xml@221 PS12, Line 221: </plugins> > Do we actually need to replica this here? I assume this is what makes eclipse work. I added it for that. Hopefully maven is gone soon. :D http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/protobuf/backup.proto File java/kudu-backup/src/main/protobuf/backup.proto: PS12: > Please add a short doc to each message definition. Do you have a suggestion what to include so that the doc enhances the descriptive names and the overall summary at the top? http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/protobuf/backup.proto@19 PS12, Line 19: // Though these are similar to the kudu common protobufs, : // they are specific to the kudu backup application and : // should be kept seperate. > FWIW, I think the stronger explanation is that these are strictly "applicat Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/protobuf/backup.proto@28 PS12, Line 28: > Nit: stylistically, we tend to order our PB definitions such that "leaf" me Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala: http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@50 PS12, Line 50: // TODO: Take parameter for the SaveMode > What effect does this have? Parallelism of the backup process? I was thinking this would be a way to logically define which files go with the partitions. However, after further discussion adding more information in the metadata file or even using the metadata in parquet itself is likely a good approach when that is needed. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@65 PS12, Line 65: val json = JsonFormat.printer().print(metadata) > Doc what does 'false' means here. Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@74 PS12, Line 74: > Can we add some useful text to this exception? Scopt (the argument parser library) will print an error message to the CLI in the case we couldn't parse the options. That said I can still add generic text to the error message. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala: PS12: > License. Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@31 PS12, Line 31: format: String = KuduBackupOptions.defaultFormat, > Nit: some of these are terminated with a period and some aren't. Fix? Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@36 PS12, Line 36: object KuduBackupOptions { > Nit: "Comma-separated addresses of Kudu masters" Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@46 PS12, Line 46: .action((v, o) => o.copy(path = v)) > What are valid values for this? I have a todo to limit the "accepted" formats in the code where this argument is used. I expect we will provide an enum of valid values or eliminate this once we settle on a format. For now this allows experimenting with any format the spark supports. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@49 PS12, Line 49: > Is this something really worth customizing? We may eliminate this, but it exists so we can experiment during our performance testing. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@61 PS12, Line 61: .action((v, o) => o.copy(format = v)) > Do we really want to expose this? We may eliminate this, but it exists so we can experiment during our performance testing. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala: http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@38 PS12, Line 38: sc > Nit: maybe sparkContext for symmetry with kuduContext? KuduContext calls it `sc` too. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@49 PS12, Line 49: // Create the scan tokens for each partition > Doc why the scan is fault tolerant and why we have to use LEADER_ONLY. I have a todo below to provide better support for picking replicas. For now this is just for simplicity. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala: http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala@61 PS12, Line 61: val writeOptions = new KuduWriteOptions(ignoreDuplicateRowErrors = false, ignoreNull = false) > Doc what the two 'false' args mean. Since KuduWriteOptions is a scala class I can use named parameters to make this more clear. I will do that. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala@87 PS12, Line 87: .getOrElse(throw new IllegalArgumentException("could not parse the arguments")) > Add useful text. See my comment on KuduBackup.scala http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala: PS12: > License. Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala@18 PS12, Line 18: > Much of the feedback I left on KuduBackupOptions applies here too. Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala@38 PS12, Line 38: > Only relevant when createTables is true, right? Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala@43 PS12, Line 43: .action((v, o) => o.copy(path = v)) > Nit: "true to create tables during restore, false if they already exist" Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala@48 PS12, Line 48: .action((v, o) => o.copy(kuduMasterAddresses = v)) > Not really clear what the "overriding" behavior means; there's not enough c Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala File java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala: http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@47 PS12, Line 47: > Can you make this a constant somewhere so that if we decide to change the d It is a constant in KuduRestoreOptions.defaultTableSuffix http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@58 PS12, Line 58: : test("Random Backup and Restore") { : Random.javaRandomToR > One thing we found useful is, if you look at SeedRandom() (C++ Kudu test fu Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@79 PS12, Line 79: // TODO: Move to a Schema equals/equivalent method > This seems like it'd be generally useful outside of BDR. Is there a natural This could actually be refactored all the way back to the kudu-client tests. I would like to do that to make it more generally useable, but maybe as a follow up? I will leave a todo. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@95 PS12, Line 95: defaultValuesMatch(before.getDefaultValue, after.getDefaultValue) && : Objects.equal(before.getDesiredBlockSize, after.getDesiredBlockSize) && : Objects.equal(before.getEncoding, after.getEncoding) && > Moe this to L138? No need to compute if we're not going to use it. Not sure I define this upfront given I need it in 2 contexts. I need it for defining the column and the default value. This seamed the most simple approach given it's just randomly generated, needs to match, and it's test code. I am open to suggestions for a better way though. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@101 PS12, Line 101: > Can this be expressed as a match statement? Seems like it'd be easier to re Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@114 PS12, Line 114: def partitionSchemasMatch(before: PartitionSchema, after: PartitionSchema): Boolean = { > Move this to L142? No need to compute it if we're not going to use a defaul Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@150 PS12, Line 150: val precision = Random.nextInt(DecimalUtil.MAX_DECIMAL_PRECISION) + 1 > Our tests all use just one tserver? That's disappointing. Yeah, per the base spark test. We can and should improve this in a separate patch. http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-client/src/main/java/org/apache/kudu/Type.java File java/kudu-client/src/main/java/org/apache/kudu/Type.java: http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-client/src/main/java/org/apache/kudu/Type.java@194 PS12, Line 194: for (Type t : values()){ > Nit: for (Type ... Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-client/src/main/java/org/apache/kudu/Type.java@195 PS12, Line 195: if (t.name().equals(name)){ > Nit: if (t.name() ... Done http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala: http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@117 PS12, Line 117: .setNumReplicas(1) > Nit: indentation. Done -- To view, visit http://gerrit.cloudera.org:8080/10375 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If02183a2f833ffa0225eb7b0a35fc7531109e6f7 Gerrit-Change-Number: 10375 Gerrit-PatchSet: 13 Gerrit-Owner: Grant Henke <granthe...@apache.org> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Grant Henke <granthe...@apache.org> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Wed, 13 Jun 2018 15:53:22 +0000 Gerrit-HasComments: Yes