Grant Henke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10375 )

Change subject: WIP: Kudu Backup/Restore Spark Jobs
......................................................................


Patch Set 13:

(34 comments)

http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/build.gradle
File java/kudu-backup/build.gradle:

http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/build.gradle@26
PS12, Line 26:   compile (libs.protobufJavaUtil) {
             :     // Make sure wrong Guava version is not pulled in.
             :     exclude group: "com.google.guava", module: "guava"
             :   }
             :   compile (libs.scopt)  {
             :
> Doc why we need these exclusions?
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/pom.xml
File java/kudu-backup/pom.xml:

http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/pom.xml@211
PS12, Line 211:             <plugin>
> Need to specify ${maven-surefire-plugin.version} for the version?
We define that in plugin-management on the root pom. Not sure why we use 
plugin-management sometimes but not others. Also not sure why we don't use 
dependency-management.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/pom.xml@221
PS12, Line 221:         </plugins>
> Do we actually need to replica this here?
I assume this is what makes eclipse work. I added it for that. Hopefully maven 
is gone soon. :D


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/protobuf/backup.proto
File java/kudu-backup/src/main/protobuf/backup.proto:

PS12:
> Please add a short doc to each message definition.
Do you have a suggestion what to include so that the doc enhances the 
descriptive names and the overall summary at the top?


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/protobuf/backup.proto@19
PS12, Line 19: // Though these are similar to the kudu common protobufs,
             : // they are specific to the kudu backup application and
             : // should be kept seperate.
> FWIW, I think the stronger explanation is that these are strictly "applicat
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/protobuf/backup.proto@28
PS12, Line 28:
> Nit: stylistically, we tend to order our PB definitions such that "leaf" me
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala:

http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@50
PS12, Line 50:       // TODO: Take parameter for the SaveMode
> What effect does this have? Parallelism of the backup process?
I was thinking this would be a way to logically define which files go with the 
partitions. However, after further discussion adding more information in the 
metadata file or even using the metadata in parquet itself is likely a good 
approach when that is needed.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@65
PS12, Line 65:     val json = JsonFormat.printer().print(metadata)
> Doc what does 'false' means here.
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@74
PS12, Line 74:
> Can we add some useful text to this exception?
Scopt (the argument parser library) will print an error message to the CLI in 
the case we couldn't parse the options. That said I can still add generic text 
to the error message.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala
File 
java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala:

PS12:
> License.
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@31
PS12, Line 31:                              format: String = 
KuduBackupOptions.defaultFormat,
> Nit: some of these are terminated with a period and some aren't. Fix?
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@36
PS12, Line 36: object KuduBackupOptions {
> Nit: "Comma-separated addresses of Kudu masters"
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@46
PS12, Line 46:       .action((v, o) => o.copy(path = v))
> What are valid values for this?
I have a todo to limit the "accepted" formats in the code where this argument 
is used. I expect we will provide an enum of valid values or eliminate this 
once we settle on a format. For now this allows experimenting with any format 
the spark supports.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@49
PS12, Line 49:
> Is this something really worth customizing?
We may eliminate this, but it exists so we can experiment during our 
performance testing.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@61
PS12, Line 61:       .action((v, o) => o.copy(format = v))
> Do we really want to expose this?
We may eliminate this, but it exists so we can experiment during our 
performance testing.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala:

http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@38
PS12, Line 38: sc
> Nit: maybe sparkContext for symmetry with kuduContext?
KuduContext calls it `sc` too.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@49
PS12, Line 49:     // Create the scan tokens for each partition
> Doc why the scan is fault tolerant and why we have to use LEADER_ONLY.
I have a todo below to provide better support for picking replicas. For now 
this is just for simplicity.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala:

http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala@61
PS12, Line 61:       val writeOptions = new 
KuduWriteOptions(ignoreDuplicateRowErrors = false, ignoreNull = false)
> Doc what the two 'false' args mean.
Since KuduWriteOptions is a scala class I can use named parameters to make this 
more clear. I will do that.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala@87
PS12, Line 87:       .getOrElse(throw new IllegalArgumentException("could not 
parse the arguments"))
> Add useful text.
See my comment on KuduBackup.scala


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala
File 
java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala:

PS12:
> License.
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala@18
PS12, Line 18:
> Much of the feedback I left on KuduBackupOptions applies here too.
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala@38
PS12, Line 38:
> Only relevant when createTables is true, right?
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala@43
PS12, Line 43:       .action((v, o) => o.copy(path = v))
> Nit: "true to create tables during restore, false if they already exist"
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestoreOptions.scala@48
PS12, Line 48:       .action((v, o) => o.copy(kuduMasterAddresses = v))
> Not really clear what the "overriding" behavior means; there's not enough c
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala
File 
java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala:

http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@47
PS12, Line 47:
> Can you make this a constant somewhere so that if we decide to change the d
It is a constant in KuduRestoreOptions.defaultTableSuffix


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@58
PS12, Line 58:
             :   test("Random Backup and Restore") {
             :     Random.javaRandomToR
> One thing we found useful is, if you look at SeedRandom() (C++ Kudu test fu
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@79
PS12, Line 79:   // TODO: Move to a Schema equals/equivalent method
> This seems like it'd be generally useful outside of BDR. Is there a natural
This could actually be refactored all the way back to the kudu-client tests. I 
would like to do that to make it more generally useable, but maybe as a follow 
up? I will leave a todo.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@95
PS12, Line 95:       defaultValuesMatch(before.getDefaultValue, 
after.getDefaultValue) &&
             :       Objects.equal(before.getDesiredBlockSize, 
after.getDesiredBlockSize) &&
             :       Objects.equal(before.getEncoding, after.getEncoding) &&
> Moe this to L138? No need to compute if we're not going to use it. Not sure
I define this upfront given I need it in 2 contexts. I need it for defining the 
column and the default value. This seamed the most simple approach given it's 
just randomly generated, needs to match, and it's test code. I am open to 
suggestions for a better way though.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@101
PS12, Line 101:
> Can this be expressed as a match statement? Seems like it'd be easier to re
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@114
PS12, Line 114:   def partitionSchemasMatch(before: PartitionSchema, after: 
PartitionSchema): Boolean = {
> Move this to L142? No need to compute it if we're not going to use a defaul
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala@150
PS12, Line 150:       val precision = 
Random.nextInt(DecimalUtil.MAX_DECIMAL_PRECISION) + 1
> Our tests all use just one tserver? That's disappointing.
Yeah, per the base spark test. We can and should improve this in a separate 
patch.


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-client/src/main/java/org/apache/kudu/Type.java
File java/kudu-client/src/main/java/org/apache/kudu/Type.java:

http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-client/src/main/java/org/apache/kudu/Type.java@194
PS12, Line 194:     for (Type t : values()){
> Nit: for (Type ...
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-client/src/main/java/org/apache/kudu/Type.java@195
PS12, Line 195:       if (t.name().equals(name)){
> Nit: if (t.name() ...
Done


http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala
File 
java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala:

http://gerrit.cloudera.org:8080/#/c/10375/12/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@117
PS12, Line 117:       .setNumReplicas(1)
> Nit: indentation.
Done



--
To view, visit http://gerrit.cloudera.org:8080/10375
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If02183a2f833ffa0225eb7b0a35fc7531109e6f7
Gerrit-Change-Number: 10375
Gerrit-PatchSet: 13
Gerrit-Owner: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Wed, 13 Jun 2018 15:53:22 +0000
Gerrit-HasComments: Yes

Reply via email to