Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10375 )

Change subject: WIP: Kudu Backup/Restore Spark Jobs
......................................................................


Patch Set 8:

(13 comments)

I suck at Scala but I did a first pass on this. Overall it looks very good. Let 
me know if you want specific feedback on anything. Since this is a large patch 
already, I think it would be valuable to do some cleanup where needed and 
commit a baseline before continuing to add more features to this patch. Also I 
resisted the urge to point out formatting nits but we can do a pass like that 
before commit a bit later.

http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala:

http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@33
PS8, Line 33: object KuduBackup {
Can we mark this as experimental so we can commit some of this code even before 
it's fully baked?


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@55
PS8, Line 55:       writeTableMetadata(tableMetadata, tablePath, session)
We don't have to solve this immediately, and it has been mentioned before in 
design discussions, but it at least deserves a TODO here: what happens if there 
are new columns/partitions created or existing columns/partitions dropped while 
we are doing the backup? It seems that we are doing the backup and then 
collecting the table metadata when we are done and so right now there is 
nothing guaranteeing that the stored data on disk exactly matches up with the 
schema info written in the metadata file. At some point we considered 
prohibiting DDL ops while doing a backup to avoid that race, I'm sure there are 
other things we can do as well.


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@65
PS8, Line 65:     Serialization.write(metadata, out)
Rather than use Spark serialization based on POJO (POSO?) definitions for 
metadata files, I would like to see us use protobufs for the metadata so that 
evolving the fields is straightforward. I think it would make sense to define 
the serialization format in a .proto file and for the column schema and 
partition schema structs, hopefully we can directly reference SchemaPB and 
PartitionSchemaPB from src/kudu/common/common.proto

One caveat is figuring out the right container format to use for the serialized 
protobuf. One option is the PB container file format we use elsewhere in Kudu, 
the main drawback being that we would have to port the reader / writer code 
from C++ to Java, although that probably isn't very difficult. Another option 
could be to store it as Avro, and use a protobuf-to-Avro file converter 
library, of which there are a couple of implementations out there.


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@79
PS8, Line 79: move to a utility class
yeah, +1


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala
File 
java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala:

http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@30
PS8, Line 30: Address
nit: maybe we should name this "addresses"


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupOptions.scala@37
PS8, Line 37: timestamp
nit: UNIX timestamp in milliseconds since the epoch


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala:

http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@33
PS8, Line 33: @transient val table
how does this get reconstructed after deserialization?


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@38
PS8, Line 38: smaller scan tokens
you probably saw this but someone started working on this at 
https://gerrit.cloudera.org/c/10406/


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@65
PS8, Line 65: Array(leaderLocation)
Hmm, we may want to simply use token.getTablet.getReplicas because the leader 
might change at any time


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@69
PS8, Line 69: Do we need a custom spark partitioner
good question, I don't know the answer to this either right now but worth 
looking into


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@89
PS8, Line 89: efficiency
what kind of efficiency do you mean?


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@124
PS8, Line 124: Move to a utility class?
+1


http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
File 
java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala:

http://gerrit.cloudera.org:8080/#/c/10375/8/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@52
PS8, Line 52: @transient
Pardon my Scala-newbness, but is there some code somewhere to re-initialize the 
'sc' class field when an instance of KuduContext is deserialized?



--
To view, visit http://gerrit.cloudera.org:8080/10375
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If02183a2f833ffa0225eb7b0a35fc7531109e6f7
Gerrit-Change-Number: 10375
Gerrit-PatchSet: 8
Gerrit-Owner: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Fri, 18 May 2018 03:38:25 +0000
Gerrit-HasComments: Yes

Reply via email to