Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12879 )
Change subject: WIP: [backup] Add incremental backup/restore support ...................................................................... Patch Set 2: (11 comments) http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/protobuf/backup.proto File java/kudu-backup/src/main/protobuf/backup.proto: http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/protobuf/backup.proto@118 PS2, Line 118: int32 version = 9; Skipped 8? Since this is a new message type, I'd actually make this the first field and move it to the top, to increase its visibility. http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/BackupGraph.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/BackupGraph.scala: http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/BackupGraph.scala@163 PS2, Line 163: * Vertex class to represent vertices in the backup graph. Curious why you went with the noun "vertex"; isn't "node" the more commonly used term in CS? http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala: http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@35 PS2, Line 35: * */ Nit: Javadoc not quite formatted properly? http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@58 PS2, Line 58: // find the previous backup and use the `to_ms` metadata as the `from_ms` time for this backup. Nit: too long? http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala: http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackupRDD.scala@144 PS2, Line 144: val columnCount = if (incremental) fieldCount - 1 else fieldCount This function is called per-row, right? I wonder if it'd be more performant to have a "full" version and an "incremental" version rather than checking "if incremental" with each call. http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala: http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala@39 PS2, Line 39: * */ Formatting? http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala@54 PS2, Line 54: // Disable file splitting to keep one Spark task per Kudu partition : // and reduce compaction when restoring. : session.conf.set(ParquetInputFormat.SPLIT_FILES, false) Did you test what effect this has? If it may have a drastic impact on performance, may want to separate it into its own patch so it's not buried in this large incremental backup patch. http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala@100 PS2, Line 100: // Get the operation type based on the change type column. : // This will always be the last column in the row. : val rowActionValue = row.getByte(row.length - 1) Does Scala allow returning tuples like in Python? Seems like it'd be more elegant to return the row action separately from the row itself. Maybe more efficient too; you wouldn't need to convert it from an enum into a byte and back. http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala: http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala@37 PS2, Line 37: tables: Seq[String], : rootPath: String, : kuduMasterAddresses: String = InetAddress.getLocalHost.getCanonicalHostName, Why do these three need to be listed here and in RestoreOptions if both extend CommonOptions? http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala@93 PS2, Line 93: // TODO: Document the limitations based on cluster configuration (ex: ancient history watermark). Nit: too long? http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/SessionIO.scala File java/kudu-backup/src/main/scala/org/apache/kudu/backup/SessionIO.scala: http://gerrit.cloudera.org:8080/#/c/12879/2/java/kudu-backup/src/main/scala/org/apache/kudu/backup/SessionIO.scala@59 PS2, Line 59: class SessionIO(val session: SparkSession, options: CommonOptions) { Could you doc at least the interesting methods here? -- To view, visit http://gerrit.cloudera.org:8080/12879 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I50b21d921fefbf4d7e8bd1766285258e8014d890 Gerrit-Change-Number: 12879 Gerrit-PatchSet: 2 Gerrit-Owner: Grant Henke <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Grant Henke <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Will Berkeley <[email protected]> Gerrit-Comment-Date: Mon, 15 Apr 2019 04:21:22 +0000 Gerrit-HasComments: Yes
