[kudu-CR] Create parallelized loader Spark job

Mike Percy (Code Review) Wed, 19 Dec 2018 12:04:08 -0800

Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12101 )


Change subject: Create parallelized loader Spark job
......................................................................


Patch Set 1: Code-Review+1

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12101/1/java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala
File 
java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala:

http://gerrit.cloudera.org:8080/#/c/12101/1/java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala@83
PS1, Line 83:       rowsWritten += 1
> The subtraction and addition is a little weird. Maybe you can add the happy
It's tough to do that because we only get those errors on auto- or manual 
flush, not in every iteration of the loop. So we increment in the happy path 
and subtract in the (occasional) sad path. Perhaps this is worth a comment.


http://gerrit.cloudera.org:8080/#/c/12101/1/java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/DistributedDataGeneratorTest.scala
File 
java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/DistributedDataGeneratorTest.scala:

http://gerrit.cloudera.org:8080/#/c/12101/1/java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/DistributedDataGeneratorTest.scala@19
PS1, Line 19:   private val TABLE_SCHEMA: Schema = {
> Can you use your fancy schema generator to improve coverage?
Can we leave that as a TODO? This is in service of a specific task which is 
measuring backup performance and this implementation and test seem good enough 
as-is for that purpose.


http://gerrit.cloudera.org:8080/#/c/12101/1/java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/DistributedDataGeneratorTest.scala@32
PS1, Line 32:     .setNumReplicas(1)
> Why this?
I probably originally wrote this part of the test before handing this off to 
Grant, after which he did a bunch of refactoring of the generator class. The 
way I would justify this is that we don't really care about replication for the 
purposes of correctness testing on this data generator. The leader will do all 
the validation we need for the scope of this test and adding replicas just 
makes it more likely to have a noisier / flakier test.



--
To view, visit http://gerrit.cloudera.org:8080/12101
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibdfd41a21a7f80d22125c7f4e5ca4ed62c31709d
Gerrit-Change-Number: 12101
Gerrit-PatchSet: 1
Gerrit-Owner: Grant Henke <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Comment-Date: Wed, 19 Dec 2018 19:45:28 +0000
Gerrit-HasComments: Yes

[kudu-CR] Create parallelized loader Spark job

Reply via email to