Grant Henke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12484 )

Change subject:  KUDU-2672: [spark] Optionally repartition to match Kudu 
partitions
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/12484/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12484/1//COMMIT_MSG@12
PS1, Line 12:
            : Repartitioning ensures that one task/client is only
            : writing to a single tablet. This improves throughput
            : by improving batching especially for tables with a large
            : number of partitions.
> Have you had a chance to try this on a real cluster? What sort of improveme
I haven't yet, but I plan to. I can update the commit message when I do.


http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
File 
java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala:

http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@385
PS1, Line 385: tables
> table's
Done


http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@386
PS1, Line 386:     val keyedRdd = rdd.mapPartitions { rows =>
> Can we hoist the openTable and/or KuduPartitionerBuilder.build() out of the
I make these calls inside of mapPartitions, because each partition is a 
task/executor. The problem with moving this call outside of mapPartitions is 
that the converter and partitioner would need to be serializable. This step 
should effectively happen all at once.

Do you think it's worth making these things serializable?


http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@389
PS1, Line 389:       val partitioner = new 
KuduPartitioner.KuduPartitionerBuilder(table).build()
> Can we satisfy getPartitionCount() above using the table and partitioner we
If we made them serializable we could. See my response above.


http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@407
PS1, Line 407:     val shuffledRDD = if (writeOptions.repartitionSort) {
> I think Impala will automatically partial sort if the number of rows being
Best I can tell we match Impala's functionality of sorting within a partition. 
I am looking at https://issues.apache.org/jira/browse/IMPALA-3742 for reference.

https://impala.apache.org/docs/build/html/topics/impala_kudu.html#kudu_dml


http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
File 
java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala:

http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala@454
PS1, Line 454:       new KuduWriteOptions(repartition = true, repartitionSort = 
true))
> Should we also test with repartitionSort=false?
yeah, we should. I will add.



--
To view, visit http://gerrit.cloudera.org:8080/12484
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8763615997bccc08901235841149fc3bacb321e7
Gerrit-Change-Number: 12484
Gerrit-PatchSet: 1
Gerrit-Owner: Grant Henke <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Fri, 15 Feb 2019 16:22:22 +0000
Gerrit-HasComments: Yes

Reply via email to