Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12484 )
Change subject: KUDU-2672: [spark] Optionally repartition to match Kudu partitions ...................................................................... Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/12484/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12484/1//COMMIT_MSG@12 PS1, Line 12: : Repartitioning ensures that one task/client is only : writing to a single tablet. This improves throughput : by improving batching especially for tables with a large : number of partitions. > Have you had a chance to try this on a real cluster? What sort of improveme I haven't yet, but I plan to. I can update the commit message when I do. http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala: http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@385 PS1, Line 385: tables > table's Done http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@386 PS1, Line 386: val keyedRdd = rdd.mapPartitions { rows => > Can we hoist the openTable and/or KuduPartitionerBuilder.build() out of the I make these calls inside of mapPartitions, because each partition is a task/executor. The problem with moving this call outside of mapPartitions is that the converter and partitioner would need to be serializable. This step should effectively happen all at once. Do you think it's worth making these things serializable? http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@389 PS1, Line 389: val partitioner = new KuduPartitioner.KuduPartitionerBuilder(table).build() > Can we satisfy getPartitionCount() above using the table and partitioner we If we made them serializable we could. See my response above. http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@407 PS1, Line 407: val shuffledRDD = if (writeOptions.repartitionSort) { > I think Impala will automatically partial sort if the number of rows being Best I can tell we match Impala's functionality of sorting within a partition. I am looking at https://issues.apache.org/jira/browse/IMPALA-3742 for reference. https://impala.apache.org/docs/build/html/topics/impala_kudu.html#kudu_dml http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala: http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala@454 PS1, Line 454: new KuduWriteOptions(repartition = true, repartitionSort = true)) > Should we also test with repartitionSort=false? yeah, we should. I will add. -- To view, visit http://gerrit.cloudera.org:8080/12484 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8763615997bccc08901235841149fc3bacb321e7 Gerrit-Change-Number: 12484 Gerrit-PatchSet: 1 Gerrit-Owner: Grant Henke <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Grant Henke <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Will Berkeley <[email protected]> Gerrit-Comment-Date: Fri, 15 Feb 2019 16:22:22 +0000 Gerrit-HasComments: Yes
