Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12484 )
Change subject: KUDU-2672: [spark] Optionally repartition to match Kudu partitions
......................................................................

Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala:

http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@386
PS1, Line 386: val converter = new RowConverter(table.getSchema, schema, writeOptions.ignoreNull)
> I'd be fine with an approach here that mimics whatever Impala does. Given t

From a brief look at Impala's KuduPartitionExpr and ScalarExpr implementations and docs, it looks like Impala behaves the same way: it runs once for each "task".

https://github.com/apache/impala/blob/master/be/src/exprs/kudu-partition-expr.cc
https://github.com/apache/impala/blob/master/be/src/exprs/scalar-expr.h

I think this is okay. In the worst case, if a table changed while the job was running and a KuduPartitioner computed later differed from one computed upfront, the later partitioner would actually be the more accurate one. Additionally, the data would still be correct when loaded, just potentially loaded less efficiently.

http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@407
PS1, Line 407: val shuffledRDD = if (writeOptions.repartitionSort) {
> Quoting from the second page:

I don't think Spark gives me a way to get an "estimate" of the row count before defining part of the execution graph. Perhaps an optimization like this is less important in Spark, given that its use is less interactive. In cases where it is interactive, this feature can be disabled.

--
To view, visit http://gerrit.cloudera.org:8080/12484
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8763615997bccc08901235841149fc3bacb321e7
Gerrit-Change-Number: 12484
Gerrit-PatchSet: 3
Gerrit-Owner: Grant Henke <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Mon, 25 Feb 2019 16:30:42 +0000
Gerrit-HasComments: Yes
