Grant Henke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12484 )

Change subject: KUDU-2672: [spark] Optionally repartition to match Kudu 
partitions
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
File 
java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala:

http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@386
PS1, Line 386:       val converter = new RowConverter(table.getSchema, schema, 
writeOptions.ignoreNull)
> I'd be fine with an approach here that mimics whatever Impala does. Given t
From a brief look at Impala's KuduPartitionExpr and ScalarExpr implementation 
and docs, it looks like Impala behaves the same way: it runs once for each 
"task".

https://github.com/apache/impala/blob/master/be/src/exprs/kudu-partition-expr.cc
https://github.com/apache/impala/blob/master/be/src/exprs/scalar-expr.h

I think this is okay. In the worst case, if a table changed while the job was 
running and a KuduPartitioner computed later differed from one computed 
upfront, the later one would actually be more accurate. Additionally, the data 
would still be loaded correctly, just potentially less efficiently.
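
To illustrate the argument above, here is a minimal sketch in plain Scala with 
no Spark or Kudu dependency. SplitPartitioner, TableSketch, and runTask are 
hypothetical stand-ins, not the real KuduPartitioner or KuduTable API. Each 
"task" builds its own partitioner from whatever table state it sees, so a 
table change mid-job alters the grouping but never drops a row.

```scala
// Sketch only: SplitPartitioner and TableSketch are hypothetical stand-ins,
// not the real org.apache.kudu.client.KuduPartitioner or KuduTable.
object PerTaskPartitionerSketch {
  // Maps an integer key to a partition index using sorted range split points.
  final class SplitPartitioner(splits: Vector[Int]) {
    def numPartitions: Int = splits.length + 1
    def partitionOf(key: Int): Int = splits.count(_ <= key)
  }

  // A "table" whose range splits may change while a job is running.
  final class TableSketch(var splits: Vector[Int]) {
    def newPartitioner(): SplitPartitioner = new SplitPartitioner(splits)
  }

  // Each task builds its own partitioner from the table state it sees
  // at the moment it runs, rather than reusing one computed upfront.
  def runTask(table: TableSketch, rows: Seq[Int]): Map[Int, Seq[Int]] = {
    val partitioner = table.newPartitioner()
    rows.groupBy(r => partitioner.partitionOf(r))
  }

  def main(args: Array[String]): Unit = {
    val table = new TableSketch(Vector(10, 20))
    val rows = Seq(5, 12, 25, 17)

    val early = runTask(table, rows)
    table.splits = Vector(10, 15, 20) // the table is altered mid-job
    val late = runTask(table, rows)

    // Every row is still accounted for in both runs; only the grouping differs.
    assert(early.values.flatten.toSeq.sorted == rows.sorted)
    assert(late.values.flatten.toSeq.sorted == rows.sorted)
    println(s"early partitions: ${early.keySet.size}, late partitions: ${late.keySet.size}")
  }
}
```

In the sketch, a task using a stale split list would only group rows less 
precisely; the rows themselves are unchanged, which mirrors the "correct but 
potentially less efficient" outcome described above.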


http://gerrit.cloudera.org:8080/#/c/12484/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@407
PS1, Line 407:     val shuffledRDD = if (writeOptions.repartitionSort) {
> Quoting from the second page:
I don't think Spark gives me a way to get an "estimate" of rows before 
defining part of the execution graph.

Perhaps an optimization like this is less important in Spark, given that its 
use is less interactive. And in cases where it is interactive, this feature 
can be disabled.
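
As a rough sketch of what "disabled" means here, the snippet below gates the 
extra grouping and sorting step on a flag, using plain Scala collections in 
place of an RDD. Only the flag name repartitionSort comes from the patch; 
WriteOptionsSketch, prepare, and the partitionOf function are hypothetical 
and assumed for illustration.

```scala
object OptionalRepartitionSketch {
  // Hypothetical stand-in for the write options; only the flag name
  // repartitionSort is taken from the patch under review.
  final case class WriteOptionsSketch(repartitionSort: Boolean)

  // When the flag is set, order rows by target partition and then by key.
  // When it is not set, pass the input through untouched (no extra work).
  def prepare(rows: Seq[Int], partitionOf: Int => Int, opts: WriteOptionsSketch): Seq[Int] =
    if (opts.repartitionSort) rows.sortBy(r => (partitionOf(r), r))
    else rows
}
```

An interactive job that cannot afford the extra step would pass 
WriteOptionsSketch(repartitionSort = false) and get its rows back in their 
original order.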



--
To view, visit http://gerrit.cloudera.org:8080/12484
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8763615997bccc08901235841149fc3bacb321e7
Gerrit-Change-Number: 12484
Gerrit-PatchSet: 3
Gerrit-Owner: Grant Henke <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Mon, 25 Feb 2019 16:30:42 +0000
Gerrit-HasComments: Yes
