Todd Lipcon has posted comments on this change.

Change subject: Spark connectors for Kudu
......................................................................
Patch Set 4:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/1788/4/java/kudu-spark/pom.xml
File java/kudu-spark/pom.xml:

Line 92: <artifactId>hadoop-client</artifactId>
hadoop-client includes stuff like the HDFS client which we shouldn't need. I think we can probably get by with hadoop-common (and maybe some set of exclusions?). Best I can tell we only use Configuration and NullWritable, which are both in the common jar and shouldn't have any real deps. Everything else I think we should pick up transitively from kudu-mapreduce above.

http://gerrit.cloudera.org:8080/#/c/1788/4/java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/DefaultSource.scala:

Line 52: throw new IllegalArgumentException(s"Invalid value for $TABLE_KEY '$tableName'")
woah. mind blown.

Line 90: kuduSchema.getColumns.asScala.map({ c => c.getName -> c }).toMap
this is my scala inexperience talking, but I find it hard to know whether => or -> has higher precedence. could you add some parens to disambiguate? if I'm reading it correctly it's c => (c.getName -> c)

Line 118: * @param requiredColumns clumns that are being requested by the requesting query
typo

Line 129: def getKuduValue(row: RowResult, columnName: String): Any = {
can this be private?

http://gerrit.cloudera.org:8080/#/c/1788/4/java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala
File java/kudu-spark/src/main/scala/org/kududb/spark/KuduContext.scala:

Line 144: f(it, syncClient, asyncClient)
I think this causes both of the clients to end up evaluated, which constructs parallel clients (and likely twice as many nettys, etc)... based on http://matt.might.net/articles/implementing-laziness/, you'd need to change the parameters to be "by-name" parameters. Given that the async client is also considered a bit unstable, maybe we should have separate functions, one that yields async, and one that yields sync?
or just do the sync one, since you could probably have an API that grabs the async out of the sync wrapper?

http://gerrit.cloudera.org:8080/#/c/1788/4/java/kudu-spark/src/test/scala/org/kududb/spark/DefaultSourceSuite.scala
File java/kudu-spark/src/test/scala/org/kududb/spark/DefaultSourceSuite.scala:

Line 36: assert(sqlContext.sql("SELECT * FROM " + tableName).collectAsList().size() == rowCount)
I think we should test some basic value-dependent queries like SUM and GROUP_CONCAT(strings) -- there's a lot of subtlety in the fact that our MR input format will reuse the same RowResult object, and we should make sure that Spark respects this correctly

http://gerrit.cloudera.org:8080/#/c/1788/4/java/kudu-spark/src/test/scala/org/kududb/spark/TestContext.scala
File java/kudu-spark/src/test/scala/org/kududb/spark/TestContext.scala:

Line 42: new ColumnSchemaBuilder("c2_s", Type.STRING).build())
should test some nulls

--
To view, visit http://gerrit.cloudera.org:8080/1788
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic187513ef9724d50024f7401d7ecd19d53554245
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
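The hadoop-common swap suggested in the pom.xml comment might look roughly like this. This is a sketch only: the `${hadoop.version}` property and the jetty exclusion are illustrative guesses, and the actual exclusion set would need to be checked against `mvn dependency:tree` for the Hadoop version in use.

```xml
<!-- Sketch: replace hadoop-client with hadoop-common, since the review
     notes only Configuration and NullWritable are used from it. The
     exclusion below is an example, not a verified list. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```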
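The precedence question in the Line 90 comment can be checked with a small standalone sketch (the names here are illustrative, not from the patch): `->` is an ordinary method call that builds a `Tuple2`, so it binds tighter than the `=>` introducing the lambda body, and the two forms below parse identically.

```scala
// Demonstrates that `c => c.getName -> c` parses as `c => (c.getName -> c)`:
// `->` builds a pair inside the lambda body, not the other way around.
object PrecedenceDemo {
  def withoutParens(names: Seq[String]): Map[String, Int] =
    names.map(n => n -> n.length).toMap

  def withParens(names: Seq[String]): Map[String, Int] =
    names.map(n => (n -> n.length)).toMap
}
```

Both variants produce the same map, so the parens change nothing except readability.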
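The "by-name" parameters mentioned in the Line 144 comment can be sketched as follows. The names (`withClients`, `buildSync`, `buildAsync`) are hypothetical stand-ins, not the patch's real API; the point is that a `=> T` parameter is only evaluated when it is actually referenced, so the unused client is never constructed.

```scala
// Sketch of by-name parameters: `sync: => String` means the argument
// expression is deferred and evaluated only on reference, so passing
// both client constructors does not build both clients.
object ByNameDemo {
  def withClients[A](sync: => String, async: => String)(useAsync: Boolean): String =
    if (useAsync) async else sync  // only one by-name argument is ever forced

  // Records which "clients" a call actually constructs.
  def constructed(useAsync: Boolean): List[String] = {
    var built = List.empty[String]
    def buildSync(): String = { built = built :+ "sync"; "sync-client" }
    def buildAsync(): String = { built = built :+ "async"; "async-client" }
    withClients(buildSync(), buildAsync())(useAsync)
    built
  }
}
```

Note that a by-name argument is re-evaluated on every reference, so in real code the forced client should be bound to a `val` (or the parameter made `lazy`) if it is used more than once.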
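The row-reuse subtlety behind the SUM/GROUP_CONCAT test suggestion can be shown without Spark or Kudu at all. `MutableRow` and `reusingIterator` below are stand-ins for a RowResult-style reused object, not real Kudu types: if a consumer retains the shared row instead of copying values out per element, every retained reference aliases the final row.

```scala
// Sketch of the reuse hazard: an iterator that recycles one mutable row
// gives wrong aggregates if rows are kept without copying their values.
object RowReuseDemo {
  final class MutableRow { var value: Int = 0 }

  // Simulates an input format reusing a single row object across next() calls.
  def reusingIterator(values: Seq[Int]): Iterator[MutableRow] = {
    val row = new MutableRow
    values.iterator.map { v => row.value = v; row }
  }

  // Buggy consumer: retains the shared row, so all elements alias the last value.
  def retainRows(values: Seq[Int]): Seq[Int] =
    reusingIterator(values).toList.map(_.value)

  // Correct consumer: copies the value out of each row before advancing.
  def copyValues(values: Seq[Int]): Seq[Int] =
    reusingIterator(values).map(_.value).toList
}
```

A value-dependent test like a SUM over `copyValues` catches exactly the aliasing bug that a plain row-count assertion misses.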
