Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )
Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................

Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@216
PS2, Line 216:   private def getOperationType(parameters: Map[String, String]): OperationType = {
             :     parameters.get(OPERATION).map(stringToOperationType).getOrElse(Upsert)
             :   }
> I didn't change this behavior. I just refactored it into the method from ab
Ah, I missed the old L105. SGTM.

http://gerrit.cloudera.org:8080/#/c/12087/4/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/12087/4/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@449
PS4, Line 449:  * In order to preserve exactly once semantics a sink must be idempotent in the face of
             :  * multiple attempts to add the same batch.
             :  *
             :  * Insert ignore support (KUDU-1563) would be useful, but while that doesn't exist
             :  * using upsert will work. Delete ignore would also be useful.
We chatted about this offline. I think it'd be helpful to add some context about what Spark is doing; that would clarify why we don't need the batchId and how users should think about the KuduSink options (especially `operationType`).

My attempt at a revised class-level doc:

"Sinks provide at-least-once semantics by retrying failed batches, and provide a `batchId` interface to implement exactly-once semantics. Since Kudu does not internally track batch IDs, the `batchId` is ignored, and it is up to the user to specify an appropriate `operationType` to achieve the desired semantics when adding batches. The default `Upsert` allows KuduSink to handle duplicate data and such retries.

Insert ignore support (KUDU-1563) would be useful, but while that doesn't exist, using Upsert will work. Delete ignore would also be useful."

--
To view, visit http://gerrit.cloudera.org:8080/12087
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 4
Gerrit-Owner: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Hao Hao <hao....@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Comment-Date: Wed, 09 Jan 2019 23:55:37 +0000
Gerrit-HasComments: Yes