[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Andrew Wong (Code Review) Wed, 09 Jan 2019 15:56:18 -0800

Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12087 )


Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File 
java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@216
PS2, Line 216:   private def getOperationType(parameters: Map[String, String]): OperationType = {
             :     parameters.get(OPERATION).map(stringToOperationType).getOrElse(Upsert)
             :   }
> I didn't change this behavior. I just refactored it into the method from ab
Ah I missed the old L105. SGTM.


http://gerrit.cloudera.org:8080/#/c/12087/4/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File 
java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/12087/4/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@449
PS4, Line 449:  * In order to preserve exactly once semantics a sink must be idempotent in the face of
             :  * multiple attempts to add the same batch.
             :  *
             :  * Insert ignore support (KUDU-1563) would be useful, but while that doesn't exist
             :  * using upsert will work. Delete ignore would also be useful.
We chatted about this offline. I think it'd be helpful to add some context 
about what Spark is doing; that would clarify why we don't need the batchId and 
how users should think about the KuduSink options (especially `operationType`). 
My attempt at a revised class-level doc:

"Sinks provide at-least-once semantics by retrying failed batches, and provide 
a `batchId` interface to implement exactly-once semantics. Since Kudu does not 
internally track batch IDs, the `batchId` is ignored, and it is up to the user 
to specify an appropriate `operationType` to achieve the desired semantics when 
adding batches. The default `Upsert` allows KuduSink to handle duplicate data 
and such retries.

Insert ignore support (KUDU-1563) would be useful, but while that doesn't 
exist, using Upsert will work. Delete ignore would also be useful."
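
To make the retry argument concrete, here is a rough sketch of why replaying a 
batch is harmless under Upsert but not under Insert. It is a toy in-memory 
model, not Kudu or Spark code: `ReplayDemo`, `ToyTable`, `applyBatch`, and the 
`Op` values are hypothetical names made up for illustration, and the 
"row error" is simply the rejected key.

```scala
import scala.collection.mutable

// Toy model only: none of these names come from Kudu or Spark.
object ReplayDemo {
  sealed trait Op
  case object Insert extends Op
  case object Upsert extends Op

  // Hypothetical stand-in for a table keyed by primary key.
  final class ToyTable {
    private val rows = mutable.Map.empty[Int, String]

    // Applies one batch and returns the keys rejected as duplicates.
    def applyBatch(batch: Seq[(Int, String)], op: Op): Seq[Int] =
      batch.flatMap { case (key, value) =>
        op match {
          case Insert if rows.contains(key) =>
            Some(key) // Insert on an existing key is rejected
          case _ =>
            rows.update(key, value) // new key, or Upsert overwriting a key
            None
        }
      }

    def snapshot: Map[Int, String] = rows.toMap
  }

  val batch: Seq[(Int, String)] = Seq(1 -> "a", 2 -> "b")

  // Upsert: the same batch is applied twice, as a retry would do.
  private val upsertTable = new ToyTable
  upsertTable.applyBatch(batch, Upsert)
  val afterFirst: Map[Int, String] = upsertTable.snapshot
  val retryErrors: Seq[Int] = upsertTable.applyBatch(batch, Upsert)
  val afterRetry: Map[Int, String] = upsertTable.snapshot

  // Insert: the replayed batch collides with the rows already written.
  private val insertTable = new ToyTable
  insertTable.applyBatch(batch, Insert)
  val insertRetryErrors: Seq[Int] = insertTable.applyBatch(batch, Insert)

  def main(args: Array[String]): Unit = {
    println(s"upsert retry errors: $retryErrors")
    println(s"table unchanged after upsert retry: ${afterRetry == afterFirst}")
    println(s"insert retry errors: $insertRetryErrors")
  }
}
```

In this model the Upsert replay leaves the table exactly as it was, which is 
the idempotence the doc comment relies on, while the Insert replay reports 
every key as a duplicate.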



--
To view, visit http://gerrit.cloudera.org:8080/12087
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 4
Gerrit-Owner: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Hao Hao <hao....@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Comment-Date: Wed, 09 Jan 2019 23:55:37 +0000
Gerrit-HasComments: Yes
