[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Grant Henke has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11199 ) Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Supporting Spark streaming DataFrame in KuduContext. KUDU-2539: Supporting Spark streaming DataFrame in KuduContext. This solution follows the way how other sinks ie. KafkaSink is implemented, for details see https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L87 Where on the DataFrame a queryExecution.toRdd.foreachPartition is called to access the InternalRows which mapped to Rows by Catalyst converters. Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Reviewed-on: http://gerrit.cloudera.org:8080/11199 Tested-by: Kudu Jenkins Reviewed-by: Attila Bukor Reviewed-by: Grant Henke --- M java/gradle/dependencies.gradle M java/kudu-spark/build.gradle M java/kudu-spark/pom.xml M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala A java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala 5 files changed, 126 insertions(+), 6 deletions(-) Approvals: Kudu Jenkins: Verified Attila Bukor: Looks good to me, but someone else must approve Grant Henke: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 6 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Attila Piros Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/11199 ) Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Patch Set 5: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 5 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Attila Piros Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Wed, 26 Sep 2018 15:45:18 + Gerrit-HasComments: No
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Attila Piros has posted comments on this change. ( http://gerrit.cloudera.org:8080/11199 ) Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Patch Set 5: > Patch Set 5: > > (1 comment) In case of structured streaming foreachPartition on a streaming DataFrame is an unsupported operation. You can see this error running the new test without my KuduContext changes: > Task :kudu-spark:test org.apache.kudu.spark.kudu.StreamingTest > testKuduContextWithSparkStreaming FAILED org.apache.spark.sql.streaming.StreamingQueryException: Queries with streaming sources must be executed with writeStream.start();; LocalRelation [value#16] === Streaming Query === Identifier: [id = f00274e4-469e-43fd-8d9a-3e42f81125a3, runId = 35f66191-e282-4112-b226-26cd2b0c6fae] Current Committed Offsets: {} Current Available Offsets: {MemoryStream[value#1]: 0} Current State: ACTIVE Thread State: RUNNABLE Logical Plan: Project [_1#7 AS key#10, _2#8 AS val#11] +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#7, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false) AS _2#8] +- MapElements , int, [StructField(value,IntegerType,false)], obj#6: scala.Tuple2 +- DeserializeToObject assertnotnull(cast(value#1 as int)), obj#5: int +- StreamingExecutionRelation MemoryStream[value#1], [value#1] Caused by: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();; LocalRelation [value#16] 1 test completed, 1 failed > Task :kudu-spark:test FAILED ~~~ So this solution uses the same method as was used for KafkaSink: using foreachPartition on RDD-level. Basically it is a workaround. -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 5 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Attila Piros Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Wed, 26 Sep 2018 00:42:12 + Gerrit-HasComments: No
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/11199 ) Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/11199/5/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala: http://gerrit.cloudera.org:8080/#/c/11199/5/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@305 PS5, Line 305: data.queryExecution.toRdd.foreachPartition(iterator => { I don't fully understand why this is needed. Especially given that we convert back from and InternalRow to a Row below in writePartitionRows. My Spark internal background isn't the strongest. Could you explain what this conversion fixes? -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 5 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Attila Piros Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Tue, 25 Sep 2018 18:42:12 + Gerrit-HasComments: Yes
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Attila Piros has posted comments on this change. ( http://gerrit.cloudera.org:8080/11199 ) Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Patch Set 5: Gentle ping. Is there any new comments? -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 5 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Attila Piros Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Wed, 12 Sep 2018 14:09:14 + Gerrit-HasComments: No
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/11199 ) Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Patch Set 5: Code-Review+1 looks good to me, I'll let Grant review it as well. -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 5 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Attila Piros Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Wed, 29 Aug 2018 14:22:32 + Gerrit-HasComments: No
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Hello Kudu Jenkins, Grant Henke, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/11199 to look at the new patch set (#5). Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Supporting Spark streaming DataFrame in KuduContext. KUDU-2539: Supporting Spark streaming DataFrame in KuduContext. This solution follows the way how other sinks ie. KafkaSink is implemented, for details see https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L87 Where on the DataFrame a queryExecution.toRdd.foreachPartition is called to access the InternalRows which mapped to Rows by Catalyst converters. Change-Id: Iead04539d3514920a5d6803c34715e5686124572 --- M java/gradle/dependencies.gradle M java/kudu-spark/build.gradle M java/kudu-spark/pom.xml M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala A java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala 5 files changed, 126 insertions(+), 6 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/11199/5 -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 5 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Attila Piros Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Attila Piros has posted comments on this change. ( http://gerrit.cloudera.org:8080/11199 ) Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Patch Set 4: > Patch Set 4: > > > Could you add tests to demonstrate and prove the functionality this > > change is supporting? Also to ensure it doesn't break with future > > changes. > So I have tried: https://gist.github.com/attilapiros/3b5ef42c0f7aa08b0e2c834fbadfc574 But there is an unexpected problem StreamTest is a Scalatest but KuduTestSuite is a JUnitSuite and they cannot be mixed: ``` [ERROR] /home/systest/kudu-public/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingSuite.scala:27: error: illegal inheritance; superclass QueryTest [ERROR] is not a subclass of the superclass JUnitSuite [ERROR] of the mixin trait KuduTestSuite [ERROR] class StreamingSuite extends StreamTest with KuduTestSuite { [ERROR] ^ ``` Moreover runtime both would create a spark session :( As StreamTest is quite complex the easier solution would be to remove the "with KuduTestSuite" and create the Kudu cluster and simple test table here. Or using composition instead of inheritance via a new class extending KudutestSuite and call before and after methods accordingly. In both cases here would be a scalatest. Is there any reason why JUnitSuite is used here? What is your suggestion? -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 4 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Attila Piros Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Tue, 14 Aug 2018 15:22:20 + Gerrit-HasComments: No
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Attila Piros has posted comments on this change. ( http://gerrit.cloudera.org:8080/11199 ) Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Patch Set 4: > Could you add tests to demonstrate and prove the functionality this > change is supporting? Also to ensure it doesn't break with future > changes. Unfortunately no idea how. If you check my kudu custom sink repo: > Could you add tests to demonstrate and prove the functionality this > change is supporting? Also to ensure it doesn't break with future > changes. Unfortunately it is not easy: here there is custom sink using the KuduContext in streaming: https://github.com/attilapiros/kudu_custom_sink but in our case we would need the Spark streaming test in memory solution too. -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 4 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Attila Piros Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Mon, 13 Aug 2018 17:01:05 + Gerrit-HasComments: No
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/11199 ) Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Patch Set 1: Could you add tests to demonstrate and prove the functionality this change is supporting? Also to ensure it doesn't break with future changes. -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 1 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Mon, 13 Aug 2018 15:26:16 + Gerrit-HasComments: No
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/11199 to look at the new patch set (#4). Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Supporting Spark streaming DataFrame in KuduContext. KUDU-2539: Supporting Spark streaming DataFrame in KuduContext. This solution follows the way how other sinks ie. KafkaSink is implemented, for details see https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L87 Where on the DataFrame a queryExecution.toRdd.foreachPartition is called to access the InternalRows which mapped to Rows by Catalyst converters. Change-Id: Iead04539d3514920a5d6803c34715e5686124572 --- M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala 1 file changed, 9 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/11199/4 -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 4 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/11199 to look at the new patch set (#2). Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Supporting Spark streaming DataFrame in KuduContext. KUDU-2539: Supporting Spark streaming DataFrame in KuduContext. This solution follows the way how other sinks ie. KafkaSink is implemented, see https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L87 Where on the DataFrame a queryExecution.toRdd.foreachPartition is called to access the InternalRows. Change-Id: Iead04539d3514920a5d6803c34715e5686124572 --- M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala 1 file changed, 13 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/11199/2 -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 2 Gerrit-Owner: Attila Piros Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.
Attila Piros has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11199 Change subject: Supporting Spark streaming DataFrame in KuduContext. .. Supporting Spark streaming DataFrame in KuduContext. KUDU-2539. Change-Id: Iead04539d3514920a5d6803c34715e5686124572 --- M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala 1 file changed, 13 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/11199/1 -- To view, visit http://gerrit.cloudera.org:8080/11199 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572 Gerrit-Change-Number: 11199 Gerrit-PatchSet: 1 Gerrit-Owner: Attila Piros