I have been doing some testing with aggregations, but I seem to hit a wall on
two issues.
Example:
val avg = areaStateDf.groupBy($"plantKey").avg("sensor")

1) How can I use the result from an aggregation within the same stream, to do
further calculations?
2) It seems to be very slow when I want a moving window of 24 hours with a
moving average over some calculations within it, at least when testing locally.
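
For reference, the kind of query I am aiming at (a rough sketch, untested; it
assumes the stream has an event-time timestamp column, here called eventTime,
plus the plantKey/sensor columns from the example above):

import spark.implicits._
import org.apache.spark.sql.functions.{avg, window}

// 24-hour window sliding every hour, averaged per plant. The result is
// itself a streaming DataFrame, so further calculations can be chained
// directly onto it, which is what question 1 is about.
val windowedAvg = areaStateDf
  .withWatermark("eventTime", "24 hours")
  .groupBy(window($"eventTime", "24 hours", "1 hour"), $"plantKey")
  .agg(avg($"sensor").as("avgSensor"))

// hypothetical follow-up calculation chained on the aggregate
val scaled = windowedAvg.withColumn("doubled", $"avgSensor" * 2)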

The Accumulator issue:
Simple Counter Accumulator:
object Test {
  // note: the SparkSession is created in the object's static initializer
  private val spark = SparkHelper.getSparkSession()
  import spark.implicits._
  import com.datastax.spark.connector._

  val counter = spark.sparkContext.longAccumulator("counter")

  // UDF that bumps the accumulator and returns its current value
  val fetchData = () => {
    counter.add(2)
    counter.value
  }

  val fetchdataUDF = spark.sqlContext.udf.register("testUDF", fetchData)

  def calculate(areaStateDf: DataFrame): StreamingQuery = {
    val ds = areaStateDf.select($"areaKey").withColumn("fetchedData", fetchdataUDF())
    KafkaSinks.debugStream(ds, "volumTest")
  }
}
I would like to create a custom accumulator that includes the smoothing
algorithm, but I can't even get a plain counter working.
This works locally, but on the server running in Docker (one master and one
worker) it throws this error:

18/05/16 08:35:22 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 172.17.0.5, executor 0): java.lang.ExceptionInInitializerError
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:24)
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:15)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:910)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:910)
at com.cambi.assurance.spark.SparkHelper$.getSparkSession(SparkHelper.scala:28)
at com.client.spark.calculations.Test$.<init>(ThpLoad1.scala:10)
at com.client.spark.calculations.Test$.<clinit>(ThpLoad1.scala)
... 18 more

18/05/16 08:35:22 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 3.0 (TID 4, 172.17.0.5, executor 0): java.lang.NoClassDefFoundError: Could not initialize class com.client.spark.calculations.Test$
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:24)
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:15)
[remaining frames identical to the trace above]
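
My suspicion is that the UDF closure pulls the whole Test$ object onto the
executor, so its static initializer runs there and SparkHelper tries to build
a second SparkSession with no master URL configured. A restructuring I have
been considering (untested, just a sketch) keeps the session and the
accumulator out of any object initializer that the closure captures:

import org.apache.spark.sql.{DataFrame, SparkSession}

object TestRestructured {
  def calculate(spark: SparkSession, areaStateDf: DataFrame): DataFrame = {
    import spark.implicits._
    // created on the driver; the UDF closes over the accumulator alone,
    // not over an object whose initializer builds a SparkSession
    val counter = spark.sparkContext.longAccumulator("counter")
    val fetchDataUDF = spark.udf.register("testUDF", () => {
      counter.add(2)
      counter.sum // on an executor this only sees the task-local partial sum
    })
    areaStateDf.select($"areaKey").withColumn("fetchedData", fetchDataUDF())
  }
}

Even if that fixes the initialization, accumulators are effectively write-only
from the executors' point of view, so reading one inside a task never sees the
other tasks' updates, which probably rules them out for the smoothing state
anyway.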


Any ideas about how to handle this error?


        Thanks,
        Martin Engen
________________________________
From: Lalwani, Jayesh <jayesh.lalw...@capitalone.com>
Sent: Tuesday, May 15, 2018 9:59 PM
To: Martin Engen; user@spark.apache.org
Subject: Re: Structured Streaming, Reading and Updating a variable


Do you have a code sample, and detailed error message/exception to show?



From: Martin Engen <martin.en...@outlook.com>
Date: Tuesday, May 15, 2018 at 9:24 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Structured Streaming, Reading and Updating a variable



Hello,



I'm working with Structured Streaming, and I need a method of keeping a
running average based on the last 24 hours of data.

To help with this, I can use exponential smoothing, which means I really only
need to carry one value from the previous calculation into the next, and
update this variable as the calculations go on.
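
Concretely, the recurrence I have in mind is (alpha being a smoothing factor
between 0 and 1):

    s_t = alpha * x_t + (1 - alpha) * s_(t-1)

so only the previous smoothed value s_(t-1) has to be kept between
calculations.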



Implementing this is a much bigger challenge than I ever imagined.





I've tried using accumulators, and also querying/storing data in Cassandra
after every calculation. Both methods worked somewhat locally, but I don't
seem to be able to use them on the Spark worker nodes, as I get the error

"java.lang.NoClassDefFoundError: Could not initialize class"

for both the accumulator and the Cassandra connector library.



How can you read/update a variable while doing calculations using Structured 
Streaming?
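
One direction I have been looking at is mapGroupsWithState, which keeps
per-key state between triggers, though I am not sure it is the right tool.
A rough sketch of the smoothing with it (untested; the Reading and Smoothed
case classes, the column names, and alpha are my own placeholders):

import org.apache.spark.sql.streaming.GroupState

case class Reading(plantKey: String, sensor: Double)
case class Smoothed(plantKey: String, value: Double)

val alpha = 0.1 // placeholder smoothing factor

// readings: Dataset[Reading] parsed from the stream (needs import spark.implicits._)
val smoothed = readings
  .groupByKey(_.plantKey)
  .mapGroupsWithState { (key: String, rows: Iterator[Reading], state: GroupState[Double]) =>
    // seed the state with the first observation, then fold the rest in
    var s = state.getOption.getOrElse(rows.next().sensor)
    rows.foreach(r => s = alpha * r.sensor + (1 - alpha) * s)
    state.update(s)
    Smoothed(key, s)
  }
// the resulting query then has to run in Update output mode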



Thank you




