Re: ClassLoader problem - java.io.InvalidClassException: scala.Option; local class incompatible

2017-02-20 Thread Kohki Nishio
I created a JIRA; I believe SBT is a valid use case, but it was resolved as
Not a Problem:

https://issues.apache.org/jira/browse/SPARK-19675


On Mon, Feb 20, 2017 at 10:36 PM, Kohki Nishio  wrote:

> Hello, I'm writing a Play Framework application which uses Spark; however,
> I'm getting the exception below:
>
> java.io.InvalidClassException: scala.Option; local class incompatible:
> stream classdesc serialVersionUID = -114498752079829388, local class
> serialVersionUID = 5081326844987135632
> at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>
> It's because I'm launching the application via SBT, and sbt-launch.jar
> contains the Scala 2.10 library, while my Spark binary is built for Scala
> 2.11; that's why I'm getting this. I believe ExecutorClassLoader needs to
> override the loadClass method as well; can anyone comment on this? It's
> picking up the Option class from the system classloader.
>
> Thanks
> --
> Kohki Nishio
>



-- 
Kohki Nishio


ClassLoader problem - java.io.InvalidClassException: scala.Option; local class incompatible

2017-02-20 Thread Kohki Nishio
Hello, I'm writing a Play Framework application which uses Spark; however,
I'm getting the exception below:

java.io.InvalidClassException: scala.Option; local class incompatible:
stream classdesc serialVersionUID = -114498752079829388, local class
serialVersionUID = 5081326844987135632
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)

It's because I'm launching the application via SBT, and sbt-launch.jar
contains the Scala 2.10 library, while my Spark binary is built for Scala
2.11; that's why I'm getting this. I believe ExecutorClassLoader needs to
override the loadClass method as well; can anyone comment on this? It's
picking up the Option class from the system classloader.
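
A minimal diagnostic and possible-workaround sketch, assuming an sbt 0.13-style
build and that sc is the application's SparkContext (getCodeSource can be null
for bootstrap-loaded classes; the forked run is only a possible workaround, not
a confirmed fix):

// Diagnostic: print which jar scala.Option is actually loaded from, on the
// driver and, via a trivial job, on the executors.
println(classOf[Option[_]].getProtectionDomain.getCodeSource.getLocation)
sc.parallelize(Seq(1)).foreach { _ =>
  println(classOf[Option[_]].getProtectionDomain.getCodeSource.getLocation)
}

// build.sbt sketch: fork the run so the application JVM does not inherit the
// Scala 2.10 classes that sbt-launch.jar puts on the system classloader.
scalaVersion := "2.11.8"
fork in run := true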

Thanks
-- 
Kohki Nishio


[SparkSQL] pre-check syntex before running spark job?

2017-02-20 Thread Linyuxin
Hi All,
Is there any tool/API to check SQL syntax without actually running a Spark
job?

Like SiddhiQL on Storm: SiddhiManagerService.validateExecutionPlan
(https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/main/java/org/wso2/siddhi/core/SiddhiManagerService.java)
can validate the syntax before running the SQL on Storm.

This is very useful for exposing a SQL string as a DSL of the platform.
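
One approach that may cover this in Spark 2.x: spark.sql() parses and analyzes
the statement when the DataFrame is built, but no job runs until an action is
invoked, so a small helper can catch ParseException/AnalysisException without
executing anything. A sketch (note that unknown tables surface as analysis
errors rather than pure syntax errors):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.parser.ParseException

val spark = SparkSession.builder().appName("SqlPreCheck").getOrCreate()

// Returns None if the statement parses and analyzes, or an error message.
// Building the DataFrame triggers parsing/analysis but launches no Spark job.
def validateSql(sql: String): Option[String] =
  try {
    spark.sql(sql).queryExecution.analyzed   // force analysis, no execution
    None
  } catch {
    case e: ParseException    => Some("Syntax error: " + e.getMessage)
    case e: AnalysisException => Some("Analysis error: " + e.getMessage)
  }

validateSql("SELEC * FROM t").foreach(println)   // reports a syntax error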




Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Cody Koeninger
So there's no reason to use checkpointing at all, right?  Eliminate
that as a possible source of problems.

Probably unrelated, but this also isn't a very good way to benchmark:
Kafka producers are thread-safe, so there's no reason to create one for
each partition.
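
For reference, a sketch of what reusing a single reporter per executor JVM
could look like, assuming KafkaReporter is (or wraps) a thread-safe Kafka
producer; it is created lazily once per JVM and shared across partitions and
batches:

// Lazily create one KafkaReporter per executor JVM and reuse it, instead of
// constructing a new producer inside every foreachPartition call.
object ReporterHolder {
  @volatile private var reporter: KafkaReporter = _
  def get(topic: String, brokers: String): KafkaReporter = {
    if (reporter == null) synchronized {
      if (reporter == null) reporter = new KafkaReporter(topic, brokers)
    }
    reporter
  }
}

lines.foreachRDD(rdd => rdd.foreachPartition { partLines =>
  val reporter = ReporterHolder.get(reportTopic, brokerList)  // reused
  partLines.foreach { case (inTime, _) =>
    reporter.report(inTime, System.currentTimeMillis())
  }
})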

On Mon, Feb 20, 2017 at 4:43 PM, Muhammad Haseeb Javed
<11besemja...@seecs.edu.pk> wrote:
> This is the code that I have been trying, and it is giving me this error. No
> complicated operations are being performed on the topics as far as I can see.
>
> class Identity() extends BenchBase {
>
>   override def process(lines: DStream[(Long, String)], config: SparkBenchConfig): Unit = {
>     val reportTopic = config.reporterTopic
>     val brokerList = config.brokerList
>
>     lines.foreachRDD(rdd => rdd.foreachPartition( partLines => {
>       val reporter = new KafkaReporter(reportTopic, brokerList)
>       partLines.foreach { case (inTime, content) =>
>         val outTime = System.currentTimeMillis()
>         reporter.report(inTime, outTime)
>         if (config.debugMode) {
>           println("Event: " + inTime + ", " + outTime)
>         }
>       }
>     }))
>   }
> }
>
>
> On Mon, Feb 20, 2017 at 3:10 PM, Cody Koeninger  wrote:
>>
>> That's an indication that the beginning offset for a given batch is
>> higher than the ending offset, i.e. something is seriously wrong.
>>
>> Are you doing anything at all odd with topics, i.e. deleting and
>> recreating them, using compacted topics, etc?
>>
>> Start off with a very basic stream over the same kafka topic that just
>> does foreach println or similar, with no checkpointing at all, and get
>> that working first.
>>
>> On Mon, Feb 20, 2017 at 12:10 PM, Muhammad Haseeb Javed
>> <11besemja...@seecs.edu.pk> wrote:
>> > Update: I am using Spark 2.0.2 and  Kafka 0.8.2 with Scala 2.10
>> >
>> > On Mon, Feb 20, 2017 at 1:06 PM, Muhammad Haseeb Javed
>> > <11besemja...@seecs.edu.pk> wrote:
>> >>
>> >> I am a PhD student at Ohio State working on a study to evaluate streaming
>> >> frameworks (Spark Streaming, Storm, Flink) using the Intel HiBench
>> >> benchmarks, but I think I am having a problem with Spark. I have a Spark
>> >> Streaming application which I am trying to run on a 5-node cluster
>> >> (including the master). I have 2 ZooKeeper nodes and 4 Kafka brokers.
>> >> However, whenever I run a Spark Streaming application I encounter the
>> >> following error:
>> >>
>> >> java.lang.IllegalArgumentException: requirement failed: numRecords must not be negative
>> >> at scala.Predef$.require(Predef.scala:224)
>> >> at org.apache.spark.streaming.scheduler.StreamInputInfo.<init>(InputInfoTracker.scala:38)
>> >> at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
>> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>> >> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>> >> at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>> >> at scala.Option.orElse(Option.scala:289)
>> >>
>> >> The application starts fine, but as soon as the Kafka producers start
>> >> emitting the stream data I start receiving the aforementioned error
>> >> repeatedly.
>> >>
>> >> I have tried removing the Spark Streaming checkpoint files, as has been
>> >> suggested in similar posts on the internet. However, the problem persists
>> >> even if I start a Kafka topic and its corresponding consumer Spark
>> >> Streaming application for the first time. The problem also cannot be
>> >> offset related, as I start the topic for the first time.
>> >>
>> >> Although the application seems to be processing the stream properly, as I
>> >> can see from the benchmark numbers generated, the numbers are way off
>> >> from what I got for Storm and Flink, leading me to believe that there is
>> >> something wrong with the pipeline and Spark is not able to process the
>> >> stream as cleanly as it should. Any help in this regard would be really
>> >> appreciated.
>> >
>> >
>
>


Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Muhammad Haseeb Javed
This is the code that I have been trying, and it is giving me this error. No
complicated operations are being performed on the topics as far as I can see.

class Identity() extends BenchBase {

  override def process(lines: DStream[(Long, String)], config: SparkBenchConfig): Unit = {
    val reportTopic = config.reporterTopic
    val brokerList = config.brokerList

    lines.foreachRDD(rdd => rdd.foreachPartition( partLines => {
      val reporter = new KafkaReporter(reportTopic, brokerList)
      partLines.foreach { case (inTime, content) =>
        val outTime = System.currentTimeMillis()
        reporter.report(inTime, outTime)
        if (config.debugMode) {
          println("Event: " + inTime + ", " + outTime)
        }
      }
    }))
  }
}

On Mon, Feb 20, 2017 at 3:10 PM, Cody Koeninger  wrote:

> That's an indication that the beginning offset for a given batch is
> higher than the ending offset, i.e. something is seriously wrong.
>
> Are you doing anything at all odd with topics, i.e. deleting and
> recreating them, using compacted topics, etc?
>
> Start off with a very basic stream over the same kafka topic that just
> does foreach println or similar, with no checkpointing at all, and get
> that working first.
>
> On Mon, Feb 20, 2017 at 12:10 PM, Muhammad Haseeb Javed
> <11besemja...@seecs.edu.pk> wrote:
> > Update: I am using Spark 2.0.2 and  Kafka 0.8.2 with Scala 2.10
> >
> > On Mon, Feb 20, 2017 at 1:06 PM, Muhammad Haseeb Javed
> > <11besemja...@seecs.edu.pk> wrote:
> >>
> >> I am a PhD student at Ohio State working on a study to evaluate streaming
> >> frameworks (Spark Streaming, Storm, Flink) using the Intel HiBench
> >> benchmarks, but I think I am having a problem with Spark. I have a Spark
> >> Streaming application which I am trying to run on a 5-node cluster
> >> (including the master). I have 2 ZooKeeper nodes and 4 Kafka brokers.
> >> However, whenever I run a Spark Streaming application I encounter the
> >> following error:
> >>
> >> java.lang.IllegalArgumentException: requirement failed: numRecords must not be negative
> >> at scala.Predef$.require(Predef.scala:224)
> >> at org.apache.spark.streaming.scheduler.StreamInputInfo.<init>(InputInfoTracker.scala:38)
> >> at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> >> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> >> at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> >> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> >> at scala.Option.orElse(Option.scala:289)
> >>
> >> The application starts fine, but as soon as the Kafka producers start
> >> emitting the stream data I start receiving the aforementioned error
> >> repeatedly.
> >>
> >> I have tried removing the Spark Streaming checkpoint files, as has been
> >> suggested in similar posts on the internet. However, the problem persists
> >> even if I start a Kafka topic and its corresponding consumer Spark
> >> Streaming application for the first time. The problem also cannot be
> >> offset related, as I start the topic for the first time.
> >>
> >> Although the application seems to be processing the stream properly, as I
> >> can see from the benchmark numbers generated, the numbers are way off
> >> from what I got for Storm and Flink, leading me to believe that there is
> >> something wrong with the pipeline and Spark is not able to process the
> >> stream as cleanly as it should. Any help in this regard would be really
> >> appreciated.
> >
> >
>


Fwd: Will Spark ever run the same task at the same time

2017-02-20 Thread Mark Hamstra
First, the word you are looking for is "straggler", not "strangler" -- very
different words. Second, "idempotent" doesn't mean "only happens once", but
rather "if it does happen more than once, the effect is no different than
if it only happened once".

It is possible to insert a nearly limitless variety of side-effecting code
into Spark Tasks, and there is no guarantee from Spark that such code will
execute idempotently. Speculation is one way that a Task can run more than
once, but it is not the only way. A simple FetchFailure (from a lost
Executor or another reason) will mean that a Task has to be re-run in order
to re-compute the missing outputs from a prior execution. In general, Spark
will run a Task as many times as needed to satisfy the requirements of the
Jobs it is requested to fulfill, and you can assume neither that a Task
will run only once nor that it will execute idempotently (unless, of
course, it is side-effect free). Guaranteeing idempotency requires a higher
level coordinator with access to information on all Task executions. The
OutputCommitCoordinator handles that guarantee for HDFS writes, and the
JIRA discussion associated with the introduction of
the OutputCommitCoordinator covers most of the design issues:
https://issues.apache.org/jira/browse/SPARK-4879
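
As an illustration of the difference (a sketch; paths and data are
placeholders): the first write below bypasses any commit coordination, so a
re-executed task simply appends its partition again, while the second goes
through the output commit machinery, so only one attempt per partition gets
its output promoted.

import java.nio.file.{Files, Paths, StandardOpenOption}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("IdempotencySketch").getOrCreate()
val sc = spark.sparkContext
val data = sc.parallelize(1 to 100, 4)

// Non-idempotent side effect: appends directly from the task. If the task is
// re-run (speculation, fetch failure, lost executor), the same records are
// appended a second time and nothing deduplicates them.
data.foreachPartition { part =>
  val bytes = (part.mkString("\n") + "\n").getBytes("UTF-8")
  Files.write(Paths.get("/tmp/side-effect.log"), bytes,
    StandardOpenOption.CREATE, StandardOpenOption.APPEND)
}

// Committed output: task attempts write to temporary locations and the
// OutputCommitCoordinator ensures only one attempt per partition is promoted
// into the final directory.
data.saveAsTextFile("/tmp/committed-output")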

On Thu, Feb 16, 2017 at 10:34 AM, Ji Yan  wrote:

> Dear spark users,
>
> Is there any mechanism in Spark that does not guarantee the idempotent
> nature? For example, for stranglers, the framework might start another task
> assuming the strangler is slow while the strangler is still running. This
> would be annoying sometime when say the task is writing to a file, but have
> the same tasks running at the same time may corrupt the file. From the
> documentation page, I know that Spark's speculative execution mode is
> turned off by default. Does anyone know any other mechanism in Spark that
> may cause problem in scenario like this?
>
> Thanks
> Ji
>
> The information in this email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful.
>


Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Cody Koeninger
That's an indication that the beginning offset for a given batch is
higher than the ending offset, i.e. something is seriously wrong.

Are you doing anything at all odd with topics, i.e. deleting and
recreating them, using compacted topics, etc?

Start off with a very basic stream over the same kafka topic that just
does foreach println or similar, with no checkpointing at all, and get
that working first.
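
Something along these lines is a minimal version of that sanity check (a
sketch, assuming Spark 2.0.2 with the spark-streaming-kafka-0-8 connector
mentioned in this thread; broker list and topic name are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// No checkpointing, no windowing: just count what arrives in each batch.
val conf = new SparkConf().setAppName("KafkaSanityCheck")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("identity")   // placeholder topic name

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd => println(s"batch size = ${rdd.count()}") }

ssc.start()
ssc.awaitTermination()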

On Mon, Feb 20, 2017 at 12:10 PM, Muhammad Haseeb Javed
<11besemja...@seecs.edu.pk> wrote:
> Update: I am using Spark 2.0.2 and  Kafka 0.8.2 with Scala 2.10
>
> On Mon, Feb 20, 2017 at 1:06 PM, Muhammad Haseeb Javed
> <11besemja...@seecs.edu.pk> wrote:
>>
>> I am a PhD student at Ohio State working on a study to evaluate streaming
>> frameworks (Spark Streaming, Storm, Flink) using the Intel HiBench
>> benchmarks, but I think I am having a problem with Spark. I have a Spark
>> Streaming application which I am trying to run on a 5-node cluster
>> (including the master). I have 2 ZooKeeper nodes and 4 Kafka brokers.
>> However, whenever I run a Spark Streaming application I encounter the
>> following error:
>>
>> java.lang.IllegalArgumentException: requirement failed: numRecords must not be negative
>> at scala.Predef$.require(Predef.scala:224)
>> at org.apache.spark.streaming.scheduler.StreamInputInfo.<init>(InputInfoTracker.scala:38)
>> at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
>> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>> at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>> at scala.Option.orElse(Option.scala:289)
>>
>> The application starts fine, but as soon as the Kafka producers start
>> emitting the stream data I start receiving the aforementioned error
>> repeatedly.
>>
>> I have tried removing the Spark Streaming checkpoint files, as has been
>> suggested in similar posts on the internet. However, the problem persists
>> even if I start a Kafka topic and its corresponding consumer Spark
>> Streaming application for the first time. The problem also cannot be
>> offset related, as I start the topic for the first time.
>>
>> Although the application seems to be processing the stream properly, as I
>> can see from the benchmark numbers generated, the numbers are way off
>> from what I got for Storm and Flink, leading me to believe that there is
>> something wrong with the pipeline and Spark is not able to process the
>> stream as cleanly as it should. Any help in this regard would be really
>> appreciated.
>
>




Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Muhammad Haseeb Javed
Update: I am using Spark 2.0.2 and  Kafka 0.8.2 with Scala 2.10

On Mon, Feb 20, 2017 at 1:06 PM, Muhammad Haseeb Javed <
11besemja...@seecs.edu.pk> wrote:

> I am a PhD student at Ohio State working on a study to evaluate streaming
> frameworks (Spark Streaming, Storm, Flink) using the Intel HiBench
> benchmarks, but I think I am having a problem with Spark. I have a Spark
> Streaming application which I am trying to run on a 5-node cluster
> (including the master). I have 2 ZooKeeper nodes and 4 Kafka brokers.
> However, whenever I run a Spark Streaming application I encounter the
> following error:
>
> java.lang.IllegalArgumentException: requirement failed: numRecords must not be negative
> at scala.Predef$.require(Predef.scala:224)
> at org.apache.spark.streaming.scheduler.StreamInputInfo.<init>(InputInfoTracker.scala:38)
> at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> at scala.Option.orElse(Option.scala:289)
>
> The application starts fine, but as soon as the Kafka producers start
> emitting the stream data I start receiving the aforementioned error
> repeatedly.
>
> I have tried removing the Spark Streaming checkpoint files, as has been
> suggested in similar posts on the internet. However, the problem persists
> even if I start a Kafka topic and its corresponding consumer Spark
> Streaming application for the first time. The problem also cannot be
> offset related, as I start the topic for the first time.
>
> Although the application seems to be processing the stream properly, as I
> can see from the benchmark numbers generated, the numbers are way off
> from what I got for Storm and Flink, leading me to believe that there is
> something wrong with the pipeline and Spark is not able to process the
> stream as cleanly as it should. Any help in this regard would be really
> appreciated.
>


Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Muhammad Haseeb Javed
I am a PhD student at Ohio State working on a study to evaluate streaming
frameworks (Spark Streaming, Storm, Flink) using the Intel HiBench
benchmarks, but I think I am having a problem with Spark. I have a Spark
Streaming application which I am trying to run on a 5-node cluster
(including the master). I have 2 ZooKeeper nodes and 4 Kafka brokers.
However, whenever I run a Spark Streaming application I encounter the
following error:

java.lang.IllegalArgumentException: requirement failed: numRecords must not be negative
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.streaming.scheduler.StreamInputInfo.<init>(InputInfoTracker.scala:38)
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:289)

The application starts fine, but as soon as the Kafka producers start
emitting the stream data I start receiving the aforementioned error
repeatedly.

I have tried removing the Spark Streaming checkpoint files, as has been
suggested in similar posts on the internet. However, the problem persists
even if I start a Kafka topic and its corresponding consumer Spark
Streaming application for the first time. The problem also cannot be
offset related, as I start the topic for the first time.

Although the application seems to be processing the stream properly, as I
can see from the benchmark numbers generated, the numbers are way off
from what I got for Storm and Flink, leading me to believe that there is
something wrong with the pipeline and Spark is not able to process the
stream as cleanly as it should. Any help in this regard would be really
appreciated.


Re: Will Spark ever run the same task at the same time

2017-02-20 Thread Steve Loughran

> On 16 Feb 2017, at 18:34, Ji Yan  wrote:
> 
> Dear spark users,
> 
> Is there any mechanism in Spark that does not guarantee the idempotent 
> nature? For example, for stranglers, the framework might start another task 
> assuming the strangler is slow while the strangler is still running. This 
> would be annoying sometime when say the task is writing to a file, but have 
> the same tasks running at the same time may corrupt the file. From the 
> documentation page, I know that Spark's speculative execution mode is turned 
> off by default. Does anyone know any other mechanism in Spark that may cause 
> problem in scenario like this?

It's not so much "two tasks writing to the same file" as "two tasks writing to
different places with the work renamed into place at the end".

Speculation is the key case when there's >1 writer, though they do write to
different directories; the Spark commit protocol guarantees that only the
committed task gets its work into the final output.

Some failure modes *may* have >1 executor running the same work, right up to
the point where the task commit operation is started. More specifically, a
network partition may cause the executor to lose touch with the driver, and the
driver to pass the same task on to another executor, while the existing
executor keeps going. It's when that first executor tries to commit the data
that you get a guarantee that the work doesn't get committed (no connectivity
=> no commit, connectivity resumed => driver will tell the executor it's been
aborted).

If you are working with files outside of the tasks' working directory, then the
outcome of failure will be "undefined". The FileCommitProtocol lets you ask
for a temp file which is rename()d to the destination in the commit. Use this
and the files will only appear once the task is committed. Even there, there is
a small but non-zero chance that the commit may fail partway through, in which
case the outcome is, as they say, "undefined". Avoid that today by not manually
adding custom partitions to data sources in your hive metastore.

Steve







Re: Basic Grouping Question

2017-02-20 Thread ayan guha
Hi

Once you specify the aggregates on the groupBy function (I am assuming you
mean DataFrames here?), both the grouping and the aggregation work in a
distributed fashion (you may want to look into how reduceByKey and/or
aggregateByKey work).
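
If the per-'code' logic goes beyond the built-in aggregates, the typed
groupByKey/mapGroups API may also fit. A sketch, assuming Spark 2.x and a
placeholder input path:

import org.apache.spark.sql.SparkSession

case class Reading(code: String, timestamp: String, value: Long)

val spark = SparkSession.builder().appName("GroupingSketch").getOrCreate()
import spark.implicits._

val readings = spark.read.parquet("/path/to/data").as[Reading]  // placeholder path

// Built-in aggregates on groupBy run distributed, much like reduceByKey:
val sums = readings.groupBy($"code").sum("value")

// Arbitrary per-group ("map-reduce per code") logic: groupByKey hands each
// group to mapGroups as an iterator.
val perCode = readings.groupByKey(_.code).mapGroups { (code, rows) =>
  (code, rows.map(_.value).max)   // example reduction: max value per code
}

sums.show()
perCode.show()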

On Mon, Feb 20, 2017 at 10:23 PM, Marco Mans  wrote:

> Hi!
>
> I'm new to Spark and trying to write my first spark job on some data I
> have.
> The data is in this (parquet) format:
>
> Code,timestamp, value
> A, 2017-01-01, 123
> A, 2017-01-02, 124
> A, 2017-01-03, 126
> B, 2017-01-01, 127
> B, 2017-01-02, 126
> B, 2017-01-03, 123
>
> I want to write a little map-reduce application that must be run on each
> 'code'.
> So I would need to group the data on the 'code' column and then execute
> the map and the reduce steps on each code; 2 times in this example, A and B.
>
> But when I group the data (the groupBy function), it returns a
> RelationalGroupedDataset. On this I cannot apply the map and reduce functions.
>
> I have the feeling that I am running in the wrong direction. Does anyone
> know how to approach this? (I hope I explained it right, so it can be
> understood :))
>
> Regards,
> Marco
>



-- 
Best Regards,
Ayan Guha


Message loss in streaming even with graceful shutdown

2017-02-20 Thread Noorul Islam K M

Hi all,

I have a streaming application with a batch interval of 10 seconds.

val sparkConf = new SparkConf().setAppName("RMQWordCount")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(sparkConf, Seconds(10))

I also use the reduceByKeyAndWindow() API for aggregation at a window
interval of 5 minutes.

But when I send a SIGTERM to the streaming process at around the 4th minute,
I don't see the reduceByKeyAndWindow() action taking place, even though the
data for those 4 minutes has already been read. I thought a graceful shutdown
would trigger the action on the received messages.

Am I missing something?
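
For reference, a sketch of an explicit, application-driven graceful stop (the
marker-file path is a placeholder). One thing to note: a graceful stop only
drains batches that have already been generated, and a reduceByKeyAndWindow
result with a 5-minute slide is only produced at the window boundary, so a
shutdown at around the 4th minute likely means that window's aggregation never
runs at all.

import java.nio.file.{Files, Paths}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("RMQWordCount")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// ... build the DStream pipeline with reduceByKeyAndWindow here ...

ssc.start()
// Poll for a marker file and stop gracefully from the application itself,
// instead of relying only on SIGTERM; awaitTerminationOrTimeout returns
// false on timeout and true once the context has stopped.
while (!ssc.awaitTerminationOrTimeout(10000)) {
  if (Files.exists(Paths.get("/tmp/stop-streaming"))) {
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}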

Thanks and regards
Noorul




Basic Grouping Question

2017-02-20 Thread Marco Mans
Hi!

I'm new to Spark and trying to write my first spark job on some data I have.
The data is in this (parquet) format:

Code,timestamp, value
A, 2017-01-01, 123
A, 2017-01-02, 124
A, 2017-01-03, 126
B, 2017-01-01, 127
B, 2017-01-02, 126
B, 2017-01-03, 123

I want to write a little map-reduce application that must be run on each
'code'.
So I would need to group the data on the 'code' column and then execute the
map and the reduce steps on each code; 2 times in this example, A and B.

But when I group the data (the groupBy function), it returns a
RelationalGroupedDataset. On this I cannot apply the map and reduce functions.

I have the feeling that I am running in the wrong direction. Does anyone
know how to approach this? (I hope I explained it right, so it can be
understood :))

Regards,
Marco


Spark streaming on AWS EC2 error . Please help

2017-02-20 Thread shyla deshpande
I am running Spark Streaming on AWS EC2 in standalone mode.

When I do a spark-submit, I get the following exception. I am subscribing to
3 Kafka topics, and it is reading and processing just 2 of them. It works
fine in local mode.
Appreciate your help. Thanks

Exception in thread "pool-26-thread-132" java.lang.NullPointerException
at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:225)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Re: Spark Worker can't find jar submitted programmatically

2017-02-20 Thread Cosmin Posteuca
Hi Zoran,

I think you are looking for the --jars parameter/argument to spark-submit.

When using spark-submit, the application jar along with any jars included
> with the --jars option will be automatically transferred to the cluster.
> URLs supplied after --jars must be separated by commas. (
> http://spark.apache.org/docs/latest/submitting-applications.html)


I don't know if this works in standalone mode, but it works for me in yarn mode.
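
If the goal is to submit programmatically rather than via the spark-submit
script, the SparkLauncher API that ships with Spark may be another route; it
drives spark-submit under the hood, so the application jar is distributed the
same way. A sketch (the main class name is a placeholder, and SPARK_HOME is
assumed to be set or provided via setSparkHome):

import org.apache.spark.launcher.SparkLauncher

// Submit the packaged application jar to the standalone master; SparkLauncher
// invokes spark-submit, which serves the jar to the workers.
val launcher = new SparkLauncher()
  .setMaster("spark://zoran-Latitude-E5420:7077")
  .setAppResource("/samplesparkmaven/target/sample-spark-maven-one-jar.jar")
  .setMainClass("com.example.SampleApp")        // placeholder main class
  .setConf("spark.executor.memory", "2g")

val process = launcher.launch()   // returns a java.lang.Process
process.waitFor()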

Thanks,
Cosmin

2017-02-17 2:46 GMT+02:00 jeremycod :

> Hi, I'm trying to create an application that would programmatically submit a
> jar file to a Spark standalone cluster running on my local PC. However, I'm
> always getting the error:
>
> WARN TaskSetManager:66 - Lost task 1.0 in stage 0.0 (TID 1, 192.168.2.68,
> executor 0): java.lang.RuntimeException: Stream
> '/jars/sample-spark-maven-one-jar.jar' was not found.
>
> I'm creating the SparkContext in the following way:
>
> val sparkConf = new SparkConf()
> sparkConf.setMaster("spark://zoran-Latitude-E5420:7077")
> sparkConf.set("spark.cores_max","2")
> sparkConf.set("spark.executor.memory","2g")
> sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> sparkConf.setAppName("Test application")
> sparkConf.set("spark.ui.port","4041")
> sparkConf.set("spark.local.ip","192.168.2.68")
> val oneJar="/samplesparkmaven/target/sample-spark-maven-one-jar.jar"
> sparkConf.setJars(List(oneJar))
> val sc = new SparkContext(sparkConf)
>
> I'm using Spark 2.1.0 in standalone mode with a master and one worker. Does
> anyone have an idea where the problem might be or how to investigate it
> further?
>
> Thanks,
> Zoran