Hi,
Is it possible to limit the size of the batches returned by the Kafka consumer
for Spark Streaming?
I am asking because the first batch I get has hundreds of millions of records
and it takes ages to process and checkpoint them.
Thank you.
Samy
That's nice and all, but I'd rather have a solution involving mapWithState
of course :) I'm just wondering why it doesn't support this use case yet.
On Tue, Oct 11, 2016 at 3:41 PM, Cody Koeninger wrote:
> They're telling you not to use the old function because it's linear
http://spark.apache.org/docs/latest/configuration.html
"This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see
below)."
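For reference, a minimal sketch of how that setting might look in practice (the numbers and app name here are illustrative assumptions, not recommendations):

```scala
import org.apache.spark.SparkConf

// Illustrative sketch: cap each Kafka partition at 10,000 records per second.
// With a 10s batch interval and, say, 8 partitions, the first batch would
// then hold at most 10,000 * 10 * 8 = 800,000 records instead of everything
// that has accumulated on the topic.
val conf = new SparkConf()
  .setAppName("rate-limited-stream") // hypothetical app name
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
  // Optionally let backpressure tune the rate after the first batch:
  .set("spark.streaming.backpressure.enabled", "true")
```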
On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane wrote:
> Hi,
>
> Is it
What are you expecting? If you want to update every key on every
batch, it's going to be linear on the number of keys... there's no
real way around that.
On Tue, Oct 11, 2016 at 9:49 AM, Daan Debie wrote:
> That's nice and all, but I'd rather have a solution involving
Hi all,
So I have a model that is stored in S3, and I have a Scala webapp
which, for certain requests, loads the model and transforms submitted data
against it.
I'm not sure how to run this quickly on a single instance though. At the
moment Spark is being bundled up with the web app in an
Just out of curiosity, have you tried using separate group ids for the
separate streams?
On Oct 11, 2016 4:46 AM, "Matthias Niehoff"
wrote:
> I stripped down the job to just consume the stream and print it, without
> avro deserialization. When I only consume one
They're telling you not to use the old function because it's linear on the
total number of keys, not keys in the batch, so it's slow.
But if that's what you really want, go ahead and do it, and see if it
performs well enough.
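For what it's worth, a rough sketch of the difference being discussed, assuming a keyed DStream of String/Int pairs (the names here are illustrative):

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}

// mapWithState only invokes the function for keys that appear in the current
// batch (plus keys that time out), which is why it scales with batch size.
// updateStateByKey, by contrast, touches every tracked key on every batch --
// the linear cost described above.
val spec = StateSpec.function {
  (key: String, value: Option[Int], state: State[Int]) =>
    val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum)
    (key, sum)
}.timeout(Minutes(30)) // expire idle keys instead of rescanning them

val stateful = keyedStream.mapWithState(spec) // keyedStream is assumed
// stateSnapshots() still exposes the full key/state table each batch:
val allState = stateful.stateSnapshots()
```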
On Oct 11, 2016 6:28 AM, "DandyDev" wrote:
Hi
Hello team. I found and resolved the issue. In case someone runs into the
same problem, this was the cause:
>> Each node was allocated 1397 MB of memory for storage.
16/10/11 13:16:58 INFO storage.MemoryStore: MemoryStore started with
capacity 1397.3 MB
>> However, my RDD exceeded the storage
Similarly a Java DStream has a dstream method you can call to get the
underlying dstream.
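In other words, CanCommitOffsets is implemented by the underlying Scala stream, not by the Java wrapper, so from Java the cast should go through inputDStream(), i.e. ((CanCommitOffsets) directKafkaStream.inputDStream()).commitAsync(offsets). A sketch of the equivalent Scala flow, assuming the new (kafka010) direct stream API:

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

directKafkaStream.foreachRDD { rdd =>
  // Grab the offset ranges before any transformation changes the RDD type.
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // commitAsync stores the offsets back to Kafka once the batch's work is done.
  directKafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsets)
}
```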
On Oct 11, 2016 2:54 AM, "static-max" wrote:
> Hi Cody,
>
> thanks, rdd.rdd() did the trick. I now have the offsets via
> OffsetRange[] offsets = ((HasOffsetRanges)
I re-ran the job with DEBUG Log Level for org.apache.spark, kafka.consumer
and org.apache.kafka. Please find the output here:
http://pastebin.com/VgtRUQcB
most of the delay is introduced by *16/10/11 13:20:12 DEBUG RecurringTimer:
Callback for JobGenerator called at time x*, which repeats
To explain further: in our server, when it is started, we also
start a Spark cluster. Our server is an OSGi environment, and we call
the SqlContext.SqlContext().stop() method to stop the master in the OSGi
deactivate method.
On Tue, Oct 11, 2016 at 3:27 PM, Gimantha Bandara
Hi there,
I've built a Spark Streaming app that accepts certain events from Kafka, and
I want to keep some state between the events. So I've successfully used
mapWithState for that. The problem is, that I want the state for keys to be
updated on every batchInterval, because "lack" of events is
good point, I changed the group id to be unique for the separate streams
and now it works. Thanks!
Although changing this is OK for us, I am interested in the why :-) With
the old connector this was not a problem, nor is it, AFAIK, with the pure
Kafka consumer API.
2016-10-11 14:30 GMT+02:00 Cody
The new underlying Kafka consumer prefetches data and is generally heavier
weight, so it is cached on executors. Group id is part of the cache key. I
assumed Kafka users would use different group ids for consumers they wanted
to be distinct, since otherwise it would cause problems even with the
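A sketch of what that looks like, with the broker address and topic names as placeholder assumptions:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// One parameter map per stream, differing only in group.id, so the cached
// consumers on the executors do not collide.
def kafkaParams(groupId: String): Map[String, Object] = Map(
  "bootstrap.servers" -> "broker:9092", // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> groupId)

val streamA = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  Subscribe[String, String](Seq("topicA"), kafkaParams("job-topicA")))
val streamB = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  Subscribe[String, String](Seq("topicB"), kafkaParams("job-topicB")))
```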
Hi,
Currently, I am running Spark using the standalone scheduler with 3
machines in our cluster. For these three machines, one runs Spark Master
and the other two run Spark Worker.
We run a machine learning application on this small-scale testbed. A
particular stage in my application is divided
I stripped down the job to just consume the stream and print it, without
avro deserialization. When I only consume one topic, everything is fine. As
soon as I add a second stream, it gets stuck after about 5 minutes. So it
basically boils down to:
val kafkaParams = Map[String, String](
Hi all,
When we forcefully shut down the server, we intermittently observe the
following exception. What could be the reason? Please note that this
exception is printed in the logs multiple times and stops our server from
shutting down as soon as we trigger a forceful shutdown. What could be the
That's a good point about shuffle data compression. Still, it would be good
to benchmark the ideas behind https://github.com/apache/spark/pull/12761 I
think.
For many datasets, even within one partition the gradient sums etc can
remain very sparse. For example Criteo DAC data is extremely sparse
@Song, I have called an action, but it did not cache, as you can see in the
screenshot provided in my original email. It has cached to disk but not to
memory.
@Chin Wei Low, I have 15GB memory allocated which is more than the dataset
size.
Any other suggestion please?
Kind regards,
Guru
On 11
You should have just tried it and let us know what your experience had
been! Anyway, after spending long hours on this problem I realized it is
actually a classloader problem.
If you use spark-submit this exception should go away, but you haven't told
us how you are submitting the job such that
I am getting this error sometimes when I run pyspark with Spark 2.0.0
App > at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
App > at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
App > at
I don't believe it will ever scale to spin up a whole distributed job to
serve one request. You could possibly look at the bits in mllib-local. You
might do well to export as something like PMML, either with Spark's export
or JPMML, and then load it into a web container and score it, without Spark
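For models that support it, the export half of that suggestion can be quite short. A sketch, assuming a spark.mllib model type that mixes in PMMLExportable (e.g. KMeansModel or LinearRegressionModel; not all model types do):

```scala
// Sketch: only some spark.mllib models implement PMMLExportable.
val pmmlXml: String = model.toPMML() // PMML document as an XML string
model.toPMML("/tmp/model.pmml")      // or write it straight to a path
// The PMML file can then be loaded and scored in a plain web container,
// e.g. with a JPMML evaluator, without any Spark dependency.
```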
All,
We have a use case with 2 Spark Streaming jobs in the same EMR cluster.
I am thinking of allowing multiple streaming contexts and running them as 2
separate spark-submits with wait-for-app-completion set to false.
With this, the failure detection and monitoring seem obscure and don't
seem
I tried wrapping my Tuple class (which is generated by Avro) in a
class that implements Serializable, but now I'm getting a
ClassNotFoundException in my Spark application. The exception is
thrown while trying to deserialize checkpoint state:
Hi,
I have a problem that's bothering me for a few days, and I'm pretty out of
ideas.
I built a Spark docker container where Spark runs in standalone mode. Both
master and worker are started there.
Now I tried to deploy my Spark Scala app in a separate container (same
machine) where I pass the
There are some known issues in 1.6.0, e.g.,
https://issues.apache.org/jira/browse/SPARK-12591
Could you try 1.6.1?
On Tue, Oct 11, 2016 at 9:55 AM, Joey Echeverria wrote:
> I tried wrapping my Tuple class (which is generated by Avro) in a
> class that implements Serializable,
Hey, I just found a promo code for Spark Summit Europe that saves 20%. It’s
"Summit16" - I love Brussels and just registered! Who’s coming with me to
get their Spark on?!
Cheers,
Andrew
I have a question about how to use textFileStream; I have included a code
snippet below. I am trying to read .gz files that are being put into my
bucket. I do not want to specify the schema; I have a similar feature that
just does spark.read.json(inputBucket). This works great, and if I can get
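One hedged way to combine textFileStream with per-batch schema inference, with the bucket path and batch interval as placeholders (textFileStream should decompress .gz files transparently via the underlying Hadoop input format):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))
// Watches the directory for newly arrived files; .gz is handled by the codec.
val lines = ssc.textFileStream("s3a://inputBucket/") // placeholder path
lines.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    // Infer the schema per batch, analogous to spark.read.json(inputBucket).
    val df = sqlContext.read.json(rdd)
    df.show()
  }
}
ssc.start()
```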
Look here
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
Probably it will help a bit.
Best regards,
Denis
On 11 Oct 2016 at 23:49, "Xiaoye Sun"
wrote:
> Hi,
>
> Currently, I am running Spark using the
Try to build a flat (uber) jar which includes all dependencies.
On 11 Oct 2016 at 22:11, "doruchiulan"
wrote:
> Hi,
>
> I have a problem that's bothering me for a few days, and I'm pretty out of
> ideas.
>
> I built a Spark docker container where Spark runs in
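A sketch of one way to build such a flat jar with sbt-assembly (plugin and Spark versions here are assumptions; match them to your setup), marking Spark as provided so the cluster's own copy is used at runtime:

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"

// Collapse duplicate files pulled in by transitive dependencies.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```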
Hi Mich,
you can also create a DataFrame from an RDD as follows:
val df = sqlContext.createDataFrame(rdd,schema)
val df = sqlContext.createDataFrame(rdd)
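A minimal illustration of the two variants above (column names and values are made up):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Variant 1: an RDD of Rows plus an explicit schema.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)))
val rowRdd = sc.parallelize(Seq(Row("ann", 32), Row("bob", 41)))
val df1 = sqlContext.createDataFrame(rowRdd, schema)

// Variant 2: an RDD of case classes, schema inferred by reflection.
case class Person(name: String, age: Int)
val caseRdd = sc.parallelize(Seq(Person("ann", 32), Person("bob", 41)))
val df2 = sqlContext.createDataFrame(caseRdd)
```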
The below article also may help you :
http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
On 11 October 2016 at
This Job will fail after about 5 minutes:
object SparkJobMinimal {
//read Avro schemas
var stream = getClass.getResourceAsStream("/avro/AdRequestLog.avsc")
val avroSchemaAdRequest =
scala.io.Source.fromInputStream(stream).getLines.mkString
stream.close
stream =
Hi,
I am currently running this code from my IDE (Eclipse). I tried adding the
scope "provided" to the dependency without any effect. Should I build this
and submit it using the spark-submit command?
Thanks
Vaibhav
On 11 October 2016 at 04:36, Jakob Odersky wrote:
> Just
Hi Cody,
thanks, rdd.rdd() did the trick. I now have the offsets via
OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
But how can you commit the offset to Kafka?
Casting the JavaInputDStream throws a CCE:
CanCommitOffsets cco = (CanCommitOffsets) directKafkaStream; //
Hi Selvam,
Is your 35GB Parquet file split up into multiple S3 objects or just one big
Parquet file?
If it's just one big file, then I believe only one executor will be able to
work on it until some job action partitions the data into smaller chunks.
On 11 October 2016 at 06:03, Selvam Raman
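If the file really does arrive as a single partition, one hedged workaround is to repartition early, before the expensive stages (the path and partition count below are guesses):

```scala
val df = sqlContext.read.parquet("s3a://bucket/big-file.parquet") // placeholder
// Spread the data across the cluster before heavy processing; 200 is a guess,
// tune it to roughly (total size / desired partition size).
val spread = df.repartition(200)
```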