Re: spark streaming with checkpoint

2015-01-25 Thread Balakrishnan Narendran
Yeah, use streaming to gather the incoming logs and write them to a log file,
then run a Spark job every 5 minutes to process the counts. Got it. Thanks a
lot.

On 07:07, Mon, 26 Jan 2015 Tobias Pfeiffer  wrote:

> Hi,
>
> On Tue, Jan 20, 2015 at 8:16 PM, balu.naren  wrote:
>
>> I am a beginner to Spark Streaming, so I have a basic doubt regarding
>> checkpoints. My use case is to calculate the number of unique users per
>> day. I am using reduce by key and window for this, where the window
>> duration is 24 hours and the slide duration is 5 minutes.
>>
> Adding to what others said, this feels more like a task for "run a Spark
> job every five minutes using cron" than using the sliding window
> functionality from Spark Streaming.
>
> Tobias
>


Re: spark streaming with checkpoint

2015-01-25 Thread Tobias Pfeiffer
Hi,

On Tue, Jan 20, 2015 at 8:16 PM, balu.naren  wrote:

> I am a beginner to Spark Streaming, so I have a basic doubt regarding
> checkpoints. My use case is to calculate the number of unique users per
> day. I am using reduce by key and window for this, where the window
> duration is 24 hours and the slide duration is 5 minutes.
>
Adding to what others said, this feels more like a task for "run a Spark
job every five minutes using cron" than using the sliding window
functionality from Spark Streaming.

Tobias
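
For illustration, a minimal sketch of the batch-job approach described above,
in Scala. The date-partitioned HDFS path, the comma-separated log format, and
the position of the user id are assumptions, not details from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Run once per interval (e.g. from cron), passing the day to process.
    object DailyUniqueUsers {
      def main(args: Array[String]): Unit = {
        val day = args(0)                                // e.g. "2015-01-20"
        val sc  = new SparkContext(new SparkConf().setAppName("daily-unique-users"))

        val unique = sc.textFile(s"hdfs:///logs/$day/*") // assumed log location
          .map(_.split(",")(1))                          // assumed: user id is the second field
          .distinct()
          .count()

        println(s"$day unique users: $unique")
        sc.stop()
      }
    }

Each run recomputes the day's total from the raw logs, so no streaming state
has to be kept in memory between runs.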


RE: spark streaming with checkpoint

2015-01-22 Thread Shao, Saisai
Hi,

A new RDD will be created in each slide duration; if there is no data coming 
in, an empty RDD will be generated.

I'm not sure there is a way to alleviate your problem from the Spark side. Does 
your application design really have to build such a large window, or can you 
change your implementation if that is easy for you?

I think it is better and easier for you to change your implementation than to 
rely on Spark to handle this.

Thanks
Jerry
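
As one hypothetical way to change the implementation, the 24-hour window could 
be replaced by per-day state that each 5-minute batch folds into, e.g. via 
updateStateByKey. The (day, userId) input format and the Set-based state below 
are assumptions, meant only as a sketch:

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations (Spark 1.x)
    import org.apache.spark.streaming.dstream.DStream

    // events: (day, userId) pairs, e.g. ("2015-01-20", "user42")
    def uniquesPerDay(events: DStream[(String, String)]): DStream[(String, Int)] = {
      val update: (Seq[String], Option[Set[String]]) => Option[Set[String]] =
        (ids, state) => Some(state.getOrElse(Set.empty[String]) ++ ids)  // accumulate ids seen today

      events.updateStateByKey(update)   // requires ssc.checkpoint(dir) to be set
            .mapValues(_.size)          // (day, unique users seen so far)
    }

The state still grows with the number of distinct ids per day, which is where 
the HyperLogLog idea suggested elsewhere in this thread comes in.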

From: Balakrishnan Narendran [mailto:balu.na...@gmail.com]
Sent: Friday, January 23, 2015 12:19 AM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: spark streaming with checkpoint

Thank you Jerry,
   Does the window operation create new RDDs for each slide duration? I am 
asking this because I see a constant increase in memory even when no logs are 
received.
If not checkpointing, is there any alternative that you would suggest?


On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai <saisai.s...@intel.com> wrote:
Hi,

Since you have such a large window (24 hours), the memory increase is expected, 
because the window operation caches the RDDs within the window in memory. So 
for your requirement, there should be enough memory to hold 24 hours of data.

I don't think checkpointing in Spark Streaming can alleviate this problem, 
because checkpoints are mainly for fault tolerance.

Thanks
Jerry

From: balu.naren [mailto:balu.na...@gmail.com]
Sent: Tuesday, January 20, 2015 7:17 PM
To: user@spark.apache.org
Subject: spark streaming with checkpoint


I am a beginner to Spark Streaming, so I have a basic doubt regarding 
checkpoints. My use case is to calculate the number of unique users per day. I 
am using reduce by key and window for this, where the window duration is 24 
hours and the slide duration is 5 minutes. I am writing the processed record to 
MongoDB; currently I replace the existing record each time. But I see that 
memory slowly increases over time and the process is killed after about one and 
a half hours (on an AWS small instance). After a restart, the DB write clears 
all the old data. So I understand checkpointing is the solution for this. But 
my doubts are:

  *   What should my checkpoint duration be? The documentation says 5-10 times 
the slide duration, but I need the data for the entire day. Is it OK to keep it 
at 24 hours?
  *   Where ideally should the checkpoint be? When I first receive the stream, 
just before the window operation, or after the data reduction has taken place?

Appreciate your help.
Thank you


View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html
Sent from the Apache Spark User List mailing list archive 
(http://apache-spark-user-list.1001560.n3.nabble.com/) at Nabble.com.



Re: spark streaming with checkpoint

2015-01-22 Thread Jörn Franke
Maybe you are using the wrong approach - try something like HyperLogLog or
bitmap structures, as you can find them, for instance, in Redis. They are much
smaller.
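
A rough sketch of that idea, assuming a DStream of (day, userId) pairs and a
Redis instance reachable from the executors through the Jedis client; the host
and key naming are made up for illustration. Redis keeps each HyperLogLog at
roughly 12 KB regardless of how many ids are added:

    import org.apache.spark.streaming.dstream.DStream
    import redis.clients.jedis.Jedis

    def trackUniques(events: DStream[(String, String)]): Unit =
      events.foreachRDD { rdd =>
        rdd.foreachPartition { part =>
          val jedis = new Jedis("redis-host")        // assumed Redis host
          part.foreach { case (day, userId) =>
            jedis.pfadd(s"uniques:$day", userId)     // PFADD folds the id into the day's HLL
          }
          jedis.close()
        }
      }

    // Approximate daily count: new Jedis("redis-host").pfcount("uniques:2015-01-20")
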
On 22 Jan 2015 at 17:19, "Balakrishnan Narendran"  wrote:

> Thank you Jerry,
>    Does the window operation create new RDDs for each slide
> duration? I am asking this because I see a constant increase in memory
> even when no logs are received.
> If not checkpointing, is there any alternative that you would suggest?
>
>
> On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai 
> wrote:
>
>>  Hi,
>>
>>
>>
>> Since you have such a large window (24 hours), the memory increase is
>> expected, because the window operation caches the RDDs within the window
>> in memory. So for your requirement, there should be enough memory to hold
>> 24 hours of data.
>>
>>
>>
>> I don't think checkpointing in Spark Streaming can alleviate this problem,
>> because checkpoints are mainly for fault tolerance.
>>
>>
>>
>> Thanks
>>
>> Jerry
>>
>>
>>
>> *From:* balu.naren [mailto:balu.na...@gmail.com]
>> *Sent:* Tuesday, January 20, 2015 7:17 PM
>> *To:* user@spark.apache.org
>> *Subject:* spark streaming with checkpoint
>>
>>
>>
>> I am a beginner to Spark Streaming, so I have a basic doubt regarding
>> checkpoints. My use case is to calculate the number of unique users per
>> day. I am using reduce by key and window for this, where the window
>> duration is 24 hours and the slide duration is 5 minutes. I am writing the
>> processed record to MongoDB; currently I replace the existing record each
>> time. But I see that memory slowly increases over time and the process is
>> killed after about one and a half hours (on an AWS small instance). After
>> a restart, the DB write clears all the old data. So I understand
>> checkpointing is the solution for this. But my doubts are:
>>
>>    - What should my checkpoint duration be? The documentation says 5-10
>>    times the slide duration, but I need the data for the entire day. Is it
>>    OK to keep it at 24 hours?
>>    - Where ideally should the checkpoint be? When I first receive the
>>    stream, just before the window operation, or after the data reduction
>>    has taken place?
>>
>>
>> Appreciate your help.
>> Thank you
>>
>
>


Re: spark streaming with checkpoint

2015-01-22 Thread Balakrishnan Narendran
Thank you Jerry,
   Does the window operation create new RDDs for each slide duration?
I am asking this because I see a constant increase in memory even when
no logs are received.
If not checkpointing, is there any alternative that you would suggest?


On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai  wrote:

>  Hi,
>
>
>
> Since you have such a large window (24 hours), the memory increase is
> expected, because the window operation caches the RDDs within the window
> in memory. So for your requirement, there should be enough memory to hold
> 24 hours of data.
>
>
>
> I don't think checkpointing in Spark Streaming can alleviate this problem,
> because checkpoints are mainly for fault tolerance.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* balu.naren [mailto:balu.na...@gmail.com]
> *Sent:* Tuesday, January 20, 2015 7:17 PM
> *To:* user@spark.apache.org
> *Subject:* spark streaming with checkpoint
>
>
>
> I am a beginner to Spark Streaming, so I have a basic doubt regarding
> checkpoints. My use case is to calculate the number of unique users per
> day. I am using reduce by key and window for this, where the window
> duration is 24 hours and the slide duration is 5 minutes. I am writing the
> processed record to MongoDB; currently I replace the existing record each
> time. But I see that memory slowly increases over time and the process is
> killed after about one and a half hours (on an AWS small instance). After
> a restart, the DB write clears all the old data. So I understand
> checkpointing is the solution for this. But my doubts are:
>
>    - What should my checkpoint duration be? The documentation says 5-10
>    times the slide duration, but I need the data for the entire day. Is it
>    OK to keep it at 24 hours?
>    - Where ideally should the checkpoint be? When I first receive the
>    stream, just before the window operation, or after the data reduction
>    has taken place?
>
>
> Appreciate your help.
> Thank you
>


RE: spark streaming with checkpoint

2015-01-20 Thread Shao, Saisai
Hi,

Since you have such a large window (24 hours), the memory increase is expected, 
because the window operation caches the RDDs within the window in memory. So 
for your requirement, there should be enough memory to hold 24 hours of data.

I don't think checkpointing in Spark Streaming can alleviate this problem, 
because checkpoints are mainly for fault tolerance.

Thanks
Jerry

From: balu.naren [mailto:balu.na...@gmail.com]
Sent: Tuesday, January 20, 2015 7:17 PM
To: user@spark.apache.org
Subject: spark streaming with checkpoint


I am a beginner to Spark Streaming, so I have a basic doubt regarding 
checkpoints. My use case is to calculate the number of unique users per day. I 
am using reduce by key and window for this, where the window duration is 24 
hours and the slide duration is 5 minutes. I am writing the processed record to 
MongoDB; currently I replace the existing record each time. But I see that 
memory slowly increases over time and the process is killed after about one and 
a half hours (on an AWS small instance). After a restart, the DB write clears 
all the old data. So I understand checkpointing is the solution for this. But 
my doubts are:

  *   What should my checkpoint duration be? The documentation says 5-10 times 
the slide duration, but I need the data for the entire day. Is it OK to keep it 
at 24 hours?
  *   Where ideally should the checkpoint be? When I first receive the stream, 
just before the window operation, or after the data reduction has taken place?

Appreciate your help.
Thank you
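
For reference, a minimal sketch of the kind of job described above: a 24-hour 
window sliding every 5 minutes over (userId, 1) pairs. The source, batch 
interval, and field layout are assumptions; as noted in the reply, the window's 
contents are cached in memory, and the checkpoint directory provides fault 
tolerance rather than memory relief:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations (Spark 1.x)
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("unique-users-window")
    val ssc  = new StreamingContext(conf, Seconds(10))      // assumed batch interval
    ssc.checkpoint("hdfs:///checkpoints/unique-users")      // fault tolerance, not memory relief

    val uniques = ssc.socketTextStream("logs-host", 9999)   // placeholder source
      .map(line => (line.split(",")(1), 1))                 // assumed: user id is the second field
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(24 * 60), Minutes(5))
      .count()                                              // distinct users in the last 24 hours

    uniques.print()
    ssc.start()
    ssc.awaitTermination()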

