from:"Balakrishnan Narendran"

Re: spark streaming with checkpoint

2015-01-25 Thread Balakrishnan Narendran

Yeah use streaming to gather the incoming logs and write to log file then
run a spark job evry 5 minutes to process the counts. Got it. Thanks a
lot.

On 07:07, Mon, 26 Jan 2015 Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Tue, Jan 20, 2015 at 8:16 PM, balu.naren balu.na...@gmail.com wrote:

 I am a beginner to spark streaming. So have a basic doubt regarding
 checkpoints. My use case is to calculate the no of unique users by day. I
 am using reduce by key and window for this. Where my window duration is 24
 hours and slide duration is 5 mins.

 Adding to what others said, this feels more like a task for run a Spark
 job every five minutes using cron than using the sliding window
 functionality from Spark Streaming.

 Tobias

Re: spark streaming with checkpoint

2015-01-25 Thread Balakrishnan Narendran



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: spark streaming with checkpoint

2015-01-22 Thread Balakrishnan Narendran

Thank you Jerry,
Does the window operation create new RDDs for each slide duration..?
I am asking this because i see a constant increase in memory even when
there is no logs received.
If not checkpoint is there any alternative that you would suggest.?

On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai saisai.s...@intel.com wrote:

Hi,

Seems you have such a large window (24 hours), so the phenomena of memory
increasing is expectable, because of window operation will cache the RDD
within this window in memory. So for your requirement, memory should be
enough to hold the data of 24 hours.

I don’t think checkpoint in Spark Streaming can alleviate such problem,
because checkpoint are mainly for fault tolerance.

Thanks

Jerry

*From:* balu.naren [mailto:balu.na...@gmail.com]
*Sent:* Tuesday, January 20, 2015 7:17 PM
*To:* user@spark.apache.org
*Subject:* spark streaming with checkpoint

I am a beginner to spark streaming. So have a basic doubt regarding
checkpoints. My use case is to calculate the no of unique users by day. I
am using reduce by key and window for this. Where my window duration is 24
hours and slide duration is 5 mins. I am updating the processed record to
mongodb. Currently I am replace the existing record each time. But I see
the memory is slowly increasing over time and kills the process after 1 and
1/2 hours(in aws small instance). The DB write after the restart clears all
the old data. So I understand checkpoint is the solution for this. But my
doubt is

- What should my check point duration be..? As per documentation it
says 5-10 times of slide duration. But I need the data of entire day. So it
is ok to keep 24 hrs.
- Where ideally should the checkpoint be..? Initially when I receive
the stream or just before the window operation or after the data reduction
has taken place.

Appreciate your help.
Thank you
--

View this message in context: spark streaming with checkpoint
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html
Sent from the Apache Spark User List mailing list archive
http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.

Re: spark streaming with checkpoint

Re: spark streaming with checkpoint

Re: spark streaming with checkpoint

3 matches

Site Navigation

Mail list logo

Footer information