etienne commented on SPARK-17606:

I'm not able to reproduce this in local mode, either because the JobGenerator
managed to restart, or because the streaming job hit an OOM before the
JobGenerator restarted.

After taking a heap dump, it appears the memory is full of MapPartitionsRDD instances.

Here are the results of my attempts.
Batch interval: 500 ms; actual batch duration > 2 s (I had to reduce the batch
interval to generate batches faster).

My procedure was simple: start the streaming application without a checkpoint,
wait until a checkpoint has been written, stop it, wait for a while, and
restart from the checkpoint.

||stopping time||restarting time||batches during down time||batches pending||batches to reschedule||start of JobGenerator||last time*||
|14:23:01|14:33:29|1320 - [14:22:30-14:33:29.500]|88 - [14:22:00-14:22:44]|1379 - [14:22:00.500-14:33:29.500]|14:55:00 for time | |
|15:08:25|15:31:20|2777 - [15:08:18.500-15:31:26.500]|22 - [15:08:11-15:08:21.500]|2792 - [15:08:11-15:31:26.500]|OOM| |
|09:30:01|09:47:01|2338 - [09:27:33-09:47:01.500]|298 - [09:26:03-09:28:31.500]|2518 - [09:26:03-09:47:01.500]|OOM| |
|12:52:49|12:46:01|1838 - [12:49:11.500-13:04:30]|116 - [12:49:18.500-12:50:16]|1838 - [12:45:11.500-13:04:30]|OOM| |

\* last time to reschedule found in the log that was executed before the
JobGenerator restarted (strangely, those 8 minutes are not missing in the UI)

All these OOMs make me think something is not being cleaned up correctly.
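
As a side note, one thing I may try to reduce the memory pressure while the backlog is replayed is to cap the ingest rate of the direct Kafka stream. This is only a sketch using settings that exist in 1.6.1 (the rate value is a guess, not something I have validated), and it would not fix the missing cleanup itself:

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: cap how much data each rescheduled batch can pull, so that the
// ~2000+ recreated batches do not all load data at full speed at once.
val conf = new SparkConf()
  .setAppName("SPARK-17606-repro")
  // Direct Kafka API: max messages per partition per batch (placeholder value)
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Let the built-in rate controller throttle once batches start completing
  .set("spark.streaming.backpressure.enabled", "true")
{code}

Even with this, the JobGenerator would still have to reschedule every missed batch, so at best it softens the OOM symptom.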

The JobGenerator is not started immediately after the restart (I looked into
the source and could not find what blocks it), which may add extra lag to the
batch scheduling.

> New batches are not created when there are 1000 created after restarting 
> streaming from checkpoint.
> ---------------------------------------------------------------------------------------------------
>                 Key: SPARK-17606
>                 URL: https://issues.apache.org/jira/browse/SPARK-17606
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.6.1
>            Reporter: etienne
> When Spark restarts from a checkpoint after being down for a while, it
> recreates the batches missed during the down time.
> When there are only a few missing batches, Spark keeps creating a new
> incoming batch every batchTime, but when the down time is long enough to
> create 1000 batches, no new batch is created.
> So once all these batches are completed the stream is idle ...
> I think there is a hard limit set somewhere.
> I was expecting Spark to keep recreating the missed batches, maybe not all
> at once (because it looks like that causes driver memory problems), and then
> create new batches every batchTime.
> Another solution would be to not create the missing batches but still
> restart the direct input.
> Right now the only way I have to restart a stream after a long break is to
> remove the checkpoint so that a new stream is created, but that loses all my
> state.
> PS: I'm speaking about the direct Kafka input because it's the source I'm
> currently using; I don't know what happens with other sources.
