Look at org.apache.spark.streaming.scheduler.JobGenerator: it has a RecurringTimer (timer) that simply posts GenerateJobs events to an EventLoop at each batchInterval.
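A toy, self-contained model of that single-threaded servicing (the event and handler names are illustrative, not the real JobGenerator internals): if a slow event is queued ahead of GenerateJobs, the recorded submission time lags the batch time by roughly the time the earlier event held the thread.

```scala
import java.util.concurrent.LinkedBlockingQueue

sealed trait Event
case class GenerateJobs(batchTime: Long) extends Event
case object DoCheckpoint extends Event

object EventLoopModel {
  // Returns submissionTime - batchTime for the GenerateJobs event.
  def run(): Long = {
    val queue = new LinkedBlockingQueue[Event]()
    val batchTime = System.currentTimeMillis()

    // A slow DoCheckpoint queued ahead of GenerateJobs delays it,
    // because a single thread services every event in order.
    queue.put(DoCheckpoint)
    queue.put(GenerateJobs(batchTime))

    var submissionTime = 0L
    var running = true
    val loop = new Thread(() => {
      while (running) queue.take() match {
        case DoCheckpoint    => Thread.sleep(200) // pretend checkpointing is slow
        case GenerateJobs(_) =>
          submissionTime = System.currentTimeMillis()
          running = false
      }
    })
    loop.start()
    loop.join()
    submissionTime - batchTime
  }

  def main(args: Array[String]): Unit =
    println(s"submission lagged batch time by ${run()} ms")
}
```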
This EventLoop's thread then picks up these events and uses streamingContext.graph to generate a Job (invoking the InputDStream's compute method). batchInfo.submissionTime is the clock time recorded after this job generation completes. The Job is then handed to org.apache.spark.streaming.scheduler.JobScheduler, which uses a thread pool to execute it.

GenerateJobs events are not the only events posted to the JobGenerator's EventLoop. Other events, such as DoCheckpoint, ClearCheckpointData and ClearMetadata, are also posted, and all of them are serviced by the EventLoop's single thread. So if, for instance, DoCheckpoint, ClearCheckpointData and ClearMetadata events are queued ahead of your nth GenerateJobs event, there will be a gap between batchTime and submissionTime for that nth batch.

thanks
Mario

On Thu, Mar 10, 2016 at 10:29 AM, Sachin Aggarwal <different.sac...@gmail.com> wrote:

Hi Cody,

let me try once again to explain with an example. In Spark's BatchInfo class, "scheduling delay" is defined as

    def schedulingDelay: Option[Long] = processingStartTime.map(_ - submissionTime)

I am dumping the BatchInfo object in my LatencyListener, which extends StreamingListener:

    batchTime      = 1457424695400 ms
    submissionTime = 1457425630780 ms
    difference     =      935380 ms

Can this be considered a lag in the processing of events? What is a possible explanation for this lag?

On Thu, Mar 10, 2016 at 12:22 AM, Cody Koeninger <c...@koeninger.org> wrote:

I'm really not sure what you're asking.

On Wed, Mar 9, 2016 at 12:43 PM, Sachin Aggarwal <different.sac...@gmail.com> wrote:
> Where are we capturing this delay?
> I am aware of scheduling delay, which is defined as processing
> time - submission time, not the batch creation time.
>
> On Wed, Mar 9, 2016 at 10:46 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Spark Streaming by default will not start processing a batch until the
>> current batch is finished.
>> So if your processing time is larger than
>> your batch time, delays will build up.
>>
>> On Wed, Mar 9, 2016 at 11:09 AM, Sachin Aggarwal
>> <different.sac...@gmail.com> wrote:
>> > Hi All,
>> >
>> > We have batchTime and submissionTime:
>> >
>> > @param batchTime Time of the batch
>> > @param submissionTime Clock time of when the jobs of this batch were
>> > submitted to the streaming scheduler queue
>> >
>> > 1) We are seeing a difference between batchTime and submissionTime,
>> > even in minutes, for small batches (300 ms) with direct Kafka. We see
>> > this only when the processing time is more than the batch interval.
>> > How can we explain this delay?
>> >
>> > 2) In the case where the batch processing time is more than the batch
>> > interval, will Spark fetch the next batch's data from Kafka in
>> > parallel with processing the current batch, or will it wait for the
>> > current batch to finish first?
>> >
>> > I would be thankful if you could give me some pointers.
>> >
>> > Thanks!
>> > --
>> > Thanks & Regards
>> > Sachin Aggarwal
>> > 7760502772
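Cody's point about delays building up can be sketched numerically (plain Scala, not Spark code; the model simply assumes one batch runs at a time, so each batch starts at the later of its scheduled time and the previous batch's finish time):

```scala
// Rough model: with sequential batch execution, when processingTime
// exceeds batchInterval the per-batch delay grows linearly.
object BacklogModel {
  def delays(batchInterval: Long, processingTime: Long, n: Int): Seq[Long] = {
    var prevFinish = 0L
    (0 until n).map { i =>
      val batchTime = i * batchInterval
      val start     = math.max(batchTime, prevFinish) // wait for previous batch
      prevFinish    = start + processingTime
      start - batchTime                               // how late this batch starts
    }
  }

  def main(args: Array[String]): Unit = {
    // 300 ms batches that each take 500 ms to process:
    println(delays(300, 500, 5)) // Vector(0, 200, 400, 600, 800)
  }
}
```

Each batch falls a further 200 ms behind, so after enough batches the gap between batchTime and submission can reach minutes, consistent with the 935380 ms difference reported above.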