smrosenberry commented on issue #23926: [SPARK-26872][STREAMING] Use a configurable value for final termination in the JobScheduler.stop() method
URL: https://github.com/apache/spark/pull/23926#issuecomment-468943623

Basically, I found I could process a single batch of file input data through a streaming pipeline by:

1. Preloading the streaming context queue with an RDD of the records from the file(s): `StreamingContext.queueStream(queue, false)`
2. Starting the streaming context: `StreamingContext.start()`
3. Immediately and gracefully stopping the streaming context: `StreamingContext.stop(true, true)`

The batch interval, not unexpectedly, determines when the first (and in my case only) batch actually begins processing. Since I'm impatient (and who among us isn't?), my batch interval is 1 millisecond, so processing begins immediately.

Based on the size of the input file, my expectation is to set the new spark.streaming.jobTimeout value to twice the guesstimated run time. I expect my jobs to run for hours, not days. While specifying the jobTimeout in units of hours would be acceptable, it may not be granular enough for other potential use cases. Specifying the timeout in minutes feels like the proper compromise between flexibility and awkwardly large numbers.
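For context, the three steps above could be sketched roughly as follows. This is a minimal, hypothetical illustration assuming Spark Streaming 2.x, a local master, and a placeholder input path (`input/path`); it is not code from the PR itself.

```scala
import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object SingleBatchExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "single-batch")
    // A 1 ms batch interval so the first (and only) batch starts immediately.
    val ssc = new StreamingContext(sc, Milliseconds(1))

    // 1. Preload the queue with one RDD of the file's records.
    //    oneAtATime = false lets the first batch drain the whole queue.
    val queue = mutable.Queue[RDD[String]](sc.textFile("input/path"))
    val stream = ssc.queueStream(queue, oneAtATime = false)
    stream.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))

    // 2. Start the streaming context.
    ssc.start()

    // 3. Immediately request a graceful stop: the in-flight (queued) batch
    //    is allowed to finish before the contexts shut down.
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

The graceful stop in step 3 is what makes the pattern work: `stop(true, true)` blocks until the already-submitted batch completes, which is also why an unbounded wait there motivates the configurable timeout discussed in this PR.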
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]

With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
