smrosenberry commented on issue #23926: [SPARK-26872][STREAMING] Use a configurable value for final termination in the JobScheduler.stop() method
URL: https://github.com/apache/spark/pull/23926#issuecomment-468959856

The application processes streaming data from Kafka 24/7. File processing is a backup mechanism for those "rare" occasions when something goes bump in the night and downstream processing fails. We manually run the same application to recover the missing output by processing the raw input files that were saved while the streaming data was being handled.

We have had issues with the manual process using `StreamingContext.textFileStream()`, including the sheer number of files, the amount of data, and the time needed to copy the raw input files into the directory monitored by the `textFileStream`. The single-batch file technique I outlined previously lets us use the same application to process the input data, but now by reading it directly from where it sits, without the copying and file watching.

The longer-term goal is to restructure the application architecture so that either a `DStream` or a `DataFrame` pipeline can be built: `DStream` for Kafka streaming processing, `DataFrame` for file input processing. Unfortunately, in the current architecture DStream parameters are passed widely and deeply, so restructuring will not be a quick or easy effort.
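A minimal sketch of what such a single-batch run might look like, assuming the standard `StreamingContext.queueStream` API; the application name, paths, batch interval, and timeout value are all illustrative, not taken from the actual application:

```scala
// Hypothetical sketch: feed a fixed set of saved input files through the same
// DStream pipeline as one queued batch, instead of copying files into a
// directory watched by textFileStream(). Paths and timeouts are assumptions.
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SingleBatchFileRun {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("single-batch-file-run")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Read the raw input files directly from where they sit.
    val fileRdd = ssc.sparkContext.textFile("hdfs:///backup/raw-input/*")

    // queueStream with oneAtATime = true emits the queued RDD as a single
    // batch; the existing DStream-based processing consumes it unchanged.
    val queue  = mutable.Queue(fileRdd)
    val stream = ssc.queueStream(queue, oneAtATime = true)

    stream.foreachRDD { rdd =>
      // The application's existing per-batch processing would go here.
      println(s"batch record count: ${rdd.count()}")
    }

    ssc.start()
    // The run must terminate once the single batch completes, which is where
    // the hard-coded final-termination wait in JobScheduler.stop() (the
    // subject of this PR) comes into play.
    ssc.awaitTerminationOrTimeout(60000L)
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

With `oneAtATime = true`, each queued RDD is consumed in its own batch interval, so a single queued RDD yields exactly one batch before the graceful `stop()` drains the pipeline.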
