smrosenberry commented on issue #23926: [SPARK-26872][STREAMING] Use a 
configurable value for final termination in the JobScheduler.stop() method
URL: https://github.com/apache/spark/pull/23926#issuecomment-468959856
 
 
   The application processes streaming data from Kafka 24/7.  The file 
processing is a backup mechanism for those "rare" occasions when something goes 
bump in the night and downstream processing fails.  We manually run the same 
application to pick up the missing output by processing the raw input files 
that were saved while processing the streaming data.
   
   We have had issues with the manual process using 
`StreamingContext.textFileStream()`, including the sheer number of files, the 
amount of data, and the time needed to copy the raw input files into the 
directory monitored by the `textFileStream`.
   
   The single-batch file technique I outlined previously lets us reuse the 
same application to process the input data, but now by reading it directly 
from where it sits, without the file copying and directory watching.
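
   For illustration, a minimal sketch of one way such a single-batch approach 
could look, assuming a queue-based input (`queueStream`) feeding the existing 
`DStream` pipeline — the path and the `buildPipeline` name are placeholders, 
not details from the actual application:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val sc  = new SparkContext("local[*]", "backfill")
val ssc = new StreamingContext(sc, Seconds(1))

// Read the raw input files directly from where they sit (placeholder path).
val backfillRdd = sc.textFile("hdfs:///raw/input/dir/*")

// A queue holding a single RDD pushes exactly one batch through the pipeline.
val queue = mutable.Queue(backfillRdd)
val input = ssc.queueStream(queue, oneAtATime = true)

// buildPipeline stands in for the application's existing DStream processing.
// buildPipeline(input)

ssc.start()
// Stop gracefully once the single batch drains; the wait inside
// JobScheduler.stop() is what this PR proposes making configurable.
ssc.stop(stopSparkContext = true, stopGracefully = true)
```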
   
   The longer-term goal is to restructure the application architecture so 
that either a `DStream` or a `DataFrame` pipeline can be built: `DStream` for 
Kafka streaming processing, `DataFrame` for file input processing.  
Unfortunately, the current architecture passes `DStream` parameters widely 
and deeply, and restructuring will not be a quick, easy effort.
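
   One hedged sketch of that dual-pipeline shape: keep the record-level logic 
in plain functions so both a Kafka-backed `DStream` and a file-backed 
`Dataset` can reuse it.  All names below are illustrative, not from the 
actual application:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.streaming.dstream.DStream

// Shared, engine-agnostic per-record transformation (illustrative).
def parseAndFilter(line: String): Option[String] =
  if (line.nonEmpty) Some(line.trim) else None

// Streaming path: the Kafka-backed DStream pipeline.
def streamingPipeline(lines: DStream[String]): DStream[String] =
  lines.flatMap(l => parseAndFilter(l))

// Batch path: the same logic over files, as a Dataset.
def filePipeline(spark: SparkSession, path: String): Dataset[String] = {
  import spark.implicits._
  spark.read.textFile(path).flatMap(l => parseAndFilter(l))
}
```

   The point of the split is that only the thin pipeline wrappers know which 
engine they run on, so the widely-passed `DStream` parameters could shrink to 
one entry point per path.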

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services