Folks,

I've been building out a large machine learning repository using Spark as the compute platform, running on YARN and Hadoop. I was wondering if folks have some best-practice thoughts on unit testing and integration testing this application. I am using spark-submit and a configuration file to enable a dynamic workflow, so that we can build a different ML repo for each of our models. The ML repos consist of Parquet files and, eventually, Hive tables.

I want to be able to unit test this application using ScalaTest or some other recommended utility. I also want to integration test the application in our int environment. Specifically, we have dev and int environments, and eventually a prod environment, each consisting of Spark running on Hadoop using YARN.
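To make the unit-testing part concrete, here is a minimal sketch of the kind of test I have in mind. It assumes ScalaTest 3.1 or later (for `AnyFunSuite`) and Spark SQL on the test classpath, and it runs against a local-mode SparkSession rather than YARN. `FeatureBuilder` and `dropNullLabels` are hypothetical names standing in for one of our transformations, not code from the actual application.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical transformation standing in for one step of the ML repo build.
object FeatureBuilder {
  // Keep only the rows that have a non-null "label" column.
  def dropNullLabels(df: DataFrame): DataFrame =
    df.filter(col("label").isNotNull)
}

class FeatureBuilderSuite extends AnyFunSuite with BeforeAndAfterAll {

  // Local-mode session: no YARN or Hadoop cluster is needed for unit tests.
  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("feature-builder-unit-test")
    .getOrCreate()

  override protected def afterAll(): Unit = {
    spark.stop()
    super.afterAll()
  }

  test("dropNullLabels removes rows without a label") {
    // Bind the session to a local val so its implicits can be imported.
    val session = spark
    import session.implicits._

    // Small in-memory input; Option[Double] becomes a nullable double column.
    val input = Seq(
      ("a", Option(1.0)),
      ("b", Option.empty[Double]),
      ("c", Option(0.0))
    ).toDF("id", "label")

    val result = FeatureBuilder.dropNullLabels(input)

    assert(result.count() === 2L)
    assert(result.select("id").as[String].collect().toSet === Set("a", "c"))
  }
}
```

The idea is that the transformation logic takes and returns DataFrames, so it can be tested without spark-submit or the configuration file. Whether that is the recommended pattern is part of what I am asking.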
The ideal workflow in my mind would be:

1) unit tests run on every check-in in our dev environment
2) application gets propagated to our int environment
3) integration tests run successfully in our int environment
4) application gets propagated to our prod environment
5) Hive table/Parquet file gets generated and consumed by Scala notebooks running on top of the Spark cluster

Caveat: I wasn't sure whether this was more appropriate for the dev or the user mailing list, but since I am only following dev, I sent it here.

Best Regards