Folks,
I've been building out a large machine learning repository using Spark as the
compute platform, running on YARN and Hadoop, and I was wondering whether folks
have best-practice thoughts on unit testing and integration testing this
application. I am using spark-submit and a configuration file to enable a
dynamic workflow, so that we can build different ML repos for each of our
models. The ML repos consist of Parquet files and eventually Hive tables. I
want to be able to unit test this application using ScalaTest or some other
recommended utility, and I also want to integration test the application in our
int environment. Specifically, we have dev and int environments, and eventually
a prod environment, each consisting of Spark running on Hadoop using YARN.
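
For concreteness, the kind of unit test I have in mind looks roughly like the
sketch below: ScalaTest's FunSuite style (AnyFunSuite in newer ScalaTest
versions) with a local[*] SparkSession, exercising a single transformation. The
Featurizer object here is just a hypothetical stand-in for one of our pipeline
stages, not actual code from the repo.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.when
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Stand-in for one of our pipeline stages; in the real repo this would
// live in the application code rather than in the test.
object Featurizer {
  def addLabel(df: DataFrame, threshold: Double): DataFrame =
    df.withColumn("label", when(df("score") > threshold, 1).otherwise(0))
}

class FeaturizerSuite extends FunSuite with BeforeAndAfterAll {

  // local[*] master so the test runs on the build box without any
  // YARN/Hadoop cluster.
  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("featurizer-unit-tests")
    .getOrCreate()

  override def afterAll(): Unit = spark.stop()

  test("addLabel flags rows whose score exceeds the threshold") {
    import spark.implicits._

    val output = Featurizer.addLabel(
      Seq((1, 0.5), (2, 1.5)).toDF("id", "score"), threshold = 1.0)

    assert(output.columns.contains("label"))
    assert(output.filter($"label" === 1).count() == 1)
  }
}

Since these run entirely in local mode, they could be wired into the build so
they execute on every check-in.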


The ideal workflow in my mind would be:
1) unit tests run upon every check-in in our dev environment
2) application gets propagated to our int environment
3) integration tests run successfully in our int environment (sketch below)
4) application gets propagated to our prod environment
5) Hive table/Parquet file gets generated and consumed by Scala notebooks
running on top of the Spark cluster
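
For step 3, what I'm picturing is roughly the check below: after the
spark-submit job has run in int, a test reads back the generated output and
validates some basic invariants. The path and column names are placeholders; in
practice they would come from the same configuration file that drives the
spark-submit workflow.

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

// Hypothetical post-run check for the int environment: the job is
// assumed to have already written its output via spark-submit, and this
// suite just reads it back and validates it.
class ModelOutputIntegrationSuite extends FunSuite {

  // Placeholder location for one model's ML repo.
  private val outputPath = "hdfs:///int/ml-repos/model_a/features.parquet"

  private lazy val spark: SparkSession = SparkSession.builder()
    .appName("model-output-integration-check")
    .getOrCreate()

  test("generated Parquet output is non-empty with the expected columns") {
    val df = spark.read.parquet(outputPath)

    assert(df.count() > 0, "expected the ML repo to contain rows")
    assert(Seq("id", "features", "label").forall(df.columns.contains),
      "expected the core feature columns to be present")
  }
}

If a run like that passes in int, the same artifact would then be promoted to
prod (step 4).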


**Caveat: I wasn't sure whether this was more appropriate for the dev or the
user mailing list, but given that I only follow dev, I sent it here.


Best Regards
