Hi Steve
Thanks for the detailed response. I don't think this problem has an industry-standard solution yet, so I'm sure a lot of people would benefit from the discussion. I realise now what you were saying, so thanks for clarifying. That said, let me try to explain how we approached it.

There are two problems you highlighted: the first is moving the code from SCM to prod, and the other is ensuring the data your code uses is correct (i.e. using the latest data from prod).

*"how do you get your code from SCM into production?"*

We currently run our pipeline via Airflow, with our DAGs in S3. To get our code from SCM to production:

1) A Jenkins build compiles our Spark applications and runs the tests
2) Once that build succeeds, a second build copies the DAGs to an S3 folder

We then routinely sync this folder to the local Airflow DAGs folder every X minutes.

Re test data:

*"but what's your strategy for test data: that's always the troublespot."*

Our application versions the data: we expect the source data to be at a certain version and the output data to be at a certain version as well. We have a test resources folder that follows the same versioning convention - this is the data our application tests use, to ensure the data is in the correct format.

So, for example, if we have table X at version 1, which depends on data from tables A and B also at version 1, we run our Spark application and then check that the transformed table X has the correct columns and row values. When there is a new version 2 of the source data, or we add a new column to table X (version 2), we generate a new version of the test data and update the tests. That way we ensure any new version of the data has tests against it.

*"I've never seen any good strategy there short of "throw it at a copy of the production dataset"."*

I agree, which is why we keep a sample of the production data and version the schemas
we expect the source and target data to look like.

If people are interested, I am happy to write a blog post about it, in the hope that it helps people build more reliable pipelines.

Kind Regards
Sam

On Tue, Apr 11, 2017 at 11:31 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> On 7 Apr 2017, at 18:40, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> Definitely agree with gourav there. I wouldn't want jenkins to run my work
> flow. Seems to me that you would only be using jenkins for its scheduling
> capabilities
>
> Maybe I was just looking at this differently
>
> Yes you can run tests but you wouldn't want it to run your orchestration
> of jobs
>
> What happens if jenkins goes down for any particular reason. How do you
> have the conversation with your stakeholders that your pipeline is not
> working and they don't have data because the build server is going through
> an upgrade
>
>
> Well, I wouldn't use it as a replacement for Oozie, but I'd certainly
> consider it as the pipeline for getting your code out to the cluster, so you
> don't have to explain why you just pushed out something broken.
>
> As an example, here's Renault's pipeline as discussed last week in Munich:
> https://flic.kr/p/Tw3Emu
>
> However, to be fair, I understand what you are saying Steve: if someone is in
> a place where you only have access to Jenkins and have to go through hoops
> to set up or get access to new instances, then engineers will do what they
> always do - find ways to game the system to get their work done
>
>
> This isn't about trying to "game the system"; this is about what makes a
> replicable workflow for getting code into production, either at the press
> of a button or as part of a scheduled "we push out an update every night,
> rerun the deployment tests and then switch over to the new installation"
> mechanism.
>
> Put differently: how do you get your code from SCM into production?
> Not just for CI, but what's your strategy for test data: that's always the
> troublespot. Random selection of rows may work, although it will skip the
> odd outlier (high-unicode char in what should be a LATIN-1 field, time set
> to 0, etc), and for work joining > 1 table, you need rows which join well.
> I've never seen any good strategy there short of "throw it at a copy of the
> production dataset".
>
> -Steve
>
> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi Steve,
>>
>> Why would you ever do that? You are suggesting the use of a CI tool as a
>> workflow and orchestration engine.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>> If you have Jenkins set up for some CI workflow, that can do scheduled
>>> builds and tests. Works well if you can do some build test before even
>>> submitting it to a remote cluster
>>>
>>> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>>
>>> Hi Shyla
>>>
>>> You have multiple options really, some of which have already been listed,
>>> but let me try and clarify.
>>>
>>> Assuming you have a Spark application in a jar, you have a variety of
>>> options.
>>>
>>> You have to have an existing Spark cluster that is either running on EMR
>>> or somewhere else.
>>>
>>> *Super simple / hacky*
>>> Cron job on EC2 that calls a simple shell script that does a spark-submit
>>> to a Spark cluster, OR creates or adds a step to an EMR cluster
>>>
>>> *More Elegant*
>>> Airflow/Luigi/AWS Data Pipeline (which is just cron in the UI) that
>>> will do the above step but adds scheduling, potential backfilling and
>>> error handling (retries, alerts etc.)
>>>
>>> AWS are coming out with Glue <https://aws.amazon.com/glue/> soon, which
>>> does some Spark jobs, but I do not think it's available worldwide just yet.
>>>
>>> Hope I cleared things up
>>>
>>> Regards
>>> Sam
>>>
>>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta
>>> <gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi Shyla,
>>>>
>>>> why would you want to schedule a spark job in EC2 instead of EMR?
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande
>>>> <deshpandesh...@gmail.com> wrote:
>>>>
>>>>> I want to run a spark batch job maybe hourly on AWS EC2. What is the
>>>>> easiest way to do this? Thanks
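P.S. To make the versioned test-data idea at the top of this thread concrete, here is a rough toy sketch in plain Python. This is not our actual pipeline code - the table names, the schema map and the transform are all made-up illustrations, and the real jobs run in Spark - but it shows the convention: source tables and the derived table share a version number, and the tests pin the expected schema and row values per version.

```python
# Toy sketch of the versioned test-data convention (illustrative only).
# Each version of derived "table X" has a pinned schema; fixtures for
# tables A and B live in a per-version test-resources layout.

EXPECTED_SCHEMAS = {
    # version -> expected columns of derived table X
    1: ["id", "amount"],
    2: ["id", "amount", "currency"],  # v2 added a column
}

def transform(version, table_a, table_b):
    """Toy stand-in for the Spark job: join A and B into table X."""
    b_by_id = {row["id"]: row for row in table_b}
    out = []
    for row in table_a:
        joined = {"id": row["id"],
                  "amount": row["amount"] * b_by_id[row["id"]]["rate"]}
        if version >= 2:
            joined["currency"] = b_by_id[row["id"]]["currency"]
        out.append(joined)
    return out

def check_version(version, result):
    """The test harness: every output row must match the pinned schema."""
    expected = EXPECTED_SCHEMAS[version]
    for row in result:
        assert sorted(row) == sorted(expected), (version, row)
    return True

# Versioned fixtures, mirroring a test-resources folder laid out per version.
table_a_v1 = [{"id": 1, "amount": 10.0}]
table_b_v1 = [{"id": 1, "rate": 2.0, "currency": "GBP"}]

x_v1 = transform(1, table_a_v1, table_b_v1)
x_v2 = transform(2, table_a_v1, table_b_v1)
check_version(1, x_v1)
check_version(2, x_v2)
```

When a new version of the source data appears, you add version 3 fixtures and a version 3 entry to the schema map, and the tests fail loudly until both sides agree - which is the whole point of the convention.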