Hi Steve
Thanks for the detailed response. I don't think this problem has an industry-standard solution yet, so I'm sure a lot of people would benefit from the discussion. I realise now what you were saying, so thanks for clarifying. That said, let me try to explain how we approached it.

There are two problems you highlighted: the first is moving the code from SCM to prod, and the other is ensuring the data your code uses is correct (i.e. using the latest data from prod).

*"how do you get your code from SCM into production?"*

We currently run our pipeline via Airflow, with our DAGs in S3. To get our code from SCM to production:

1) A Jenkins build compiles our Spark applications and runs the tests
2) Once that build succeeds, a second build copies the DAGs to an S3 folder

We then routinely sync this folder to the local Airflow DAGs folder every X minutes.

Re test data:

*"but what's your strategy for test data: that's always the troublespot."*

Our application versions the data: we expect the source data to be at a certain version and the output data to be at a certain version as well. We have a test resources folder that follows the same versioning convention - this is the data our application tests use, to ensure the data is in the correct format.

So, for example, if we have table X at version 1, which depends on data from tables A and B also at version 1, we run our Spark application and then check that the transformed table X has the correct columns and row values. When there is a new version 2 of the source data, or we add a new column to table X (version 2), we generate a new version of the test data and update the tests. That way we ensure any new version of the data has tests against it.

*"I've never seen any good strategy there short of "throw it at a copy of the production dataset"."*

I agree, which is why we keep a sample of the production data and version the schemas
we expect the source and target data to look like.

If people are interested, I am happy to write a blog post about it, in the hope that it helps people build more reliable pipelines.

Kind Regards
Sam

On Tue, Apr 11, 2017 at 11:31 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> On 7 Apr 2017, at 18:40, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> Definitely agree with gourav there. I wouldn't want jenkins to run my work
> flow. Seems to me that you would only be using jenkins for its scheduling
> capabilities
>
> Maybe I was just looking at this differently
>
> Yes you can run tests but you wouldn't want it to run your orchestration
> of jobs
>
> What happens if jenkins goes down for any particular reason. How do you
> have the conversation with your stakeholders that your pipeline is not
> working and they don't have data because the build server is going through
> an upgrade
>
>
> Well, I wouldn't use it as a replacement for Oozie, but I'd certainly
> consider it as the pipeline for getting your code out to the cluster, so you
> don't have to explain why you just pushed out something broken.
>
> As an example, here's Renault's pipeline as discussed last week in Munich:
> https://flic.kr/p/Tw3Emu
>
> However, to be fair, I understand what you are saying Steve: if someone is in
> a place where you only have access to Jenkins and have to go through hoops
> to set up or get access to new instances, then engineers will do what they
> always do - find ways to game the system to get their work done
>
>
> This isn't about trying to "game the system"; this is about what makes a
> replicable workflow for getting code into production, either at the press
> of a button or as part of a scheduled "we push out an update every night,
> rerun the deployment tests and then switch over to the new installation"
> mechanism.
>
> Put differently: how do you get your code from SCM into production?
> Not just for CI, but what's your strategy for test data: that's always the
> troublespot. Random selection of rows may work, although it will skip the
> odd outlier (high-unicode char in what should be a LATIN-1 field, time set
> to 0, etc), and for work joining > 1 table, you need rows which join well.
> I've never seen any good strategy there short of "throw it at a copy of the
> production dataset".
>
> -Steve
>
> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi Steve,
>>
>> Why would you ever do that? You are suggesting the use of a CI tool as a
>> workflow and orchestration engine.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>> If you have Jenkins set up for some CI workflow, that can do scheduled
>>> builds and tests. Works well if you can do some build test before even
>>> submitting it to a remote cluster
>>>
>>> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>>
>>> Hi Shyla
>>>
>>> You have multiple options really, some of which have already been listed,
>>> but let me try and clarify.
>>>
>>> Assuming you have a Spark application in a jar, you have a variety of
>>> options.
>>>
>>> You have to have an existing Spark cluster that is either running on EMR
>>> or somewhere else.
>>>
>>> *Super simple / hacky*
>>> Cron job on EC2 that calls a simple shell script that does a spark-submit
>>> to a Spark cluster, OR creates or adds a step to an EMR cluster
>>>
>>> *More Elegant*
>>> Airflow/Luigi/AWS Data Pipeline (which is just cron in the UI) that
>>> will do the above step but adds scheduling, potential backfilling and
>>> error handling (retries, alerts etc.)
>>>
>>> AWS are coming out with Glue <https://aws.amazon.com/glue/> soon, which
>>> does some Spark jobs, but I do not think it's available worldwide just yet.
>>>
>>> Hope I cleared things up
>>>
>>> Regards
>>> Sam
>>>
>>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta
>>> <gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi Shyla,
>>>>
>>>> why would you want to schedule a spark job in EC2 instead of EMR?
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande
>>>> <deshpandesh...@gmail.com> wrote:
>>>>
>>>>> I want to run a spark batch job maybe hourly on AWS EC2. What is the
>>>>> easiest way to do this? Thanks
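P.S. To make the versioned test-data idea at the top of this thread concrete, here is a rough toy sketch in plain Python. This is not our actual pipeline code - the table names, the schema map and the transform are all made-up illustrations, and the real jobs run in Spark - but it shows the convention: source tables and the derived table share a version number, and the tests pin the expected schema and row values per version.

```python
# Toy sketch of the versioned test-data convention (illustrative only).
# Each version of derived "table X" has a pinned schema; fixtures for
# tables A and B live in a per-version test-resources layout.

EXPECTED_SCHEMAS = {
    # version -> expected columns of derived table X
    1: ["id", "amount"],
    2: ["id", "amount", "currency"],  # v2 added a column
}

def transform(version, table_a, table_b):
    """Toy stand-in for the Spark job: join A and B into table X."""
    b_by_id = {row["id"]: row for row in table_b}
    out = []
    for row in table_a:
        joined = {"id": row["id"],
                  "amount": row["amount"] * b_by_id[row["id"]]["rate"]}
        if version >= 2:
            joined["currency"] = b_by_id[row["id"]]["currency"]
        out.append(joined)
    return out

def check_version(version, result):
    """The test harness: every output row must match the pinned schema."""
    expected = EXPECTED_SCHEMAS[version]
    for row in result:
        assert sorted(row) == sorted(expected), (version, row)
    return True

# Versioned fixtures, mirroring a test-resources folder laid out per version.
table_a_v1 = [{"id": 1, "amount": 10.0}]
table_b_v1 = [{"id": 1, "rate": 2.0, "currency": "GBP"}]

x_v1 = transform(1, table_a_v1, table_b_v1)
x_v2 = transform(2, table_a_v1, table_b_v1)
check_version(1, x_v1)
check_version(2, x_v2)
```

When a new version of the source data appears, you add version 3 fixtures and a version 3 entry to the schema map, and the tests fail loudly until both sides agree - which is the whole point of the convention.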