Hi all,

Has any progress been made on this effort? I think it's really interesting and very important for driving further adoption of Apache Airflow.
Small plug: I started a public repo to test out my ideas as written in this thread: https://github.com/gtoonstra/airflow-hovercraft

The "reference implementation" may be a bit too ambitious, but we can see where it goes. The intention is to explore what engineers run into when dealing with Airflow, and I think having a solid testing approach is paramount to delivering reliably and on time. I'm not intending to copy the core/contrib stuff that is there; I'm mostly wrapping it to be able to instrument it and run the tests in the way envisioned: against real backends, and with simulated data at higher levels of workflow execution using Python behavior testing.

Anyone willing to participate, privmsg me if you're interested and I can add you.

Rgds,

Gerard

On Thu, May 18, 2017 at 2:00 PM, Gerard Toonstra <[email protected]> wrote:

On Tue, May 9, 2017 at 9:46 PM, Arthur Wiedmer <[email protected]> wrote:

Hi,

I would love to see if we can contribute some of the work we have done internally at Airbnb to support some testing of DAGs. We have a long way to go though :)

Best,
Arthur

On Tue, May 9, 2017 at 12:34 PM, Sam Elamin <[email protected]> wrote:

Thanks Gerard and Laura, I have created an email thread as agreed in the call, so let's take the discussion there. If anyone else is interested in helping us build this library, please do get in touch!

On Tue, May 9, 2017 at 5:40 PM, Laura Lorenz <[email protected]> wrote:

Good points @Gerard. I think the distinctions you make between different testing considerations could help us focus our efforts. Here's my 2 cents in the buckets you describe; I'm wondering if any of these use cases align with anyone else and can help narrow our scope, and if I understood you right @Gerard:

Regarding platform code: For our own platform code (i.e. custom Operators and Hooks), we have our CI platform running unit tests on their construction and, in the case of hooks, integration tests on connectivity. The latter involves us setting up test integration services (i.e. a test MySQL process) which we start up as docker containers, and we flip our Airflow's configuration to point at them during testing using environment variables. It seems from a browse of Airflow's tests that operators and hooks are mostly unit tested, with the integrations mocked or skipped (e.g. https://github.com/apache/incubator-airflow/blob/master/tests/contrib/hooks/test_jira_hook.py#L40-L41 or https://github.com/apache/incubator-airflow/blob/master/tests/contrib/hooks/test_sqoop_hook.py#L123-L125). If the hook is using some other, well-tested library to actually establish the connection, the case can probably be made that custom operator and hook authors don't need integration tests; and since the normal unittest library is enough to handle these, they might not need to be in scope for a new testing library.
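As a minimal sketch of the unit-test style described above (in the spirit of the linked jira/sqoop hook tests): the hook's construction is tested directly while the integration is mocked away. `ReportingHttpHook` is a hypothetical stand-in for an in-house hook, and the example assumes the `requests` library is installed:

```python
import unittest
from unittest import mock

import requests  # the underlying, well-tested client library


class ReportingHttpHook(object):
    """Hypothetical custom hook that wraps a well-tested client library."""

    def __init__(self, conn_id='reporting_default'):
        self.conn_id = conn_id

    def get_conn(self):
        # In a real hook this would resolve self.conn_id to host/credentials.
        session = requests.Session()
        session.headers['X-Conn-Id'] = self.conn_id
        return session


class TestReportingHttpHook(unittest.TestCase):
    @mock.patch('requests.Session')
    def test_get_conn_uses_configured_conn_id(self, mock_session):
        # The real session class is mocked, so no connectivity is needed;
        # we only assert the hook wires its configuration through correctly.
        hook = ReportingHttpHook(conn_id='reporting_test')
        conn = hook.get_conn()
        mock_session.assert_called_once_with()
        conn.headers.__setitem__.assert_called_once_with(
            'X-Conn-Id', 'reporting_test')


if __name__ == '__main__':
    unittest.main()
```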
Regarding data manipulation functions of the business code: For us, we run tests on each operator in each DAG on CI, seeded with test input data and asserted against known output data, all of which we have compiled over time to represent different edge cases we expect or have seen. So this is a test at the level of the operator as described in a given DAG. Because we only describe edge cases we have seen or can predict, it's a very reactive way to handle testing at this level.

If I understand your idea right, another way to test (or at least surface errors) at this level is: given you have a DAG that is resilient against arbitrary data failures, your DAG should include a validation task/report at its end, or a test suite should run daily against the production error log for that DAG to surface errors your business code encountered on production data. I think this is really interesting, and it reminds me of an Airflow video I saw once (can't remember who gave the talk) on a DAG whose last task self-reported error counts and rows lost. If implemented as a test suite you would run against production, this might be a direction we would want a testing library to go in.

Regarding the workflow correctness of the business code: What we set out to do on our side was a hybrid version of your items 1 and 2, which we call "end-to-end tests": calling a whole DAG against 'real' existing systems (though really they are test docker containers of the processes we need, MySQL and Neo4j specifically, which we use environment variables to switch our Airflow to when instantiating hooks etc.), seeded with test input files for services that are hard to set up (i.e. third-party APIs we ingest data from). Since the whole DAG is seeded with known input data, this gives us a way to compare the last output of a DAG to a known file, so that if any workflow change OR business logic change in the middle affected the final output, we would know as part of our test suite instead of when production breaks. In other words, a way to test for a regression of the whole DAG. So this is the framework we were thinking needed to be created, and it is a direction we could go with a testing library as well.

This doesn't get to your point of determining what workflow was used, which is interesting, just not a use case we have encountered yet (we only have deterministic DAGs). In my mind, in this case we would want a testing suite to be able to more or less turn some DAGs "on" against seeded input data and mocked or test integration services, let a scheduler go at it, and then check the metadata database for what workflow happened (and, if we had test integration services, maybe also check the output against the known output for the seeded input). I can definitely see your suggestion of developing instrumentation to inspect a followed workflow as a useful addition a testing library could include.
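A rough sketch of the environment-variable switch described above, not a definitive implementation: Airflow resolves a connection from an `AIRFLOW_CONN_<CONN_ID>` environment variable (as a URI) before consulting the metadata database, so a test suite can repoint a DAG's hooks at throwaway docker services. The `etl_mysql` connection id, credentials, port, dag id and fixture paths are all assumptions for illustration:

```python
import os
import subprocess
import unittest


class DagEndToEndTest(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        # Bring up the throwaway backing service for the test stack.
        subprocess.check_call(['docker-compose', 'up', '-d', 'mysql-test'])
        # Any hook asking for the 'etl_mysql' connection now talks to the
        # container instead of the real warehouse.
        os.environ['AIRFLOW_CONN_ETL_MYSQL'] = (
            'mysql://test_user:test_pass@localhost:33306/test_db')

    def test_dag_regression(self):
        # Seed known input, run the DAG for one schedule interval, then
        # diff the final output against a known-good fixture.
        subprocess.check_call(
            ['airflow', 'backfill', 'etl_dag',
             '-s', '2017-05-01', '-e', '2017-05-01'])
        # ... compare output table/file to tests/fixtures/expected_output.csv
```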
To some degree our end-to-end DAG tests overlap in our workflow with your point 3 (UAT environment), but we've found that more useful for testing whether "wild data" causes uncaught exceptions or integration errors with difficult-to-mock third-party services, not DAG-level logic regressions, since the input data is unknown and thus we can't compare to a known output in this case; we depend instead on a fallible human QA, or just accept the DAG running with no exceptions as passing UAT.

Laura

On Tue, May 9, 2017 at 2:15 AM, Gerard Toonstra <[email protected]> wrote:

Very interesting video. I was unable to take part and have watched only part of it for now. Let us know where the discussion is being moved to.

The confluence does indeed seem to be the place to put final conclusions and thoughts.

For Airflow, I like to make a distinction between "platform" and "business" code. The platform code is the hooks and operators; it provides the capabilities of what your ETL system can do. You'll test this code with a lot of thoroughness, such that each component behaves how you'd expect, judging from the constructor interface. Any abstractions in there (like copying files to GCS) should be kept as hidden as possible (retries, etc.).

The "business" code is what runs on a daily basis. This can be divided into another two concerns for testing:

1. The workflow: the code between the data manipulation functions that decides which operators get called.
2. The data manipulation functions.

I think it's good practice to run tests on "2" on a daily basis and not just once on CI. The reason is that there are too many unforeseen circumstances where data can get into a bad state. So such tests shouldn't run once in a highly controlled environment like CI, but daily in a less predictable environment like production, where all kinds of weird things can happen that you'll be able to catch with proper checks in place. Even if the checks are too rigorous, you can skip them and improve on them, so that they fit what goes on in your environment to the best of your ability.
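A minimal sketch of such a daily production check, assuming Airflow 1.x-era import paths; the `warehouse_db` connection, table name, row-count threshold and dag id are illustrative, and Airflow's CheckOperator family could serve the same purpose:

```python
from datetime import datetime

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator


def check_daily_load(**context):
    # Fail loudly if today's partition looks wrong; alerting on this task
    # catches bad data states that CI could never have predicted.
    hook = MySqlHook(mysql_conn_id='warehouse_db')
    row = hook.get_first(
        "SELECT COUNT(*) FROM daily_events WHERE ds = %s",
        parameters=(context['ds'],))
    if row[0] < 1000:  # illustrative threshold
        raise ValueError(
            'Suspiciously few rows (%s) for %s' % (row[0], context['ds']))


dag = DAG('daily_etl', start_date=datetime(2017, 5, 1),
          schedule_interval='@daily')

validate = PythonOperator(
    task_id='validate_daily_events',
    python_callable=check_daily_load,
    provide_context=True,  # Airflow 1.x: pass ds/execution_date as kwargs
    dag=dag)

# Upstream ETL tasks would then set: load_task >> validate
```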
Which mostly leaves testing workflow correctness and platform code. What I had intended to do was:

1. Test the platform code against real existing systems (or maybe docker containers), to test their behavior in success and failure conditions.
2. Create workflow scripts for testing the workflow; this probably requires some specific changes in hooks, which wouldn't call out to other systems, but would just pick up small files you prepare from a testing repo and pass them around. The test script could also simulate unavailability, etc. This relieves you of the huge responsibility of setting up systems and docker containers and loading them with data. Airflow sets up pretty quickly as a docker container, and you can also start up a sample database with that. Afterwards, from a test script, you can check which workflow was followed by inspecting the database, so develop some instrumentation for that (see the sketch after this list).
3. Test the data manipulation in a UAT environment, mirroring the runs in production to some extent. That would be the place to verify that the data comes out correctly, and also to show people what kind of monitoring is in place to double-check that.
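A rough sketch of that instrumentation, assuming direct access to Airflow's metadata database through its own session and models; the dag id and expected task list are illustrative:

```python
from airflow import settings
from airflow.models import TaskInstance


def followed_workflow(dag_id, execution_date):
    """Return the task_ids that actually succeeded for one DAG run, by
    inspecting Airflow's metadata database."""
    session = settings.Session()
    try:
        tis = (session.query(TaskInstance)
               .filter(TaskInstance.dag_id == dag_id,
                       TaskInstance.execution_date == execution_date,
                       TaskInstance.state == 'success')
               .all())
        return sorted(ti.task_id for ti in tis)
    finally:
        session.close()


# In a test, after letting a scheduler/backfill run the DAG against the
# prepared small files, assert the branch you expected was taken, e.g.:
# assert followed_workflow('etl_dag', dt) == ['extract', 'branch_a', 'load']
```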
On Tue, May 9, 2017 at 1:14 AM, Arnie Salazar <[email protected]> wrote:

Scratch that. I see the whole video now.

On Mon, May 8, 2017 at 3:33 PM Arnie Salazar <[email protected]> wrote:

Thanks Sam!

Is there a part 2 to the video? If not, can you post the "next steps" notes you took whenever you have a chance?

Cheers,
Arnie

On Mon, May 8, 2017 at 3:08 PM Sam Elamin <[email protected]> wrote:

Hi Folks

For those of you who missed it, you can catch the discussion from the link in this tweet: https://twitter.com/samelamin/status/861703888298225670

Please do share, and feel free to get involved; the more feedback we get, the better the library we create will be :)

Regards
Sam

On Mon, May 8, 2017 at 9:43 PM, Sam Elamin <[email protected]> wrote:

Bit late notice, but the call is happening today at 9:15 UTC, so in about 30 mins or so.

It will be recorded, but if anyone would like to join in on the discussion, the hangout link is https://hangouts.google.com/hangouts/_/mbkr6xassnahjjonpuvrirxbnae

Regards
Sam

On Fri, 5 May 2017 at 21:35, Ali Uz <[email protected]> wrote:

I am also very interested in seeing how this turns out. Even though we don't have a testing framework in place on the project I am working on, I would very much like to contribute to some general framework for testing DAGs.

As of now we are just implementing dummy tasks that test our actual tasks and verify that the given input produces the expected output. Nothing crazy, and certainly not flexible in the long run.

On Fri, 5 May 2017 at 22:59, Sam Elamin <[email protected]> wrote:

Haha yes Scott, you are in!

On Fri, 5 May 2017 at 20:07, Scott Halgrim <[email protected]> wrote:

Sounds A+ to me. By “both of you” did you include me? My first response was just to your email address.

On May 5, 2017, 11:58 AM -0700, Sam Elamin <[email protected]> wrote:

Ok sounds great folks

Thanks for the detailed response, Laura! I'll invite both of you to the group if you are happy, and we can schedule a call for next week?

How does that sound?

On Fri, 5 May 2017 at 17:41, Laura Lorenz <[email protected]> wrote:

We do! We developed our own little in-house DAG test framework, which we could share insights on, and we would love to hear what other folks are up to. Basically we mock a DAG's input data, use the BackfillJob API directly to call a DAG in a test, and compare its outputs to the intended result given the inputs. We use docker/docker-compose to manage services, and split our dev and test stack locally so that the tests have their own scheduler and metadata database, and so that our CI tool knows how to construct the test stack as well.

We co-opted the BackfillJob API for our own purposes here, but it seemed overly complicated and fragile to start and interact with our own in-test-process executor like we saw in a few of the tests in the Airflow test suite. So I'd be really interested in finding a way to streamline how to describe a test executor for both the Airflow test suite and people's own DAG testing, and make that a first-class type of API.

Laura
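A minimal sketch of driving a DAG that way with the 1.x-era internals, assuming the test stack's metadata database is already initialized; the dag id and dates are illustrative, and this uses the same mechanism `airflow backfill` uses under the hood:

```python
from datetime import datetime

from airflow.jobs import BackfillJob
from airflow.models import DagBag


def run_dag_once(dag_id, execution_date):
    # Load the DAG definition from the configured dags folder and run one
    # schedule interval synchronously inside the test process.
    dag = DagBag().get_dag(dag_id)
    job = BackfillJob(dag=dag,
                      start_date=execution_date,
                      end_date=execution_date)
    job.run()


# In a test: seed inputs, run, then compare outputs to fixtures, e.g.:
# run_dag_once('etl_dag', datetime(2017, 5, 1))
```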
On Fri, May 5, 2017 at 11:46 AM, Sam Elamin <[email protected]> wrote:

Hi All

A few people in the Spark community are interested in writing a testing library for Airflow. We would love anyone who uses Airflow heavily in production to be involved.

At the moment (AFAIK) testing your DAGs is a bit of a pain, especially if you want to run them in a CI server.

Is anyone interested in being involved in the discussion?

Kind Regards
Sam
