Very interesting video. I was unable to take part and have only watched
part of it so far.
Please let us know where the discussion is being moved to.
Confluence does indeed seem to be the place to record final conclusions
and thoughts.
For Airflow, I like to make a distinction between "platform" and "business"
code. The platform code is the hooks and operators, which provide the
capabilities of what your ETL system can do. You test this code very
thoroughly, so that each component behaves the way you'd expect judging
from its constructor interface. Whatever sits behind an abstraction in
there (like copying files to GCS) should keep its internals, such as
retries, as hidden as possible.
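To make the platform-code point concrete, here is a minimal sketch of what
I mean by testing a component against its constructor interface while the
retries stay hidden inside it. CopyFileOperator and FlakyClient are
hypothetical names made up for this example; they are not Airflow classes,
and a real operator would look different.

```python
import time


class FlakyClient:
    """Hypothetical stand-in for a storage client that fails a set number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.uploaded = []

    def upload(self, path):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("transient failure")
        self.uploaded.append(path)


class CopyFileOperator:
    """Hypothetical operator: the caller says what to copy, retries stay internal."""

    def __init__(self, client, path, retries=3, delay=0.0):
        self.client = client
        self.path = path
        self.retries = retries
        self.delay = delay

    def execute(self):
        """Upload the file and return the number of attempts that were needed."""
        for attempt in range(self.retries + 1):
            try:
                self.client.upload(self.path)
                return attempt + 1
            except IOError:
                if attempt == self.retries:
                    raise  # out of retries: surface the failure to the caller
                time.sleep(self.delay)
```

A test then only states the expected outcome for a success case (two
transient failures, third attempt succeeds) and a failure case (more
failures than retries), without caring how the retry loop is written.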
The "business" code is what runs on a daily basis. For testing, it can be
divided into two further concerns:
1. The workflow: the code between the data manipulation functions that
   decides which operators get called.
2. The data manipulation functions themselves.
I think it's good practice to run the tests for "2" on a daily basis and
not just once on CI. The reason is that there are too many unforeseen
circumstances in which data can get into a bad state. So such tests
shouldn't run once in a highly controlled environment like CI, but daily
in a less predictable environment like production, where all kinds of
weird things can happen that you'll be able to catch with proper checks in
place. If the checks turn out to be too rigorous, you can skip them and
improve on them, so that they fit what goes on in your environment as
well as possible.
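As an illustration of such a daily check, here is a minimal sketch in
plain Python. The function name, the column names (customer_id, amount)
and the rules are assumptions chosen for the example, not anything from an
existing pipeline.

```python
from typing import Dict, List


def check_daily_load(rows: List[Dict], min_rows: int = 1) -> List[str]:
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # Hypothetical rules: every row needs a customer and a non-negative amount.
        if row.get("customer_id") is None:
            problems.append(f"row {i}: missing customer_id")
        amount = row.get("amount")
        if amount is None or amount < 0:
            problems.append(f"row {i}: invalid amount {amount!r}")
    return problems
```

A task that runs right after the data manipulation step could call this
and fail (or only warn) when the list is not empty. Returning the problems
instead of raising immediately makes it easy to relax a check that turns
out to be too strict.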
Which mostly leaves testing workflow correctness and platform code. What I
had intended to do was:
1. Test the platform code against real existing systems (or maybe docker
   containers), to test their behavior in success and failure conditions.
2. Create workflow scripts for testing the workflow. This probably
   requires some specific changes in hooks, which wouldn't call out to
   other systems, but would just pick up small files you prepare from a
   testing repo and pass them around. The test script could also simulate
   unavailability, etc.
   This relieves you of the huge responsibility of setting up systems and
   docker containers and loading them with data. Airflow sets up pretty
   quickly as a docker container and you can also start up a sample
   database with that. Afterwards, from a test script, you can check which
   workflow was followed by inspecting the database, so develop some
   instrumentation for that.
3. Test the data manipulation in a UAT environment, mirroring the runs in
   production to some extent.
   That would be a place to verify whether the data comes out correctly
   and also to show people what kind of monitoring is in place to
   double-check that.
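Point 2 above could look roughly like the sketch below. It is plain
Python, not Airflow code: FixtureFileHook, run_workflow and the step names
are all hypothetical, and the list of executed steps stands in for what
you would otherwise read back from the Airflow database.

```python
import os
import tempfile


class FixtureFileHook:
    """Hypothetical test double for a storage hook: reads prepared files from a
    local directory instead of calling a remote system."""

    def __init__(self, fixture_dir, available=True):
        self.fixture_dir = fixture_dir
        self.available = available
        self.calls = []  # simple instrumentation: which files were requested

    def download(self, name):
        self.calls.append(name)
        if not self.available:
            raise ConnectionError("simulated outage")
        with open(os.path.join(self.fixture_dir, name)) as f:
            return f.read()


def run_workflow(hook, name):
    """Toy workflow: returns the names of the steps that ran, in order."""
    steps = ["download"]
    try:
        data = hook.download(name)
    except ConnectionError:
        steps.append("alert")
        return steps
    steps.append("transform" if data.strip() else "skip_empty")
    return steps


def demo():
    """Run the workflow once with the system 'up' and once with it 'down'."""
    with tempfile.TemporaryDirectory() as fixture_dir:
        with open(os.path.join(fixture_dir, "orders.csv"), "w") as f:
            f.write("id,amount\n1,10\n")
        ok = run_workflow(FixtureFileHook(fixture_dir), "orders.csv")
        down = run_workflow(FixtureFileHook(fixture_dir, available=False), "orders.csv")
    return ok, down
```

The test script then only asserts on which path the workflow took for each
prepared input, which is the part that CI can verify reliably.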
On Tue, May 9, 2017 at 1:14 AM, Arnie Salazar <[email protected]>
wrote:
> Scratch that. I see the whole video now.
>
> On Mon, May 8, 2017 at 3:33 PM Arnie Salazar <[email protected]>
> wrote:
>
> > Thanks Sam!
> >
> > Is there a part 2 to the video? If not, can you post the "next steps"
> > notes you took whenever you have a chance?
> >
> > Cheers,
> > Arnie
> >
> > On Mon, May 8, 2017 at 3:08 PM Sam Elamin <[email protected]>
> wrote:
> >
> >> Hi Folks
> >>
> >> For those of you who missed it, you can catch the discussion from the
> link
> >> on this tweet <https://twitter.com/samelamin/status/861703888298225670>
> >>
> >> Please do share and feel free to get involved as the more feedback we
> get
> >> the better the library we create is :)
> >>
> >> Regards
> >> Sam
> >>
> >> On Mon, May 8, 2017 at 9:43 PM, Sam Elamin <[email protected]>
> >> wrote:
> >>
> >> > Bit late notice but the call is happening today at 9 15 utc so in
> about
> >> > 30 mins or so
> >> >
> >> > It will be recorded but if anyone would like to join in on the
> >> discussion
> >> > the hangout link is https://hangouts.google.com/hangouts/_/
> >> > mbkr6xassnahjjonpuvrirxbnae
> >> >
> >> > Regards
> >> > Sam
> >> >
> >> > On Fri, 5 May 2017 at 21:35, Ali Uz <[email protected]> wrote:
> >> >
> >> >> I am also very interested in seeing how this turns out. Even though
> we
> >> >> don't have a testing framework in-place on the project I am working
> >> on, I
> >> >> would very much like to contribute to some general framework for
> >> testing
> >> >> DAGs.
> >> >>
> >> >> As of now we are just implementing dummy tasks that test our actual
> >> tasks
> >> >> and verify if the given input produces the expected output. Nothing
> >> crazy
> >> >> and certainly not flexible in the long run.
> >> >>
> >> >>
> >> >> On Fri, 5 May 2017 at 22:59, Sam Elamin <[email protected]>
> >> wrote:
> >> >>
> >> >> > Haha yes Scott you are in!
> >> >> > On Fri, 5 May 2017 at 20:07, Scott Halgrim <
> [email protected]
> >> >
> >> >> > wrote:
> >> >> >
> >> >> > > Sounds A+ to me. By “both of you” did you include me? My first
> >> >> response
> >> >> > > was just to your email address.
> >> >> > >
> >> >> > > On May 5, 2017, 11:58 AM -0700, Sam Elamin <
> >> [email protected]>,
> >> >> > > wrote:
> >> >> > > > Ok sounds great folks
> >> >> > > >
> >> >> > > > Thanks for the detailed response laura! I'll invite both of you
> >> to
> >> >> the
> >> >> > > > group if you are happy and we can schedule a call for next
> week?
> >> >> > > >
> >> >> > > > How does that sound?
> >> >> > > > On Fri, 5 May 2017 at 17:41, Laura Lorenz <
> >> [email protected]
> >> >> >
> >> >> > > wrote:
> >> >> > > >
> >> >> > > > > We do! We developed our own little in-house DAG test
> framework
> >> >> which
> >> >> > we
> >> >> > > > > could share insights on/would love to hear what other folks
> >> are up
> >> >> > to.
> >> >> > > > > Basically we use mock a DAG's input data, use the BackfillJob
> >> API
> >> >> > > directly
> >> >> > > > > to call a DAG in a test, and compare its outputs to the
> >> intended
> >> >> > result
> >> >> > > > > given the inputs. We use docker/docker-compose to manage
> >> services,
> >> >> > and
> >> >> > > > > split our dev and test stack locally so that the tests have
> >> their
> >> >> own
> >> >> > > > > scheduler and metadata database and so that our CI tool knows
> >> how
> >> >> to
> >> >> > > > > construct the test stack as well.
> >> >> > > > >
> >> >> > > > > We co-opted the BackfillJob API for our own purposes here,
> but
> >> it
> >> >> > > seemed
> >> >> > > > > overly complicated and fragile to start and interact with our
> >> own
> >> >> > > > > in-test-process executor like we saw in a few of the tests in
> >> the
> >> >> > > Airflow
> >> >> > > > > test suite. So I'd be really interested on finding a way to
> >> >> > streamline
> >> >> > > how
> >> >> > > > > to describe a test executor for both the Airflow test suite
> and
> >> >> > > people's
> >> >> > > > > own DAG testing and make that a first class type of API.
> >> >> > > > >
> >> >> > > > > Laura
> >> >> > > > >
> >> >> > > > > On Fri, May 5, 2017 at 11:46 AM, Sam Elamin <
> >> >> [email protected]
> >> >> > > > > wrote:
> >> >> > > > >
> >> >> > > > > > Hi All
> >> >> > > > > >
> >> >> > > > > > A few people in the Spark community are interested in
> >> writing a
> >> >> > > testing
> >> >> > > > > > library for Airflow. We would love anyone who uses Airflow
> >> >> heavily
> >> >> > in
> >> >> > > > > > production to be involved
> >> >> > > > > >
> >> >> > > > > > At the moment (AFAIK) testing your DAGs is a bit of a pain,
> >> >> > > especially if
> >> >> > > > > > you want to run them in a CI server
> >> >> > > > > >
> >> >> > > > > > Is anyone interested in being involved in the discussion?
> >> >> > > > > >
> >> >> > > > > > Kind Regards
> >> >> > > > > > Sam
> >> >> > > > > >
> >> >> > > > >
> >> >> > >
> >> >> >
> >> >>
> >> >
> >>
> >
>