Re: data storage and file reference metadata as a process of the post_execute hook

I'm interested to hear more on this idea, as I can't visualize how (or whether) it would provide multi-backend IO and either a standard or a drop-in serialization of result objects.
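For concreteness, here is roughly the shape I'm trying to picture, assuming the proposed post_execute(result, context) signature lands in core. This is only a sketch: the backend classes and everything about them are made up for illustration, not anything that exists in Airflow today.

    import json
    import pickle

    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults


    class JSONBackend(object):
        """Hypothetical backend for JSON-serializable results."""

        def write(self, task_id, result):
            with open("/tmp/%s.json" % task_id, "w") as f:
                json.dump(result, f)


    class PickleBackend(object):
        """Hypothetical backend for arbitrary Python objects."""

        def write(self, task_id, result):
            with open("/tmp/%s.pkl" % task_id, "wb") as f:
                pickle.dump(result, f)


    class DataflowOperator(BaseOperator):
        """Sketch of an operator that persists its execute() result
        through a pluggable backend, entirely via the hook."""

        @apply_defaults
        def __init__(self, backend=None, *args, **kwargs):
            super(DataflowOperator, self).__init__(*args, **kwargs)
            self.backend = backend or JSONBackend()

        def post_execute(self, result, context):
            # Relies on the proposed signature; today's BaseOperator
            # hook only receives (context), so this won't run as-is
            # on 1.7/1.8.
            self.backend.write(self.task_id, result)

The open question for me is where the drop-in serialization would live: on the operator, on the backend, or negotiated between them per result type.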
I did just comment on the PR <https://github.com/apache/incubator-airflow/pull/2046#issuecomment-278722340> re: Max's comments, since I wasn't totally sure where that conversation should be had, but I can move it over here if we want more visibility.

Re: breaking out repos, I know this has had some support for a while, from eavesdropping on this list and on committer meeting reports, but I want to throw out some of the gotchas we experienced: first from deriving our own plugins (using the Airflow plugin system <https://airflow.incubator.apache.org/plugins.html>), and then, when that proved too unwieldy for us because of the plugin module discovery system, from packaging some of our custom operators and hooks separately (fileflow <https://www.github.com/industrydive/fileflow>). In the latter case, which is closer to what you are proposing, we had problems patching into the core Airflow configuration management system <https://github.com/industrydive/fileflow/pull/6/commits/9374b02444d4d9b69121c5605f67d48e22a031fa>. This could have been just us (or fixed since Airflow 1.7.0, which is the version we are still running), but take it as a word of caution on things to consider or redesign, given what we experienced packaging Airflow add-ons separately.

On Sat, Feb 4, 2017 at 1:45 PM, Jeremiah Lowin <[email protected]> wrote:

> Max made some great points on my dataflow PR and I wanted to continue
> the conversation here to make sure it is visible to all.
>
> While I think my dataflow implementation contains the basic
> requirements for any more complicated extension (but that conversation
> can wait!), I had to implement it by adding some very specific
> "dataflow-only" code to core Operator logic. In retrospect, that gives
> me pause (as, I believe, it did for Max).
>
> After thinking for a few days, what I really want to do is propose a
> very small change to core Airflow: change
> BaseOperator.post_execute(context) to
> BaseOperator.post_execute(result, context). I think the pre_execute and
> post_execute hooks have generally been an afterthought, but with that
> change (which, I think, is reasonable in and of itself) I could
> implement dataflow entirely through those hooks.
>
> That brings me to my next point: if the hook is changed, I could
> happily drop a reworked dataflow implementation into contrib rather
> than core. That would alleviate some of the pressure for Airflow to
> officially decide whether it's the right implementation or not (it
> is! :) ). I feel that would be the optimal situation at the moment.
>
> And that brings me to my next point: the future of "contrib" and the
> Airflow community. Having contrib in the core Airflow repo has some
> advantages:
> - standardized access
> - a centralized repository for PRs
> - at least a style review (if not unit tests) from the committers
> But some big disadvantages as well:
> - very complicated dependency management [presumably, most contrib
>   operators need to add an extras_require entry for their specific
>   dependencies]
> - no sense of ownership, and no easy way to raise issues (due to the
>   friction of opening JIRA tickets vs. GitHub issues)
>
> One thought is to move the contrib directory to its own repo, which
> would keep the advantages but remove the disadvantages from core
> Airflow. Another is to encourage individual Airflow repos
> (Airflow-Docker, Airflow-Dataflow, Airflow-YourExtensionHere) which
> could be installed a la carte. That would leave maintenance up to the
> original author, but could lead to some fracturing in the community as
> discovery becomes difficult.
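One aside on the extras_require point above: the a-la-carte model would move each operator's dependencies out of core Airflow's setup.py and into the add-on's own packaging. A minimal sketch of what one of those standalone repos might ship, with all package names and versions purely illustrative:

    # setup.py for a hypothetical standalone Airflow-Docker add-on
    from setuptools import setup

    setup(
        name="airflow-docker",
        version="0.1.0",
        packages=["airflow_docker"],
        install_requires=[
            "airflow>=1.7.0",    # depend on core instead of living in it
            "docker-py>=1.6.0",  # the operator's own dependency
        ],
    )

That is roughly what we did with fileflow, and the packaging itself was the easy part; hooking back into Airflow's configuration management was where it bit us.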
