Re: data storage and file reference metadata as a process of the post_execute hook

I'm interested to hear more on this idea, as I can't visualize how (or whether) it would provide multi-backend IO and either a standard or a drop-in serialization of result objects.
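For concreteness, here is roughly the shape I'm trying to picture, assuming the proposed post_execute(result, context) signature lands in core. This is only a sketch: the backend classes and everything about them are made up for illustration, not anything that exists in Airflow today.

    import json
    import pickle

    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults


    class JSONBackend(object):
        """Hypothetical backend for JSON-serializable results."""

        def write(self, task_id, result):
            with open("/tmp/%s.json" % task_id, "w") as f:
                json.dump(result, f)


    class PickleBackend(object):
        """Hypothetical backend for arbitrary Python objects."""

        def write(self, task_id, result):
            with open("/tmp/%s.pkl" % task_id, "wb") as f:
                pickle.dump(result, f)


    class DataflowOperator(BaseOperator):
        """Sketch of an operator that persists its execute() result
        through a pluggable backend, entirely via the hook."""

        @apply_defaults
        def __init__(self, backend=None, *args, **kwargs):
            super(DataflowOperator, self).__init__(*args, **kwargs)
            self.backend = backend or JSONBackend()

        def post_execute(self, result, context):
            # Relies on the proposed signature; today's BaseOperator
            # hook only receives (context), so this won't run as-is
            # on 1.7/1.8.
            self.backend.write(self.task_id, result)

The open question for me is where the drop-in serialization would live: on the operator, on the backend, or negotiated between them per result type.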
I did just comment on the PR <https://github.com/apache/incubator-airflow/pull/2046#issuecomment-278722340> re: Max's comments, since I wasn't totally sure where that conversation should be had, but I can move it over here if we want more visibility.

Re: breaking out repos, I know this has had some support for a while, from eavesdropping on this list and on committer meeting reports, but I want to throw out some of the gotchas we experienced: first from deriving our own plugins (using the Airflow plugin system <https://airflow.incubator.apache.org/plugins.html>), and then, when that proved too unwieldy for us because of the plugin module discovery system, from packaging some of our custom operators and hooks separately (fileflow <https://www.github.com/industrydive/fileflow>). In the latter case, which is closer to what you are proposing, we had problems patching into the core Airflow configuration management system <https://github.com/industrydive/fileflow/pull/6/commits/9374b02444d4d9b69121c5605f67d48e22a031fa>. This could have been just us (or fixed since Airflow 1.7.0, which is the version we are still running), but take it as a word of caution on things to consider or redesign, given what we experienced packaging Airflow add-ons separately.

On Sat, Feb 4, 2017 at 1:45 PM, Jeremiah Lowin <[email protected]> wrote:

> Max made some great points on my dataflow PR and I wanted to continue
> the conversation here to make sure it is visible to all.
>
> While I think my dataflow implementation contains the basic
> requirements for any more complicated extension (but that conversation
> can wait!), I had to implement it by adding some very specific
> "dataflow-only" code to core Operator logic. In retrospect, that gives
> me pause (as, I believe, it did for Max).
>
> After thinking for a few days, what I really want to do is propose a
> very small change to core Airflow: change
> BaseOperator.post_execute(context) to
> BaseOperator.post_execute(result, context). I think the pre_execute and
> post_execute hooks have generally been an afterthought, but with that
> change (which, I think, is reasonable in and of itself) I could
> implement dataflow entirely through those hooks.
>
> That brings me to my next point: if the hook is changed, I could
> happily drop a reworked dataflow implementation into contrib rather
> than core. That would alleviate some of the pressure for Airflow to
> officially decide whether it's the right implementation or not (it
> is! :) ). I feel that would be the optimal situation at the moment.
>
> And that brings me to my next point: the future of "contrib" and the
> Airflow community. Having contrib in the core Airflow repo has some
> advantages:
> - standardized access
> - a centralized repository for PRs
> - at least a style review (if not unit tests) from the committers
> But some big disadvantages as well:
> - very complicated dependency management [presumably, most contrib
>   operators need to add an extras_require entry for their specific
>   dependencies]
> - no sense of ownership, and no easy way to raise issues (due to the
>   friction of opening JIRA tickets vs. GitHub issues)
>
> One thought is to move the contrib directory to its own repo, which
> would keep the advantages but remove the disadvantages from core
> Airflow. Another is to encourage individual Airflow repos
> (Airflow-Docker, Airflow-Dataflow, Airflow-YourExtensionHere) which
> could be installed a la carte. That would leave maintenance up to the
> original author, but could lead to some fracturing in the community as
> discovery becomes difficult.
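One aside on the extras_require point above: the a-la-carte model would move each operator's dependencies out of core Airflow's setup.py and into the add-on's own packaging. A minimal sketch of what one of those standalone repos might ship, with all package names and versions purely illustrative:

    # setup.py for a hypothetical standalone Airflow-Docker add-on
    from setuptools import setup

    setup(
        name="airflow-docker",
        version="0.1.0",
        packages=["airflow_docker"],
        install_requires=[
            "airflow>=1.7.0",    # depend on core instead of living in it
            "docker-py>=1.6.0",  # the operator's own dependency
        ],
    )

That is roughly what we did with fileflow, and the packaging itself was the easy part; hooking back into Airflow's configuration management was where it bit us.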
