> 1) The more dags I have in a dags folder, the longer time it takes to parse them all. Taking into account that in my case I have also to parse CWL files, it takes even more time for such a simple operation. So I was wondering is there any common solution to approach this issue. Also, I was thinking if I can use your Plugins mechanism to integrate some additional functionality such as parsing CWL files directly without making any changes in the core of Airflow.
>
As a follow-up to the "political" decision, I would say the best solution is to treat CWL-Airflow as a separate "converter" rather than integrate it closely with Airflow. I would imagine that you have a separate folder with CWL files and a daemon watching that folder, which starts the conversion process whenever any of the CWL files change and creates Python DAG files in Airflow's dags folder. That seems very loosely coupled and relies only on the basic behaviour of Airflow. It can also be easily combined with the Git-sync solution for Kubernetes or another way of synchronising DAGs.

> 2) I'm working on running CWL pipelines in Kubernetes through Airflow and one of the problems that I have to deal with is sharing directories between the PODs. It looks like Kubernetes doesn't provide the direct solution to this problem and mostly relies on the platform where it is installed. I will appreciate if you direct me to the proper discussions/threads where people solve similar problems.
>

There are currently two ways of sharing DAGs: persistent volume claims and git sync. Generally the approach is that you need third-party distributed storage to share the DAGs, and the synchronisation mechanism is not (yet) built into Airflow. There is the AIP-5 Remote DAG Fetcher ( https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher ), where this has been discussed at length, and there is the accompanying discussion thread (shorter than the discussion in the doc): https://lists.apache.org/thread.html/224d1e7d1b11e0b8314075f21b1b81708749f2899f4cce5af295e8a8@%3Cdev.airflow.apache.org%3E . However, I don't think anyone in the community is actively working on AIP-5 currently. I think the consensus in the community is that Airflow solves scheduling but does not solve distribution, so it delegates distributing files to dedicated solutions (and you can choose whichever solution you already have for the task).
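
To make the converter-daemon idea from the answer to question 1 more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the names `render_dag_source`, `sync_once` and `watch` are hypothetical, the "conversion" just writes a stub file where a real converter would call into CWL-Airflow, and it polls file modification times rather than using filesystem notifications.

```python
import os
import time
from pathlib import Path
from typing import List


def render_dag_source(cwl_path: Path) -> str:
    """Hypothetical stand-in for the real CWL-to-DAG conversion."""
    return (
        "# Generated from {name}; do not edit by hand.\n"
        "# A real converter would emit Airflow DAG code here.\n"
        "DAG_ID = {dag_id!r}\n"
    ).format(name=cwl_path.name, dag_id=cwl_path.stem)


def sync_once(cwl_dir: Path, dags_dir: Path) -> List[Path]:
    """Regenerate the DAG file for every CWL file newer than its output.

    Returns the DAG files written during this pass.
    """
    written = []
    for cwl_path in sorted(cwl_dir.glob("*.cwl")):
        dag_path = dags_dir / (cwl_path.stem + ".py")
        if dag_path.exists() and dag_path.stat().st_mtime >= cwl_path.stat().st_mtime:
            continue  # generated file is already up to date
        # Write next to the target and rename, so a half-written .py file
        # is never visible in the dags folder.
        tmp_path = dag_path.with_suffix(".tmp")
        tmp_path.write_text(render_dag_source(cwl_path))
        os.replace(tmp_path, dag_path)
        written.append(dag_path)
    return written


def watch(cwl_dir: Path, dags_dir: Path, interval_seconds: float = 30.0) -> None:
    """Poll forever; a real daemon might react to filesystem events instead."""
    while True:
        sync_once(cwl_dir, dags_dir)
        time.sleep(interval_seconds)
```

Calling `watch(Path("/path/to/cwl"), Path("/path/to/dags"))` (both paths are placeholders) would keep the dags folder in step with the CWL folder; whatever mechanism already distributes the dags folder then picks up the generated files without any change to Airflow itself.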
This is really targeted at "corporate" deployments, where companies usually already have some distributed storage in place. Rather than forcing a single "distribution" solution on them, the assumption is that Airflow will use whatever solution is deployed at that company.

Also, as a next step we plan to get rid of this requirement completely in Airflow 2.0. Provided that we implement full DAG Serialisation, this problem will be gone: all the DAG data will be stored in the database and hopefully no more volume sharing will be needed.

Here you can find a simple description of using PVCs with Airflow on Kubernetes: https://medium.com/@ramandumcs/how-to-run-apache-airflow-on-kubernetes-1cb809a8c7ea . Git Sync is also nice, but it requires a shared Git repo where the DAGs are stored. There are other solutions: the Composer team, for example, uses 'gcsfuse', a user-space synchronisation from a GCS bucket to a local pod volume (they have two containers in a pod, with gcsfuse as a side-car to the Airflow worker, scheduler and UI, sharing a single volume). Then it is a matter of putting the generated DAGs into a GCS bucket (your daemon could do just that). You can use similar solutions for other dedicated "artifact" sharing. For example, we've implemented a similar side-car pod for Nexus, where production DAG files were shared as Nexus artifacts.

> Thanks a lot,
> Michael
>
> On 2019/11/15 10:17:30, Jarek Potiuk <[email protected]> wrote:
> > I am also -1. But I am happy to help with surfacing the CWL integration on - both the new package (together with Oozie-2-airflow and maybe other converters) and having it easily installable as external Package. I will talk to Andrey separately about this so that we do not clutter the devlist.
> >
> > J.
> >
> > On Fri, Nov 15, 2019 at 7:37 AM Maxime Beauchemin <[email protected]> wrote:
> >
> > > After all the exploration of this topic here in this thread, I'm a pretty hard -1 on this one.
> > >
> > > I think CWL and CWL-Airflow are great projects, but they can't rely on the Airflow community to evolve/maintain/package this integration.
> > >
> > > Personally I think that generally and *within reason* (winking at the npm communities ;) that smaller, targeted and loosely coupled packages [and their corresponding smaller repositories with their own set of maintainers] is better than bigger monoliths. Some reasons:
> > > * separation of concerns
> > > * faster, more targeted builds and test suites
> > > * independent release cycles
> > > * clearer ownership
> > > * independent and adapted level of rigor / styling / standards
> > > * more targeted notifications for people watching repos
> > > * ...
> > >
> > > Max
> > >
> > > On Thu, Nov 14, 2019 at 12:33 PM Andrey Kartashov <[email protected]> wrote:
> > >
> > > > > I looked at the
> > > > > https://cwl-airflow.readthedocs.io/en/1.0.18/readme/how_it_works.html#what-s-inside
> > > > > to understand what CWL is and that's where I took the descriptor + job (in Key Concepts).
> > > >
> > > > Oh this is an old one, but even new one probably does not reflect the real picture.
> > > >
> > > > > OK. So as I understand finally the problem you want to solve - "To make Airflow more accessible to people who already use CWL or who will find it easier to write dags in CWL". I still think this does not necessarily have to be solved by donating CWL code to Airflow (see below).
> > > >
> > > > I think there are many ways.
> > > >
> > > > > Ok. So what you basically say is that you think Airflow community has more capacity than CWL community to maintain CWL converter.
> > > >
> > > > My understanding CWL community just developing common standard (CWL) not converters or converter :).
> > > > For me the CWL-Airflow developer definitely Airflow community has far more capacity that me alone :)
> > > >
> > > > > I am not so sure about it (precisely because of the lost opportunities). But maybe a better solution is to ask in the airflow community whether there are people who could join the CWL-airflow converter and increase the community there.
> > > >
> > > > That probably a good start just to check and see the interest
> > > >
> > > > > I would not say for the whole community, but I would not feel comfortable as a community to take responsibility on the converter without prior knowledge and understanding CWL in detail. Especially that it is rather for small group of users (at least initially). But I find CWL as an idea very interesting and maybe there are some people in the community who would love to contribute to your project? Suggestion - maybe just ask - here and in slack - if there is enough interest in contributing to CWL-Airflow, rather than donating the code to Airflow ? Just promote your project in the community and ask for help.
> > > >
> > > > I tried but have not got any feedback :) but I’m not a promoter or seller
> > > >
> > > > > I can see this as the best of both worlds - if you find a few people who would like to help and get familiar with it and they are also part of the Airflow community and we get collective knowledge about it - then eventually it might lead to incorporating it to Airflow itself if our community gets more familiar with CWL. I think this is the best way to achieve the final goal of finally incorporating CWL as part of Airflow.
> > > >
> > > > Works for me
> > > >
> > > > > In the meantime - I am happy to help to make Airflow more "CWL friendly" for the users - both from documentation and Helm chart POV.
> > > >
> > > > Thank you, I appreciate that, how we proceed?
> >
> > --
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > M: +48 660 796 129 <+48660796129>

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129 <+48660796129>
