I think using the direct runner as the default, with the option to specify another setup, is a win-win. However, I have a few doubts about the Beam-based approach:
1. Dependency management. If I do `pip install apache-airflow[gcp]`, will it
   install `apache-beam[gcp]`? What if there's a version clash between
   dependencies?
2. The initial approach using the `DataSource` concept allowed users to use it
   in any operator (not only transfer ones). If we rely on Beam, we lose this.
3. I'm not a Beam expert, but it seems not to support any data lineage solution?

On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman <[email protected]> wrote:
>
> I think there are absolutely use-cases for both. I’m totally fine with saying
> “for small/medium use-cases, we come with an in-house system. However for
> larger cases, you’ll require spark/Flink/S3.” That’s totally in line with
> PLENTY of use-cases. This would be especially cool when matched with
> fast-follow as we could EVEN potentially tie in data locality.
>
> via Newton Mail
> [https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.15.6&source=email_footer_2]
> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett <[email protected]>
> wrote:
> I believe - for not large data - the direct runner is wholly doable, which
> seems in line with airflow patterns. I have, and have spoken with several
> others that have, been productive with that runner.
>
> For much larger transfers, the generic operator could accept parameters for
> submitting the compute to an actual runner. Though, imagining that
> (needing a runner) would not be the primary use case for such an operator.
>
>
> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek <[email protected]> wrote:
>
> > Austin, you are right, Beam covers all (and more) important IOs.
> > However, using Apache Beam to design a generic transfer operator
> > requires Airflow users to have additional resources that will be used
> > as a runner (Spark, Flink, etc.). Unless you suggest using
> > DirectRunner?
> >
> > Can you please tell us more how exactly you think we can use Beam for
> > those Airflow transfer operators?
> >
> > Best,
> > Tomek
> >
> >
> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
> > <[email protected]> wrote:
> > >
> > > Are there IOs that would be desired for a generic transfer operator that
> > > don't exist in: https://beam.apache.org/documentation/io/built-in/ <-
> > > there is pretty solid coverage?
> > >
> > > Beam is getting to the point where even python beam can leverage the java
> > > IOs, which increases the range of IOs (and performance).
> > >
> > >
> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk <[email protected]>
> > > wrote:
> > >
> > > > But I believe those two ideas are separate ones as Tomek explained :)
> > > >
> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk <[email protected]>
> > > > wrote:
> > > >
> > > > > I love the idea of connecting the projects more closely!
> > > > >
> > > > > I've been helping recently as a consultant in improving the Apache Beam
> > > > > build infrastructure (in many parts based on my Airflow experience and
> > > > > Github Actions - even recently they adopted the "cancel" action I
> > > > > developed for Apache Airflow). https://github.com/apache/beam/pull/12729
> > > > >
> > > > > Synergies in Apache projects are cool.
> > > > >
> > > > > J.
> > > > >
> > > > >
> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Agree on keeping those separate, just intervened as I believe it's a
> > > > > > great idea. But let's keep @beam and @spark to a separate thread.
> > > > > >
> > > > > >
> > > > > > Gerard Casas Saez
> > > > > > Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Daniel is right we have few Apache Beam committers in Polidea so we
> > > > > > > will ask for advice.
> > > > > > > However, I would be highly in favor of having it
> > > > > > > as Gerard suggested as @beam decorator. This is something we should
> > > > > > > put into another AIP together with the mentioned @spark decorator.
> > > > > > >
> > > > > > > Our proposition of transfer operators was mainly to create something
> > > > > > > Airflow-native that works out of the box and allows us to simplify
> > > > > > > read/write from external sources. Thus, it requires no external
> > > > > > > dependency other than the library to communicate with the API. In the
> > > > > > > case of Beam we need more than that I think.
> > > > > > >
> > > > > > > Additionally, the ideas of Source and Destination play nicely with
> > > > > > > data lineage and may bring more interest to this feature of Airflow.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Tomek
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Nice. Just a note here, we will need to make sure that those "Source"
> > > > > > > > and "Destination" are serializable.
> > > > > > > >
> > > > > > > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <[email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Interesting! Beam also could potentially allow transfers within
> > > > > > > > > Dask/any other system with a java/python SDK? I think @jarek and
> > > > > > > > > Polidea do a lot of work with Beam as well so I’d love their
> > > > > > > > > thoughts if this is a good use-case.
> > > > > > > > >
> > > > > > > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > > I would be highly in favour of having a generic Beam operator. Similar
> > > > > > > > > to @spark_task decorator. Something where you can easily define and
> > > > > > > > > wrap a beam pipeline and convert it to an Airflow operator.
> > > > > > > > >
> > > > > > > > > Gerard Casas Saez
> > > > > > > > > Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Are you guys familiar with Beam <https://beam.apache.org>? Esp. if
> > > > > > > > > > not doing transforms, it might be rather straightforward to rely on
> > > > > > > > > > the ecosystem of connectors in that Apache Project to use as the
> > > > > > > > > > foundations for a generic transfer operator.
> > > > > > > > > >
> > > > > > > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <[email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > +1
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski <[email protected]>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hello all,
> > > > > > > > > > > > since there have been no new comments shared in the POC doc
> > > > > > > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit>
> > > > > > > > > > > > for a couple of days, I will proceed with creating an AIP for
> > > > > > > > > > > > this feature, if that is ok with everybody.
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Kamil
> > > > > > > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz Urbaszek <[email protected]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I like the approach as it introduces another interesting
> > > > > > > > > > > > > operators' interface standardization. It would be awesome to
> > > > > > > > > > > > > hear more opinions :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Tomek
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek Potiuk <[email protected]>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I like the idea a lot.
> > > > > > > > > > > > > > Similar things have been discussed before but the
> > > > > > > > > > > > > > proposal is I think rather pragmatic and solves a real problem
> > > > > > > > > > > > > > (and it does not seem to be too complex to implement)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > There is some discussion about it already in the document (please
> > > > > > > > > > > > > > chime-in for those interested) but here a few points why I like it:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - performance and optimization is not a focus for that. For generic
> > > > > > > > > > > > > > stuff it is usually to write "optimal" solution but once you admit
> > > > > > > > > > > > > > you are not going to focus for optimisation, you come with simpler
> > > > > > > > > > > > > > and easier to use solutions
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - on the other hand - it uses very "Python'y" approach with using
> > > > > > > > > > > > > > Airflow's familiar concepts (connection, transfer) and has the
> > > > > > > > > > > > > > potential of plugging in into 100s of hooks we have already easily -
> > > > > > > > > > > > > > leveraging all the "providers" richness of Airflow.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - it aims to be easy to do "quick start" - if you have a number of
> > > > > > > > > > > > > > different sources/targets and as a data scientist you would like to
> > > > > > > > > > > > > > quickly start transferring data between them - you can do it easily
> > > > > > > > > > > > > > with only basic python knowledge and simple DAG structure.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - it should be possible to plug it in into our new functional
> > > > > > > > > > > > > > approach as well as future lineage discussions as it makes
> > > > > > > > > > > > > > connection between sources and targets
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - it opens up possibilities of adding simple and flexible data
> > > > > > > > > > > > > > transformation on-transfer. Not a replacement for any of the
> > > > > > > > > > > > > > external services that Airflow should use (Airflow is an
> > > > > > > > > > > > > > orchestrator, not data processing solution) but for the kind of
> > > > > > > > > > > > > > quick-start scenarios I foresee it might be most useful, being able
> > > > > > > > > > > > > > to apply simple data transformation on the fly by data scientist
> > > > > > > > > > > > > > might be a big plus.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Suggestion: Pandas DataFrame as the format of the "data" component
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Kamil - you should have access now.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > J.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil Olszewski <[email protected]>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hello all,
> > > > > > > > > > > > > > > in Polidea we have come up with an idea for a generic transfer
> > > > > > > > > > > > > > > operator that would be able to transport data between two
> > > > > > > > > > > > > > > destinations of various types (file, database, storage, etc.) -
> > > > > > > > > > > > > > > please find the link with a short doc with POC
> > > > > > > > > > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing>
> > > > > > > > > > > > > > > where we can discuss the design initially. Once we come to the
> > > > > > > > > > > > > > > initial conclusion I can create an AIP on cWiki - can I ask for
> > > > > > > > > > > > > > > permission to do so (my id is 'kamil.olszewski')? I believe that
> > > > > > > > > > > > > > > during the discussion we should definitely aim for this feature
> > > > > > > > > > > > > > > to be released only after Airflow 2.0 is out.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > What do you think about this idea? Would you find such an
> > > > > > > > > > > > > > > operator helpful in your pipelines? Maybe you already use a
> > > > > > > > > > > > > > > similar solution or know packages that could be used to
> > > > > > > > > > > > > > > implement it?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Kamil Olszewski
> > > > > > > > > > > > > > > Polidea <https://www.polidea.com> | Software Engineer
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > M: +48 503 361 783
> > > > > > > > > > > > > > > E: [email protected]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Unique Tech
> > > > > > > > > > > > > > > Check out our projects! <https://www.polidea.com/our-work>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Jarek Potiuk
> > > > > > > > > > > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > M: +48 660 796 129 <+48660796129>
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Tomasz Urbaszek
> > > > > > > Polidea | Software Engineer
> > > > > > >
> > > > > > > M: +48 505 628 493
> > > > > > > E: [email protected]
> > > > > > >
> > > > > > > Unique Tech
> > > > > > > Check out our projects!
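
To make the Source/Destination idea from this thread concrete, here is a minimal sketch. All names in it (`Source`, `Destination`, `CsvTextSource`, `ListDestination`, `transfer`) are hypothetical and are not taken from the POC document, from Airflow, or from Beam. It uses plain dict rows rather than a Pandas DataFrame only to stay dependency-free, and it runs in-process, which is roughly the "works out of the box" case discussed above rather than the large-transfer case that would need an external runner.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, str]


class Source(ABC):
    """Hypothetical interface: anything that can yield rows."""

    @abstractmethod
    def read(self) -> Iterator[Row]:
        ...


class Destination(ABC):
    """Hypothetical interface: anything that can accept rows."""

    @abstractmethod
    def write(self, rows: Iterable[Row]) -> int:
        ...


class CsvTextSource(Source):
    """Reads rows from in-memory CSV text (stands in for a file or bucket)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def read(self) -> Iterator[Row]:
        # A fresh reader per call, so the source can be read more than once.
        for row in csv.DictReader(io.StringIO(self.text)):
            yield dict(row)


class ListDestination(Destination):
    """Collects rows in a list (stands in for a database table)."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def write(self, rows: Iterable[Row]) -> int:
        before = len(self.rows)
        self.rows.extend(rows)
        return len(self.rows) - before  # number of rows written


def transfer(
    source: Source,
    destination: Destination,
    transform: Optional[Callable[[Row], Row]] = None,
) -> int:
    """Move rows from source to destination, optionally transforming each row."""
    rows: Iterable[Row] = source.read()
    if transform is not None:
        rows = (transform(row) for row in rows)
    return destination.write(rows)


src = CsvTextSource("id,name\n1,ada\n2,grace\n")
dst = ListDestination()
written = transfer(src, dst, transform=lambda r: {**r, "name": r["name"].upper()})
print(written, dst.rows)
```

Because the source and destination are separate objects with a small interface, the same pair could in principle be reused outside a dedicated transfer operator, which is the property point 2 above is concerned about losing.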
