No strong opinion - but it seems like generic is the easiest for us to code (as we have most of it already via hooks?) and to adopt (it doesn't place a hard requirement on Beam/JVM, even if the JVM would only be needed at runtime. Still.)
This is possibly where Airflow has a core TransferOperator, and providers.apache.beam.operators.BeamTransferOperator? That is, if the "same" Python API could be used for both and it doesn't needlessly complicate things.

-a

On 6 September 2020 16:20:37 BST, Tomasz Urbaszek <[email protected]> wrote:
>Thanks, Ash, for pointing to https://pypi.org/project/smart-open/ This one looks really interesting for blob storage transfers!
>
>As stated in the initial design doc, I don't think we should focus on best performance but rather on versatility. Currently, we have many AtoB operators that do not yield the highest performance but do their work and are widely used.
>
>I would say that we should prepare an AIP that will propose two approaches: generic vs Beam. This will allow us to compare them, and then we can vote on which one is better from the Airflow community perspective.
>
>What do you think?
>
>Tomek
>
>On Sun, Sep 6, 2020 at 2:42 PM Ash Berlin-Taylor <[email protected]> wrote:
>>
>> For background: in the past I had an S3 to S3 transfer using smart-open (since we wanted to split one giant ~300GB file into smaller parts) and it took about 10 minutes, so even "large" uses can work fine in Airflow - no JVM required.
>>
>> -ash
>>
>> On 6 September 2020 12:01:24 BST, Tomasz Urbaszek <[email protected]> wrote:
>> >I think using the direct runner as the default, with the option to specify another setup, is a win-win. However, I have a few doubts about the Beam-based approach:
>> >
>> >1. Dependency management. If I do `pip install apache-airflow[gcp]`, will it install `apache-beam[gcp]`? What if there's a version clash between dependencies?
>> >
>> >2. The initial approach using the `DataSource` concept allowed users to use it in any operator (not only transfer ones). By relying on Beam we lose this.
>> >
>> >3. I'm not a Beam expert, but it seems not to support any data lineage solution?
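As an illustration of the streaming approach mentioned above (splitting one large object into smaller parts without a JVM), here is a minimal sketch using only the standard library. `split_stream` and `open_part` are hypothetical names, not part of Airflow or smart-open, and the part and chunk sizes are arbitrary assumptions:

```python
import io
from typing import BinaryIO, Callable, Iterator


def iter_chunks(src: BinaryIO, chunk_size: int) -> Iterator[bytes]:
    """Yield successive chunks from a binary stream until it is exhausted."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return
        yield chunk


def split_stream(
    src: BinaryIO,
    open_part: Callable[[int], BinaryIO],  # hypothetical factory: part index -> writable stream
    part_size: int,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy src into consecutive parts of at most part_size bytes.

    Only one chunk is held in memory at a time. Returns the number of parts written.
    """
    part_index = 0
    written = 0
    dst = None
    try:
        for chunk in iter_chunks(src, chunk_size):
            view = memoryview(chunk)
            while len(view):
                if dst is None:
                    # Lazily open the next part so no empty part is created.
                    dst = open_part(part_index)
                    written = 0
                piece = view[: part_size - written]
                dst.write(piece)
                written += len(piece)
                view = view[len(piece):]
                if written == part_size:
                    dst.close()
                    dst = None
                    part_index += 1
    finally:
        # Close (and count) a trailing part that is not completely full.
        if dst is not None:
            dst.close()
            part_index += 1
    return part_index
```

With smart-open installed, `open_part` could be something like `lambda i: smart_open.open(f"s3://some-bucket/part-{i}", "wb")` and `src` a stream opened the same way in "rb" mode; the bucket and key names here are made up for the example.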
>> >
>> >On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman <[email protected]> wrote:
>> >>
>> >> I think there are absolutely use-cases for both. I’m totally fine with saying “for small/medium use-cases, we come with an in-house system. However, for larger cases, you’ll require Spark/Flink/S3.” That’s totally in line with PLENTY of use-cases. This would be especially cool when matched with fast-follow, as we could EVEN potentially tie in data locality.
>> >>
>> >> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett <[email protected]> wrote:
>> >> I believe - for not large data - the direct runner is wholly doable, which seems in line with Airflow patterns. I have, and have spoken with several others that have, been productive with that runner.
>> >>
>> >> For much larger transfers, the generic operator could accept parameters for submitting the compute to an actual runner. Though I imagine that (needing a runner) would not be the primary use case for such an operator.
>> >>
>> >> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek <[email protected]> wrote:
>> >>
>> >> > Austin, you are right, Beam covers all (and more) important IOs. However, using Apache Beam to design a generic transfer operator requires Airflow users to have additional resources that will be used as a runner (Spark, Flink, etc.). Unless you suggest using DirectRunner?
>> >> >
>> >> > Can you please tell us more about how exactly you think we can use Beam for those Airflow transfer operators?
>> >> >
>> >> > Best,
>> >> > Tomek
>> >> >
>> >> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett <[email protected]> wrote:
>> >> > >
>> >> > > Are there IOs that would be desired for a generic transfer operator that don't exist in https://beam.apache.org/documentation/io/built-in/ (there is pretty solid coverage)?
>> >> > >
>> >> > > Beam is getting to the point where even Python Beam can leverage the Java IOs, which increases the range of IOs (and performance).
>> >> > >
>> >> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk <[email protected]> wrote:
>> >> > >
>> >> > > > But I believe those two ideas are separate ones, as Tomek explained :)
>> >> > > >
>> >> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk <[email protected]> wrote:
>> >> > > >
>> >> > > > > I love the idea of connecting the projects more closely!
>> >> > > > >
>> >> > > > > I've recently been helping as a consultant to improve the Apache Beam build infrastructure (in many parts based on my Airflow experience and GitHub Actions - they even recently adopted the "cancel" action I developed for Apache Airflow). https://github.com/apache/beam/pull/12729
>> >> > > > >
>> >> > > > > Synergies in Apache projects are cool.
>> >> > > > >
>> >> > > > > J.
>> >> > > > >
>> >> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez <[email protected]> wrote:
>> >> > > > >
>> >> > > > >> Agree on keeping those separate; I just chimed in because I believe it's a great idea. But let's keep @beam and @spark to a separate thread.
>> >> > > > >>
>> >> > > > >> Gerard Casas Saez
>> >> > > > >> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>> >> > > > >>
>> >> > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <[email protected]> wrote:
>> >> > > > >>
>> >> > > > >> > Daniel is right, we have a few Apache Beam committers at Polidea, so we will ask for advice. However, I would be highly in favor of having it, as Gerard suggested, as a @beam decorator. This is something we should put into another AIP together with the mentioned @spark decorator.
>> >> > > > >> >
>> >> > > > >> > Our proposition of transfer operators was mainly to create something Airflow-native that works out of the box and allows us to simplify read/write from external sources. Thus, it requires no external dependency other than the library to communicate with the API. In the case of Beam we need more than that, I think.
>> >> > > > >> >
>> >> > > > >> > Additionally, the ideas of Source and Destination play nicely with data lineage and may bring more interest to this feature of Airflow.
>> >> > > > >> >
>> >> > > > >> > Cheers,
>> >> > > > >> > Tomek
>> >> > > > >> >
>> >> > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik <[email protected]> wrote:
>> >> > > > >> > >
>> >> > > > >> > > Nice. Just a note here: we will need to make sure that those "Source" and "Destination" objects are serializable.
>> >> > > > >> > >
>> >> > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <[email protected]> wrote:
>> >> > > > >> > >
>> >> > > > >> > > > Interesting!
>> >> > > > >> > > > Beam also could potentially allow transfers within Dask/any other system with a Java/Python SDK? I think @jarek and Polidea do a lot of work with Beam as well, so I’d love their thoughts on whether this is a good use-case.
>> >> > > > >> > > >
>> >> > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez <[email protected]> wrote:
>> >> > > > >> > > > I would be highly in favour of having a generic Beam operator, similar to the @spark_task decorator. Something where you can easily define and wrap a Beam pipeline and convert it to an Airflow operator.
>> >> > > > >> > > >
>> >> > > > >> > > > Gerard Casas Saez
>> >> > > > >> > > >
>> >> > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <[email protected]> wrote:
>> >> > > > >> > > >
>> >> > > > >> > > > > Are you guys familiar with Beam <https://beam.apache.org>? Esp. if not doing transforms, it might be rather straightforward to rely on the ecosystem of connectors in that Apache project as the foundation for a generic transfer operator.
>> >> > > > >> > > > >
>> >> > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <[email protected]> wrote:
>> >> > > > >> > > > >
>> >> > > > >> > > > > > +1
>> >> > > > >> > > > > >
>> >> > > > >> > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski <[email protected]> wrote:
>> >> > > > >> > > > > >
>> >> > > > >> > > > > > > Hello all,
>> >> > > > >> > > > > > > since there have been no new comments in the POC doc <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit> for a couple of days, I will proceed with creating an AIP for this feature, if that is ok with everybody.
>> >> > > > >> > > > > > > Best regards,
>> >> > > > >> > > > > > > Kamil
>> >> > > > >> > > > > > >
>> >> > > > >> > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz Urbaszek <[email protected]> wrote:
>> >> > > > >> > > > > > >
>> >> > > > >> > > > > > > > I like the approach, as it introduces another interesting standardization of the operators' interface.
>> >> > > > >> > > > > > > > It would be awesome to hear more opinions :)
>> >> > > > >> > > > > > > >
>> >> > > > >> > > > > > > > Cheers,
>> >> > > > >> > > > > > > > Tomek
>> >> > > > >> > > > > > > >
>> >> > > > >> > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek Potiuk <[email protected]> wrote:
>> >> > > > >> > > > > > > >
>> >> > > > >> > > > > > > > > I like the idea a lot. Similar things have been discussed before, but the proposal is, I think, rather pragmatic and solves a real problem (and it does not seem to be too complex to implement).
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > There is some discussion about it already in the document (please chime in if you are interested), but here are a few points on why I like it:
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > - performance and optimization are not a focus here.
>> >> > > > >> > > > > > > > >   For generic stuff it is usually hard to write an "optimal" solution, but once you admit you are not going to focus on optimisation, you come up with simpler and easier-to-use solutions.
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > - on the other hand, it uses a very "Python'y" approach with Airflow's familiar concepts (connection, transfer) and has the potential of easily plugging into the 100s of hooks we already have, leveraging all the "providers" richness of Airflow.
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > - it aims to make a "quick start" easy: if you have a number of different sources/targets and, as a data scientist, you would like to quickly start transferring data between them, you can do it easily with only basic Python knowledge and a simple DAG structure.
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > - it should be possible to plug it into our new functional approach as well as future lineage discussions, as it makes a connection between sources and targets.
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > - it opens up possibilities of adding simple and flexible data transformation on transfer. Not a replacement for any of the external services that Airflow should use (Airflow is an orchestrator, not a data processing solution), but for the kind of quick-start scenarios where I foresee it being most useful, being able to apply simple data transformations on the fly might be a big plus for a data scientist.
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > Suggestion: Pandas DataFrame as the format of the "data" component
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > Kamil - you should have access now.
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > J.
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil Olszewski <[email protected]> wrote:
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > > Hello all,
>> >> > > > >> > > > > > > > > > at Polidea we have come up with an idea for a generic transfer operator that would be able to transport data between two destinations of various types (file, database, storage, etc.). Please find a short doc with a POC at <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing>, where we can discuss the design initially. Once we come to an initial conclusion I can create an AIP on cWiki - can I ask for permission to do so (my id is 'kamil.olszewski')?
>> >> > > > >> > > > > > > > > > I believe that during the discussion we should definitely aim for this feature to be released only after Airflow 2.0 is out.
>> >> > > > >> > > > > > > > > >
>> >> > > > >> > > > > > > > > > What do you think about this idea? Would you find such an operator helpful in your pipelines? Maybe you already use a similar solution or know packages that could be used to implement it?
>> >> > > > >> > > > > > > > > >
>> >> > > > >> > > > > > > > > > Best regards,
>> >> > > > >> > > > > > > > > > --
>> >> > > > >> > > > > > > > > > Kamil Olszewski
>> >> > > > >> > > > > > > > > > Polidea <https://www.polidea.com> | Software Engineer
>> >> > > > >> > > > > > > > > >
>> >> > > > >> > > > > > > > > > M: +48 503 361 783
>> >> > > > >> > > > > > > > > > E: [email protected]
>> >> > > > >> > > > > > > > > >
>> >> > > > >> > > > > > > > > > Unique Tech
>> >> > > > >> > > > > > > > > > Check out our projects!
>> >> > > > >> > > > > > > > > > <https://www.polidea.com/our-work>
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > --
>> >> > > > >> > > > > > > > > Jarek Potiuk
>> >> > > > >> > > > > > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
>> >> > > > >> > > > > > > > >
>> >> > > > >> > > > > > > > > M: +48 660 796 129
>> >> > > > >> >
>> >> > > > >> > --
>> >> > > > >> > Tomasz Urbaszek
>> >> > > > >> > Polidea | Software Engineer
>> >> > > > >> >
>> >> > > > >> > M: +48 505 628 493
>> >> > > > >> > E: [email protected]
>> >> > > > >> >
>> >> > > > >> > Unique Tech
>> >> > > > >> > Check out our projects!
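To make the "Source"/"Destination" idea from this thread a bit more concrete (including the note that both need to be serializable), here is a minimal sketch. It is not the POC design and uses no Airflow API: `Source`, `Destination`, `CsvSource`, `JsonLinesDestination` and `transfer` are hypothetical names chosen for illustration, and rows are plain dicts rather than a DataFrame.

```python
import csv
import json
from dataclasses import asdict, dataclass
from typing import Dict, Iterable, Iterator, Protocol

Row = Dict[str, str]


class Source(Protocol):
    """Anything that can produce rows (hypothetical interface)."""

    def read(self) -> Iterator[Row]: ...


class Destination(Protocol):
    """Anything that can consume rows (hypothetical interface)."""

    def write(self, rows: Iterable[Row]) -> int: ...


@dataclass(frozen=True)
class CsvSource:
    # Only plain configuration is stored, so the object serializes trivially.
    path: str

    def read(self) -> Iterator[Row]:
        with open(self.path, newline="") as handle:
            yield from csv.DictReader(handle)


@dataclass(frozen=True)
class JsonLinesDestination:
    path: str

    def write(self, rows: Iterable[Row]) -> int:
        count = 0
        with open(self.path, "w") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
                count += 1
        return count


def transfer(source: Source, destination: Destination) -> int:
    """Stream rows from source to destination; returns the number of rows written."""
    return destination.write(source.read())
```

Because each class holds only configuration, `json.dumps(asdict(CsvSource("in.csv")))` round-trips back through `CsvSource(**json.loads(...))`. A Beam-backed variant could in principle expose the same `transfer` entry point, which is roughly the "same Python API for both" question raised at the top of the thread.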
