For background: in the past I ran an S3-to-S3 transfer using smart_open (we wanted to split one giant ~300GB file into smaller parts) and it took about 10 minutes, so even "large" uses can work fine in Airflow - no JVM required.
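
To make that S3 example a bit more concrete, here is a rough sketch of the kind of split I mean. It is not the code we actually ran: `split_stream` and `open_part` are names made up for illustration, and the only assumption is that source and destination are binary file-like objects (which is what `smart_open.open(uri, "rb")` and `smart_open.open(uri, "wb")` return for S3 URIs).

```python
from typing import BinaryIO, Callable


def split_stream(
    src: BinaryIO,
    open_part: Callable[[int], BinaryIO],
    part_size: int,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy `src` into consecutive parts of at most `part_size` bytes.

    `open_part(i)` is a hypothetical callback that returns a writable binary
    file-like object (usable as a context manager) for part number `i`.
    Only `chunk_size` bytes are held in memory at a time.
    Returns the number of parts written.
    """
    if part_size <= 0 or chunk_size <= 0:
        raise ValueError("part_size and chunk_size must be positive")

    part_index = 0
    while True:
        remaining = part_size
        # Read first, so that no empty trailing part is created at EOF.
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:
            break
        with open_part(part_index) as dst:
            while chunk:
                dst.write(chunk)
                remaining -= len(chunk)
                if remaining == 0:
                    break  # this part is full, move on to the next one
                chunk = src.read(min(chunk_size, remaining))
        part_index += 1
    return part_index
```

With smart_open, `src` would be something like `smart_open.open("s3://some-bucket/big-file", "rb")` and `open_part` would open `s3://other-bucket/part-{i}` in `"wb"` mode (bucket and key names here are placeholders). Nothing in the loop is S3-specific, so the same function works against local files for testing.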
-ash

On 6 September 2020 12:01:24 BST, Tomasz Urbaszek <[email protected]> wrote:
> I think using direct runner as default with the option to specify other setup is a win-win. However, there are a few doubts I have about the Beam-based approach:
>
> 1. Dependency management. If I do `pip install apache-airflow[gcp]` will it install `apache-beam[gcp]`? What if there's a version clash between dependencies?
>
> 2. The initial approach using `DataSource` concept allowed users to use it in any operator (not only transfer ones). In case of relying on Beam we are losing this.
>
> 3. I'm not a Beam expert but it seems to not support any data lineage solution?
>
> On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman <[email protected]> wrote:
>> I think there are absolutely use-cases for both. I’m totally fine with saying “for small/medium use-cases, we come with an in-house system. However for larger cases, you’ll require spark/Flink/S3.” That’s totally in line with PLENTY of use-cases. This would be especially cool when matched with fast-follow as we could EVEN potentially tie in data locality.
>>
>> via Newton Mail [https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.15.6&source=email_footer_2]
>> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett <[email protected]> wrote:
>> I believe - for not large data - the direct runner is wholly doable, which seems in line with airflow patterns. I have, and have spoken with several others that have, been productive with that runner.
>>
>> For much larger transfers, the generic operator could accept parameters for submitting the compute to an actual runner. Though, imagining that (needing a runner) would not be the primary use case for such an operator.
>>
>> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek <[email protected]> wrote:
>>> Austin, you are right, Beam covers all (and more) important IOs.
>>> However, using Apache Beam to design a generic transfer operator requires Airflow users to have additional resources that will be used as a runner (Spark, Flink, etc.). Unless you suggest using DirectRunner?
>>>
>>> Can you please tell us more how exactly you think we can use Beam for those Airflow transfer operators?
>>>
>>> Best,
>>> Tomek
>>>
>>> On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett <[email protected]> wrote:
>>>> Are there IOs that would be desired for a generic transfer operator that don't exist in: https://beam.apache.org/documentation/io/built-in/ <- there is pretty solid coverage?
>>>>
>>>> Beam is getting to the point where even python beam can leverage the java IOs, which increases the range of IOs (and performance).
>>>>
>>>> On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk <[email protected]> wrote:
>>>>> But I believe those two ideas are separate ones as Tomek explained :)
>>>>>
>>>>> On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk <[email protected]> wrote:
>>>>>> I love the idea of connecting the projects more closely!
>>>>>>
>>>>>> I've been helping recently as a consultant in improving the Apache Beam build infrastructure (in many parts based on my Airflow experience and GitHub Actions - even recently they adopted the "cancel" action I developed for Apache Airflow). https://github.com/apache/beam/pull/12729
>>>>>>
>>>>>> Synergies in Apache projects are cool.
>>>>>>
>>>>>> J.
>>>>>>
>>>>>> On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez <[email protected]> wrote:
>>>>>>> Agree on keeping those separate, just intervened as I believe it's a great idea. But let's keep @beam and @spark to a separate thread.
>>>>>>>
>>>>>>> Gerard Casas Saez
>>>>>>> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>>>>>>>
>>>>>>> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <[email protected]> wrote:
>>>>>>>> Daniel is right, we have a few Apache Beam committers in Polidea so we will ask for advice. However, I would be highly in favor of having it as Gerard suggested as @beam decorator. This is something we should put into another AIP together with the mentioned @spark decorator.
>>>>>>>>
>>>>>>>> Our proposition of transfer operators was mainly to create something Airflow-native that works out of the box and allows us to simplify read/write from external sources. Thus, it requires no external dependency other than the library to communicate with the API. In the case of Beam we need more than that I think.
>>>>>>>>
>>>>>>>> Additionally, the ideas of Source and Destination play nicely with data lineage and may bring more interest to this feature of Airflow.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Tomek
>>>>>>>>
>>>>>>>> On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik <[email protected]> wrote:
>>>>>>>>> Nice. Just a note here, we will need to make sure that those "Source" and "Destination" are serializable.
>>>>>>>>>
>>>>>>>>> On Tue, Sep 1, 2020, 20:00 Daniel Imberman <[email protected]> wrote:
>>>>>>>>>> Interesting! Beam also could potentially allow transfers within Dask/any other system with a java/python SDK?
>>>>>>>>>> I think @jarek and Polidea do a lot of work with Beam as well so I’d love their thoughts if this is a good use-case.
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez <[email protected]> wrote:
>>>>>>>>>> I would be highly in favour of having a generic Beam operator. Similar to @spark_task decorator. Something where you can easily define and wrap a beam pipeline and convert it to an Airflow operator.
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <[email protected]> wrote:
>>>>>>>>>>> Are you guys familiar with Beam <https://beam.apache.org>? Esp. if not doing transforms, it might be rather straightforward to rely on the ecosystem of connectors in that Apache Project to use as the foundations for a generic transfer operator.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <[email protected]> wrote:
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski <[email protected]> wrote:
>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>> since there have been no new comments shared in the POC doc <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit> for a couple of days, I will proceed with creating an AIP for this feature, if that is ok with everybody.
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Kamil
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 27, 2020 at 10:50 AM Tomasz Urbaszek <[email protected]> wrote:
>>>>>>>>>>>>>> I like the approach as it introduces another interesting operators' interface standardization. It would be awesome to hear more opinions :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Tomek
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 19, 2020 at 8:10 PM Jarek Potiuk <[email protected]> wrote:
>>>>>>>>>>>>>>> I like the idea a lot.
>>>>>>>>>>>>>>> Similar things have been discussed before but the proposal is I think rather pragmatic and solves a real problem (and it does not seem to be too complex to implement)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There is some discussion about it already in the document (please chime in for those interested) but here are a few points why I like it:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - performance and optimization is not a focus for that. For generic stuff it is usually to write "optimal" solution but once you admit you are not going to focus on optimisation, you come up with simpler and easier to use solutions
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - on the other hand - it uses very "Python'y" approach with using Airflow's familiar concepts (connection, transfer) and has the potential of plugging in into 100s of hooks we have already easily - leveraging all the "providers" richness of Airflow.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - it aims to be easy to do "quick start" - if you have a number of different sources/targets and as a data scientist you would like to quickly start transferring data between them - you can do it easily with only basic python knowledge and simple DAG structure.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - it should be possible to plug it in into our new functional approach as well as future lineage discussions as it makes connection between sources and targets
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - it opens up possibilities of adding simple and flexible data transformation on-transfer. Not a replacement for any of the external services that Airflow should use (Airflow is an orchestrator, not data processing solution) but for the kind of quick-start scenarios I foresee it might be most useful, being able to apply simple data transformation on the fly by data scientist might be a big plus.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Suggestion: pandas DataFrame as the format of the "data" component
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kamil - you should have access now.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Aug 18, 2020 at 6:53 PM Kamil Olszewski <[email protected]> wrote:
>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>> in Polidea we have come up with an idea for a generic transfer operator that would be able to transport data between two destinations of various types (file, database, storage, etc.) - please find the link with a short doc with POC <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing> where we can discuss the design initially. Once we come to the initial conclusion I can create an AIP on cWiki - can I ask for permission to do so (my id is 'kamil.olszewski')?
>>>>>>>>>>>>>>>> I believe that during the discussion we should definitely aim for this feature to be released only after Airflow 2.0 is out.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What do you think about this idea? Would you find such an operator helpful in your pipelines? Maybe you already use a similar solution or know packages that could be used to implement it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Kamil Olszewski
>>>>>>>>>>>>>>>> Polidea <https://www.polidea.com> | Software Engineer
>>>>>>>>>>>>>>>> M: +48 503 361 783
>>>>>>>>>>>>>>>> E: [email protected]
>>>>>>>>>>>>>>>> Unique Tech
>>>>>>>>>>>>>>>> Check out our projects!
>>>>>>>>>>>>>>>> <https://www.polidea.com/our-work>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jarek Potiuk
>>>>>>>>>>>>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>>>>>>>>>>>>> M: +48 660 796 129 <+48660796129>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Tomasz Urbaszek
>>>>>>>> Polidea | Software Engineer
>>>>>>>> M: +48 505 628 493
>>>>>>>> E: [email protected]
>>>>>>>> Unique Tech
>>>>>>>> Check out our projects!
