Thanks, Jarek, for clearing that up. Personally, I would omit the Apache one. We should not fall into the same trap as before, when it was unclear whether something belonged in contrib or not. I would even consider merging software and protocols, as it is not entirely clear what is a protocol and what is not. In the end, everything is a protocol, whether high-level (FTP) or low-level (FS).

Cheers, Fokko

On Tue, 29 Oct 2019 at 12:45, Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Yep. We should definitely discuss the split!
>
> For me these are the criteria:
>
> - fundamentals - all the operators/hooks/sensors that are the "Core" of
>   Airflow (base, dbapi) and allow you to run basic examples, implement the
>   basic logic of Airflow (subdags, branch etc.) + generic operators that
>   are the base for others (like generic transfer/sql)
> - providers - integration with cloud providers (PAAS)
> - apache - integrations with other Apache Software Foundation projects
> - software - integration with other software, proprietary or open-source,
>   that you can install on-premises (or in the cloud)
> - protocols - integration with protocols that can be implemented by any
>   software (SFTP/mail/etc.)
> - services - integration with SAAS solutions
>
> From the above list I only have doubts about the "apache" one - the
> question is whether, as part of the Apache community, we want to somehow
> group those.
>
> J.
>
> On Tue, Oct 29, 2019 at 11:19 AM Bas Harenslak
> <basharens...@godatadriven.com> wrote:
>
> > 1. Sounds good to me
> > 2. Also fine
> > 3. We should have some consensus here. E.g. I’m not sure what groups
> > “fundamentals” and “software” are meant to be :-)
> >
> > While we’re at it: we should really move the BaseOperator out of models.
> > The BaseOperator has no representation in the DB and should be placed
> > together with other scripts where it belongs, i.e. something like
> > airflow.operators.base_operator.
> >
> > Bas
> >
> > On 29 Oct 2019, at 10:43, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > After some consideration and seeing the actual move in practice I wanted
> > to propose a 3rd amendment ;) to AIP-21.
> > I have a few observations from seeing the discussions and observing the
> > actual moving process.
> > I have the following proposals:
> >
> > *1) Between-providers transfer operators should be kept at the "target"
> > rather than the "source"*
> >
> > If we end up splitting operators by groups (AIP-8 and the proposed
> > backporting to Airflow 1.10), I think it makes more sense to keep
> > transfer operators in the "target" package. For example, the "S3 to GCS"
> > operator would go in the "providers/google" package, simply because it is
> > more likely that the individuals working on the pure "GCP" services will
> > also be more interested in getting the data from other cloud providers,
> > and they will likely even have some transfer services that can be used
> > for that purpose (rather than using a worker to transfer the data). In
> > the particular S3 -> GCS case we have GCP's
> > https://cloud.google.com/storage-transfer/docs/overview which allows
> > transferring data from any other cloud provider to GCS. The same applies
> > if we imagine Athena -> BigQuery, for example. At least that's the
> > feeling I have.
> > I can imagine that the kind of "stewardship" over those groups of
> > operators can be somewhat influenced and maybe even performed by those
> > cloud providers themselves. Corresponding hooks of course should be in
> > different "groups".
> >
> > *2) One-side provider-neutral transfer operators should be kept at the
> > "provider" regardless of whether they are target or source.*
> >
> > For example GCS -> SFTP or SFTP -> GCS. There the hook for SFTP should be
> > in the "core" package but both operators should be in "providers/google".
> > The reason is much the same as above - the "stewardship" over all the
> > operators can be done by the "provider" group.
> >
> > *3) Grouping non-provider operators/hooks according to their purpose.*
> >
> > I think it is also the right time to move the other operators/hooks to
> > different groups within core.
> > We already have some reasonable and nice groups proposed in the new
> > documentation by Kamil
> > https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html
> > and it only makes sense to move those now (Fundamentals, ASF: Apache
> > Software Foundation, Azure: Microsoft Azure, AWS: Amazon Web Services,
> > GCP: Google Cloud Platform, Service integrations, Software integrations,
> > Protocol integrations). I think it would make sense to use the same
> > approach in the code. We could have
> > fundamentals/asf/azure(microsoft/azure?)/aws(amazon/aws?)/google/services/software/protocols
> > packages.
> >
> > There will probably be a few exceptions, but we can handle them on a
> > case-by-case basis.
> >
> > J.
> >
> > On Fri, Oct 11, 2019 at 3:11 PM Jarek Potiuk <jarek.pot...@polidea.com>
> > wrote:
> >
> > Hello everyone. I updated AIP-21 and updated the examples.
> >
> > Point D. of AIP-21 is now as follows:
> >
> > *D.* Group operators/sensors/hooks in
> > *airflow/providers/<PROVIDER>*/operators(sensors, hooks).
> >
> > Each provider can define its own internal structure of that package. For
> > example, in the case of the "google" provider the packages will be
> > further grouped by "gcp", "gsuite", "core" sub-packages.
> >
> > In the case of transfer operators where two providers are involved, the
> > transfer operators will be moved to the "source" of the transfer. When
> > there is only one provider as target but the source is a database or
> > another non-provider source, the operator is put in the target provider.
> >
> > Non-cloud provider ones are moved to airflow/operators(sensors/hooks).
> > *Drop the prefix.*
> >
> > Examples:
> >
> > *AWS operator:*
> >
> > - *airflow/contrib/operators/sns_publish_operator.py* becomes
> >   *airflow/providers/aws/operators/sns_publish_operator.py*
> >
> > *Google GCP operator:*
> >
> > - *airflow/contrib/operators/dataproc_operator.py* becomes
> >   *airflow/providers/google/gcp/operators/dataproc_operator.py*
> >
> > *Previously GCP-prefixed operator:*
> >
> > - *airflow/contrib/operators/gcp_bigtable_operator.py* becomes
> >   *airflow/providers/google/gcp/operators/bigtable_operator.py*
> >
> > *Transfer from GCP:*
> >
> > - *airflow/contrib/operators/gcs_to_s3_operator.py* becomes
> >   *airflow/providers/google/gcp/operators/gcs_to_s3_operator.py*
> >
> > *MySQL to GCS:*
> >
> > - *airflow/contrib/operators/mysql_to_gcs_operator.py* becomes
> >   *airflow/providers/google/gcp/operators/mysql_to_gcs_operator.py*
> >
> > *SSH operator:*
> >
> > - *airflow/contrib/operators/ssh_operator.py* becomes
> >   *airflow/operators/ssh_operator.py*
> >
> > On Fri, Oct 4, 2019 at 6:22 PM Jarek Potiuk <jarek.pot...@polidea.com>
> > wrote:
> >
> > Yeah. I think the important point is that the latest doc changes by Kamil
> > index all available operators and hooks nicely and make them easy to
> > find.
> >
> > That also includes (as of today) automated CI checking that newly added
> > operators and hooks are added to the documentation:
> > https://github.com/apache/airflow/commit/104a151d6a19b1ba1281cb00c66a2c3409e1bb13
> >
> > J.
> >
> > On Fri, Oct 4, 2019 at 5:21 PM Chris Palmer <ch...@crpalmer.com> wrote:
> >
> > It's not obvious to me why an S3ToMsSQLOperator in the aws package is
> > "silly". Why do you say it made sense to create a MsSqlFromS3Operator?
> >
> > Basically all of these operators could be thought of as "move data from A
> > to B" or "move data to B from A".
> > I think what feels natural to each individual will depend on what their
> > frame of reference is, and where their main focus is. If you are largely
> > focused on MsSql then I can understand that it's natural to think "What
> > MsSql operators are there?" and to not see S3ToMsSqlOperator as one of
> > those MsSql operators. That's exactly the point I made with my earlier
> > response; I was so focused on BigQuery that I didn't think to look under
> > the Cloud Storage documentation for the
> > GoogleCloudStorageToBigQueryOperator.
> >
> > I think it is too hard to draw a very distinct line between what is just
> > "storage" and what is more. There are going to be fuzzy edge cases, so
> > picking a single convention is going to be much less hassle in my view.
> > As long as that convention is well documented and the documentation is
> > improved so that it's easier to find all operators that relate to
> > BigQuery or MsSql etc. in one place (as is being done by Kamil) then that
> > is the best we can do.
> >
> > Chris
> >
> > On Fri, Oct 4, 2019 at 10:55 AM Daniel Standish <dpstand...@gmail.com>
> > wrote:
> >
> > One case popped up for us recently, where it made sense to make a
> > MsSql*From*S3Operator.
> >
> > I think using "source" makes sense in general, but in this case calling
> > this a S3ToMsSqlOperator and putting it under AWS seems silly, even
> > though you could say s3 is "source" here.
> >
> > I think in most of these cases we say "let's use source" because source
> > is where the actual work is done and destination is just storage.
> >
> > Does a guideline saying "ignore storage" or "storage is secondary in
> > object location" make sense?
> >
> > On Fri, Oct 4, 2019 at 6:42 AM Jarek Potiuk <jarek.pot...@polidea.com>
> > wrote:
> >
> > It looks like we have general consensus about putting transfer operators
> > into the "source provider" package.
> > That's great for me as well.
> >
> > Since I will be updating AIP-21 to reflect the "google" vs. "gcp" case, I
> > will also update it to add this decision.
> >
> > If no-one objects (Lazy Consensus
> > <https://community.apache.org/committers/lazyConsensus.html>) by Monday,
> > 7th of October, 3.20 CEST, we will update AIP-21 with the information
> > that transfer operators should be placed in the "source" provider module.
> >
> > J.
> >
> > On Tue, Sep 24, 2019 at 1:34 PM Kamil Breguła <kamil.breg...@polidea.com>
> > wrote:
> >
> > On Mon, Sep 23, 2019 at 7:42 PM Chris Palmer <ch...@crpalmer.com> wrote:
> >
> > On Mon, Sep 23, 2019 at 1:22 PM Kamil Breguła <kamil.breg...@polidea.com>
> > wrote:
> >
> > On Mon, Sep 23, 2019 at 7:04 PM Chris Palmer <ch...@crpalmer.com> wrote:
> >
> > Is there a reason why we can't use symlinks to have copies of the files
> > show up in both subpackages? So that `gcs_to_s3.py` would be under both
> > `aws/operators/` and `gcp/operators`. I could imagine there may be
> > technical reasons why this is a bad idea, but just thought I would ask.
> >
> > Symlinks are not supported by git.
> >
> > Why do you say that? This blog post
> > <https://www.mokacoding.com/blog/symliks-in-git/> details how you can use
> > them, and the caveats with regard to needing relative links, not
> > absolute ones. The example repo he links to at the end includes a symlink
> > which worked fine for me when I cloned it. But maybe not relevant given
> > the below:
> >
> > We still have to check if python packages can have links, but I'm afraid
> > of this mechanism. This is not popular and may cause unexpected
> > consequences.
> >
> > Likewise, someone who spends 99% of their time working in AWS and using
> > all the operators in that subpackage might not think to look in the GCP
> > package the first time they need a GCS to S3 operator.
> > I'm admittedly terrible at documentation, but if duplicating the files
> > via symlinks isn't an option, then is there an easy way we could
> > duplicate the documentation for those operators so they are easily
> > findable in both doc sections?
> >
> > Recently, I updated the documentation:
> > https://airflow.readthedocs.io/en/latest/integration.html
> > We have a list of all integrations in AWS, Azure, GCP. If an operator
> > concerns two cloud providers, it is repeated in both places. That's good
> > for documentation. The DRY rule is only valid for source code.
> > I am working on documentation for other operators.
> > My work is part of this ticket:
> > https://issues.apache.org/jira/browse/AIRFLOW-5431
> >
> > This updated documentation looks great, definitely heading in a direction
> > that makes it easier and addresses my concerns. (Although it took me a
> > while to realize those tables can be scrolled horizontally!)
> >
> > I'm working on a redesign of the documentation theme.
> > It's part of AIP-11:
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-11+Create+a+Landing+Page+for+Apache+Airflow
> > We are currently at the stage of collecting comments from the first
> > phase - we sent materials to the community, but also conducted tests
> > with real users:
> > https://lists.apache.org/thread.html/6fa1cdceb97ed17752978a8d4202bf1ff1a86c6b50bbc9d09f694166@%3Cdev.airflow.apache.org%3E
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
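
For readers following the thread, the renaming rules from point D of AIP-21 quoted above (group by provider, drop the redundant prefix, place transfer operators by source or target) can be illustrated with a small Python sketch. Everything named here is hypothetical and invented for illustration: the `PROVIDER_OF` table, the `REDUNDANT_PREFIXES` tuple and the `new_operator_path` function are not part of Airflow, and the table only covers the services that appear in the quoted examples. The `prefer` argument reflects the open question in the thread of whether a two-provider transfer operator belongs with the "source" or the "target".

```python
# Hypothetical lookup table (an assumption for this sketch):
# service keyword -> provider sub-package under airflow/providers/.
PROVIDER_OF = {
    "sns": "aws",
    "s3": "aws",
    "dataproc": "google/gcp",
    "bigtable": "google/gcp",
    "gcs": "google/gcp",
}

# Prefixes that become redundant once a module lives in a provider package.
REDUNDANT_PREFIXES = ("gcp_", "aws_")

SUFFIX = "_operator.py"


def new_operator_path(old_path: str, prefer: str = "source") -> str:
    """Map a contrib operator path to its proposed new location.

    Assumes the file name ends with "_operator.py". `prefer` decides which
    side of an "<a>_to_<b>" transfer wins when both sides are providers.
    """
    filename = old_path.rsplit("/", 1)[-1]

    # "Drop the prefix": gcp_bigtable_operator.py -> bigtable_operator.py
    for prefix in REDUNDANT_PREFIXES:
        if filename.startswith(prefix):
            filename = filename[len(prefix):]
            break

    stem = filename[: -len(SUFFIX)]
    if "_to_" in stem:
        source, target = stem.split("_to_", 1)
        order = (source, target) if prefer == "source" else (target, source)
        # Use the preferred side if it is a provider, otherwise the other side.
        provider = next(
            (PROVIDER_OF[name] for name in order if name in PROVIDER_OF), None
        )
    else:
        provider = PROVIDER_OF.get(stem.split("_")[0])

    if provider is None:
        # Non-provider operators stay directly under airflow/operators/.
        return f"airflow/operators/{filename}"
    return f"airflow/providers/{provider}/operators/{filename}"
```

With `prefer="source"` this reproduces the six example moves listed under point D; calling it with `prefer="target"` shows how the "target" amendment would instead place `gcs_to_s3_operator.py` in the aws package.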