Thanks Jarek for clearing that up.

Personally I would omit the Apache one. We should not fall into the same
trap as before, when it was unclear whether something belonged in contrib
or not. I would even consider merging software and protocols, as it is not
entirely clear what counts as a protocol. In the end, everything is a
protocol, whether a high-level one (FTP) or a low-level one (FS).

Cheers, Fokko

On Tue, 29 Oct 2019 at 12:45, Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Yep. We should definitely discuss the split!
>
> For me these are the criteria:
>
>    - fundamentals - all the operators/hooks/sensors that are the
>    "Core" of Airflow (base, dbapi) and allow you to run basic examples,
>    implement the basic logic of Airflow (subdags, branch etc.), plus generic
>    operators that serve as a base for others (like generic transfer/sql)
>    - providers - integrations with cloud providers (PaaS)
>    - apache - integrations with other Apache Software Foundation projects
>    - software - integrations with other software, proprietary or
>    open-source, that you can install on-premises (or in the cloud)
>    - protocols - integrations with protocols that can be implemented by any
>    software (SFTP/mail/etc.)
>    - services - integrations with SaaS solutions
>
> From the above list I only have doubts about the "apache" one. The question is
> whether, as part of the Apache community, we want to group those somehow.
>
> J.
>
>
> On Tue, Oct 29, 2019 at 11:19 AM Bas Harenslak <
> basharens...@godatadriven.com> wrote:
>
> >   1.  Sounds good to me
> >   2.  Also fine
> >   3.  We should have some consensus here. E.g. I’m not sure what groups
> > “fundamentals” and “software” are meant to be :-)
> >
> > While we’re at it: we should really move the BaseOperator out of models.
> > The BaseOperator has no representation in the DB and should be placed
> > together with other scripts where it belongs, i.e. something like
> > airflow.operators.base_operator.
> >
> > Bas
> >
> > On 29 Oct 2019, at 10:43, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > After some consideration and seeing the actual move in practice, I wanted to
> > propose a 3rd amendment ;) to AIP-21.
> > I have a few observations from following the discussions and observing the
> > actual moving process. I have the following proposals:
> >
> > *1) Between-providers transfer operators should be kept at the "target"
> > rather than "source"*
> >
> > If we end up splitting operators by groups (AIP-8 and the proposed
> > backporting to Airflow 1.10), I think it makes more sense to keep transfer
> > operators in the "target" package. For example, the "S3 to GCS" operator
> > would go in the "providers/google" package, simply because it is more likely
> > that the individuals working on the pure "GCP" services will also be
> > more interested in getting data from other cloud providers, and they will
> > likely even have some transfer services that can be used for that
> > purpose (rather than using a worker to transfer the data). In the particular
> > S3 -> GCS case we have GCP's
> > https://cloud.google.com/storage-transfer/docs/overview which allows
> > transferring data from any other cloud provider to GCS. The same applies if
> > we imagine Athena -> BigQuery, for example. At least that's the feeling I have.
> > I can imagine that the "stewardship" over those groups of operators
> > can be somewhat influenced and maybe even performed by those cloud
> > providers themselves. The corresponding hooks should of course be in
> > different "groups".
> >
> > *2) One-side provider-neutral transfer operators should be kept at the
> > "provider" regardless of whether it is the target or the source.*
> >
> > For example GCS -> SFTP or SFTP -> GCS. There the hook for SFTP should be in
> > the "core" package but both operators should be in "providers/google". The
> > reason is much the same as above: the "stewardship" over all the
> > operators can be done by the "provider" group.
> >
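A minimal sketch of how proposals 1) and 2) above could be read as a placement rule. The function name `transfer_operator_package`, the `CLOUD_PROVIDERS` set, and the short endpoint labels are hypothetical and only serve the illustration; they are not part of AIP-21 or of Airflow.

```python
# Hypothetical set of endpoints that count as cloud "providers".
CLOUD_PROVIDERS = {"google", "aws", "azure"}


def transfer_operator_package(source, target):
    """Return the package for a transfer operator under proposals 1) and 2).

    Returns None when neither side is a provider, because those two
    proposals do not cover that case.
    """
    if target in CLOUD_PROVIDERS:
        # Proposal 1: provider -> provider goes to the target.
        # Proposal 2: neutral -> provider goes to the (only) provider.
        return "providers/" + target
    if source in CLOUD_PROVIDERS:
        # Proposal 2: provider -> neutral still goes to the provider.
        return "providers/" + source
    return None


if __name__ == "__main__":
    print(transfer_operator_package("aws", "google"))   # S3 -> GCS
    print(transfer_operator_package("google", "sftp"))  # GCS -> SFTP
    print(transfer_operator_package("sftp", "google"))  # SFTP -> GCS
```

All three example calls land in "providers/google", which matches the S3 -> GCS, GCS -> SFTP and SFTP -> GCS cases described in the proposals.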
> > *3) Grouping non-provider operators/hooks according to their purpose.*
> >
> > I think it is also the right time to move the other operators/hooks to
> > different groups within core. We already have some reasonable and nice
> > groups proposed in the new documentation by Kamil
> > https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html and
> > it only makes sense to move those now (Fundamentals, ASF: Apache Software
> > Foundation, Azure: Microsoft Azure, AWS: Amazon Web Services, GCP: Google
> > Cloud Platform, Service integrations, Software integrations, Protocol
> > integrations). I think it would make sense to use the same approach in the
> > code. We could have
> > fundamentals/asf/azure(microsoft/azure?)/aws(amazon/aws?)/google/services/software/protocols
> > packages.
> >
> > There will probably be a few exceptions but we can handle them on a
> > case-by-case basis.
> >
> > J.
> >
> > On Fri, Oct 11, 2019 at 3:11 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > Hello everyone. I updated AIP-21 and updated examples.
> >
> > Point D. of AIP-21 is now as follows:
> >
> > *D.* Group operators/sensors/hooks in
> > *airflow/providers/<PROVIDER>*/operators(sensors, hooks).
> >
> > Each provider can define its own internal structure of that package. For
> > example, in the case of the "google" provider the packages will be further
> > grouped into "gcp", "gsuite", "core" sub-packages.
> >
> > In the case of transfer operators where two providers are involved, the
> > transfer operators will be moved to the "source" of the transfer. When there
> > is only one provider as the target and the source is a database or another
> > non-provider source, the operator is put in the target provider.
> >
> > Non-cloud provider ones are moved to airflow/operators(sensors/hooks).
> > *Drop the prefix.*
> >
> > Examples:
> >
> > *AWS operator:*
> >
> >   - *airflow/contrib/operators/sns_publish_operator.py*
> >   becomes *airflow/providers/aws/operators/sns_publish_operator.py*
> >
> > *Google GCP operator:*
> >
> >   - *airflow/contrib/operators/dataproc_operator.py*
> >   becomes *airflow/providers/google/gcp/operators/dataproc_operator.py*
> >
> > *Previously GCP-prefixed operator:*
> >
> >   - *airflow/contrib/operators/gcp_bigtable_operator.py*
> >   becomes *airflow/providers/google/gcp/operators/bigtable_operator.py*
> >
> > *Transfer from GCP:*
> >
> >   - *airflow/contrib/operators/gcs_to_s3_operator.py*
> >   becomes *airflow/providers/google/gcp/operators/gcs_to_s3_operator.py*
> >
> > *MySQL to GCS:*
> >
> >   - *airflow/contrib/operators/mysql_to_gcs_operator.py*
> >   becomes *airflow/providers/google/gcp/operators/mysql_to_gcs_operator.py*
> >
> > *SSH operator:*
> >
> >   - *airflow/contrib/operators/ssh_operator.py*
> >   becomes *airflow/operators/ssh_operator.py*
> >
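The renames in the examples above can be sketched as a small path-mapping helper. `new_path` and its `provider_dir` argument are hypothetical names used for illustration, not an Airflow API. The provider directory is passed in by hand, because point D does not derive it from the file name.

```python
import posixpath

# (old path, provider directory or None, new path), copied from the examples.
EXAMPLES = [
    ("airflow/contrib/operators/sns_publish_operator.py", "aws",
     "airflow/providers/aws/operators/sns_publish_operator.py"),
    ("airflow/contrib/operators/dataproc_operator.py", "google/gcp",
     "airflow/providers/google/gcp/operators/dataproc_operator.py"),
    ("airflow/contrib/operators/gcp_bigtable_operator.py", "google/gcp",
     "airflow/providers/google/gcp/operators/bigtable_operator.py"),
    ("airflow/contrib/operators/gcs_to_s3_operator.py", "google/gcp",
     "airflow/providers/google/gcp/operators/gcs_to_s3_operator.py"),
    ("airflow/contrib/operators/mysql_to_gcs_operator.py", "google/gcp",
     "airflow/providers/google/gcp/operators/mysql_to_gcs_operator.py"),
    ("airflow/contrib/operators/ssh_operator.py", None,
     "airflow/operators/ssh_operator.py"),
]


def new_path(old_path, provider_dir=None):
    """Map a contrib operator path to its location under point D."""
    filename = posixpath.basename(old_path)
    if provider_dir is None:
        # Non-cloud-provider operators move to airflow/operators.
        return posixpath.join("airflow", "operators", filename)
    # "Drop the prefix": gcp_bigtable_operator.py -> bigtable_operator.py
    prefix = posixpath.basename(provider_dir) + "_"
    if filename.startswith(prefix):
        filename = filename[len(prefix):]
    return posixpath.join(
        "airflow", "providers", provider_dir, "operators", filename
    )


if __name__ == "__main__":
    for old, provider, expected in EXAMPLES:
        print(old, "->", new_path(old, provider))
```

The helper only strips a prefix that equals the last component of the provider directory, so `gcs_to_s3_operator.py` keeps its name even though it lives under `google/gcp`.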
> >
> > On Fri, Oct 4, 2019 at 6:22 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > Yeah. I think the important point is that the latest doc changes by Kamil
> > index all available operators and hooks nicely and make them easy to find.
> >
> > That also includes (as of today) an automated CI check that newly added
> > operators and hooks are included in the documentation:
> >
> > https://github.com/apache/airflow/commit/104a151d6a19b1ba1281cb00c66a2c3409e1bb13
> >
> > J.
> >
> > On Fri, Oct 4, 2019 at 5:21 PM Chris Palmer <ch...@crpalmer.com> wrote:
> >
> > It's not obvious to me why an S3ToMsSqlOperator in the aws package is
> > "silly". Why do you say it made sense to create a MsSqlFromS3Operator?
> >
> > Basically all of these operators could be thought of as "move data from A
> > to B" or "move data to B from A". I think what feels natural to each
> > individual will depend on what their frame of reference is, and where their
> > main focus is. If you are largely focused on MsSql then I can understand
> > that it's natural to think "What MsSql operators are there?" and to
> > not see S3ToMsSqlOperator as one of those MsSql operators. That's exactly
> > the point I made with my earlier response; I was so focused on BigQuery
> > that I didn't think to look under the Cloud Storage documentation for the
> > GoogleCloudStorageToBigQueryOperator.
> >
> > I think it is too hard to draw a very distinct line between what is just
> > "storage" and what is more. There are going to be fuzzy edge cases, so
> > picking a single convention is going to be much less hassle in my view. As
> > long as that convention is well documented and the documentation is
> > improved so that it's easier to find all operators that relate to BigQuery
> > or MsSql etc. in one place (as is being done by Kamil), then that is the
> > best we can do.
> >
> > Chris
> >
> >
> >
> > On Fri, Oct 4, 2019 at 10:55 AM Daniel Standish <dpstand...@gmail.com>
> > wrote:
> >
> > One case popped up for us recently, where it made sense to make a
> > MsSql*From*S3Operator.
> >
> > I think using "source" makes sense in general, but in this case calling
> > this an S3ToMsSqlOperator and putting it under AWS seems silly, even though
> > you could say S3 is the "source" here.
> >
> > I think in most of these cases we say "let's use source" because source is
> > where the actual work is done and destination is just storage.
> >
> > Does a guideline saying "ignore storage" or "storage is secondary in object
> > location" make sense?
> >
> >
> >
> > On Fri, Oct 4, 2019 at 6:42 AM Jarek Potiuk <jarek.pot...@polidea.com>
> > wrote:
> >
> > It looks like we have general consensus about putting transfer operators
> > into the "source provider" package.
> > That's great for me as well.
> >
> > Since I will be updating AIP-21 to reflect the "google" vs. "gcp" case, I
> > will also update it to add this decision.
> >
> > If no one objects (Lazy Consensus
> > <https://community.apache.org/committers/lazyConsensus.html>) by
> > Monday 7th of October, 3.20 CEST, we will update AIP-21 with the information
> > that transfer operators should be placed in the "source" provider module.
> >
> > J.
> >
> > On Tue, Sep 24, 2019 at 1:34 PM Kamil Breguła <kamil.breg...@polidea.com> wrote:
> >
> > On Mon, Sep 23, 2019 at 7:42 PM Chris Palmer <ch...@crpalmer.com> wrote:
> >
> > On Mon, Sep 23, 2019 at 1:22 PM Kamil Breguła <kamil.breg...@polidea.com> wrote:
> >
> > On Mon, Sep 23, 2019 at 7:04 PM Chris Palmer <ch...@crpalmer.com> wrote:
> >
> > Is there a reason why we can't use symlinks to have copies of the files
> > show up in both subpackages? So that `gcs_to_s3.py` would be under both
> > `aws/operators/` and `gcp/operators`. I could imagine there may be
> > technical reasons why this is a bad idea, but just thought I would ask.
> >
> > Symlinks are not supported by git.
> >
> >
> > Why do you say that? This blog post
> > <https://www.mokacoding.com/blog/symliks-in-git/> details how you can use
> > them, and the caveats with regard to needing relative links, not absolute
> > ones. The example repo he links to at the end includes a symlink which
> > worked fine for me when I cloned it. But maybe not relevant given the below:
> >
> > We still have to check if python packages can have links, but I'm
> > afraid of this mechanism. This is not popular and may cause unexpected
> > consequences.
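A small experiment sketch for the "can python packages have links" question, assuming a POSIX filesystem where `os.symlink` is permitted. The package names `demo_gcp_ops` and `demo_aws_ops` and the class `GcsToS3Operator` are made up for this test. It builds two throwaway packages, symlinks one module into the other with a relative link, and imports the file under both names.

```python
import importlib
import os
import sys
import tempfile


def symlinked_module_shares_class():
    """Import one file under two package names, the second via a symlink."""
    root = tempfile.mkdtemp()
    for pkg in ("demo_gcp_ops", "demo_aws_ops"):
        os.makedirs(os.path.join(root, pkg))
        # Empty __init__.py so each directory is a regular package.
        open(os.path.join(root, pkg, "__init__.py"), "w").close()

    # The real module lives in the first package only.
    with open(os.path.join(root, "demo_gcp_ops", "gcs_to_s3.py"), "w") as f:
        f.write("class GcsToS3Operator:\n    pass\n")

    # Relative link, resolved from the directory that contains the link.
    os.symlink(
        os.path.join("..", "demo_gcp_ops", "gcs_to_s3.py"),
        os.path.join(root, "demo_aws_ops", "gcs_to_s3.py"),
    )

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    real = importlib.import_module("demo_gcp_ops.gcs_to_s3")
    linked = importlib.import_module("demo_aws_ops.gcs_to_s3")
    # True only if both imports ended up with the very same class object.
    return real.GcsToS3Operator is linked.GcsToS3Operator


if __name__ == "__main__":
    print(symlinked_module_shares_class())
```

Both imports succeed, but Python caches modules by their dotted name, so the file is executed once per name and the two class objects are distinct (the function returns False). An `isinstance` check against the class imported from one package would therefore fail for an object created through the other, which is one concrete form the "unexpected consequences" could take.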
> >
> >
> > Likewise, someone who spends 99% of their time working in AWS and using all
> > the operators in that subpackage might not think to look in the GCP
> > package the first time they need a GCS to S3 operator. I'm admittedly
> > terrible at documentation, but if duplicating the files via symlinks isn't
> > an option, then is there an easy way we could duplicate the documentation
> > for those operators so they are easily findable in both doc sections?
> >
> >
> > Recently, I updated the documentation:
> > https://airflow.readthedocs.io/en/latest/integration.html
> > We have a list of all integrations for AWS, Azure, and GCP. If an operator
> > concerns two cloud providers, it appears in both places. That's good for
> > documentation. The DRY rule only applies to source code.
> > I am working on documentation for other operators.
> > My work is part of this ticket:
> > https://issues.apache.org/jira/browse/AIRFLOW-5431
> >
> >
> > This updated documentation looks great, definitely heading in a direction
> > that makes it easier and addresses my concerns. (Although it took me a
> > while to realize those tables can be scrolled horizontally!).
> >
> > I'm working on a redesign of the documentation theme. It's part of AIP-11
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-11+Create+a+Landing+Page+for+Apache+Airflow
> > We are currently at the stage of collecting comments from the first
> > phase - we sent materials to the community, but also conducted tests
> > with real users
> > https://lists.apache.org/thread.html/6fa1cdceb97ed17752978a8d4202bf1ff1a86c6b50bbc9d09f694166@%3Cdev.airflow.apache.org%3E
> >
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
>
>
