Yep. We should definitely discuss the split!

For me these are the criteria:

   - fundamentals - all the operators/hooks/sensors that are the "core" of
   Airflow (base, dbapi), that allow you to run basic examples, implement
   basic Airflow logic (subdags, branching etc.), plus the generic operators
   that serve as a base for others (like the generic transfer/SQL ones)
   - providers - integrations with cloud providers (PaaS)
   - apache - integrations with other Apache Software Foundation projects
   - software - integrations with other software, proprietary or
   open-source, that you can install on-premises (or in the cloud)
   - protocols - integrations with protocols that can be implemented by any
   software (SFTP/mail/etc.)
   - services - integrations with SaaS solutions

From the above list I only have doubts about the "apache" one - the question
is whether, as part of the Apache community, we want to group those projects
separately.
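
To make the split more concrete, here is a minimal sketch of what import
paths could look like under such a layout (the module names below are purely
illustrative, nothing is decided yet):

    # hypothetical paths following the groups above - not final
    from airflow.operators.fundamentals.python_operator import PythonOperator
    from airflow.operators.apache.hive_operator import HiveOperator
    from airflow.operators.software.docker_operator import DockerOperator
    from airflow.operators.protocols.sftp_operator import SFTPOperator
    from airflow.operators.services.slack_operator import SlackAPIOperator
    from airflow.providers.google.gcp.operators.bigquery import BigQueryOperator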

J.


On Tue, Oct 29, 2019 at 11:19 AM Bas Harenslak <
basharens...@godatadriven.com> wrote:

>   1.  Sounds good to me
>   2.  Also fine
>   3.  We should have some consensus here. E.g. I’m not sure what groups
> “fundamentals” and “software” are meant to be :-)
>
> While we’re at it: we should really move BaseOperator out of models.
> BaseOperator has no representation in the DB and should be placed together
> with the other operators where it belongs, i.e. something like
> airflow.operators.base_operator.
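>
> Something like this, just as a sketch of the proposed path (the target
> module name is of course up for discussion):
>
>     # today (Airflow 1.10.x):
>     from airflow.models import BaseOperator
>
>     # proposed - illustrative name only:
>     from airflow.operators.base_operator import BaseOperator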
>
> Bas
>
> On 29 Oct 2019, at 10:43, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
>
> After some consideration, and after seeing the actual move in practice, I
> wanted to propose a 3rd amendment ;) to AIP-21. Based on the discussions
> and on observing the moving process, I have the following proposals:
>
> *1) Between-provider transfer operators should be kept at the "target"
> rather than the "source"*
>
> If we end up splitting operators by groups (AIP-8 and the proposed
> backporting to Airflow 1.10), I think it makes more sense to keep transfer
> operators in the "target" package - for example, the "S3 to GCS" operator
> in the "providers/google" package. Simply put, the people working on the
> pure "GCP" services are also the ones more likely to be interested in
> getting data in from other cloud providers, and they will likely even have
> some transfer services that can be used for that purpose (rather than
> using a worker to transfer the data). In the particular S3 -> GCS case we
> have GCP's https://cloud.google.com/storage-transfer/docs/overview, which
> allows transferring data from any other cloud provider to GCS. The same
> applies if we imagine, for example, Athena -> BigQuery. At least that's
> the feeling I have. I can imagine that this kind of "stewardship" over
> those groups of operators can be somewhat influenced, and maybe even
> performed, by the cloud providers themselves. The corresponding hooks, of
> course, should stay in their respective "groups".
>
> 2) *One-sided provider-neutral transfer operators should be kept at the
> "provider", regardless of whether it is the target or the source.*
>
> For example GCS -> SFTP or SFTP -> GCS. The SFTP hook should stay in the
> "core" package, but both operators should be in "providers/google". The
> reason is much the same as above - the "stewardship" over all those
> operators can be done by the "provider" group.
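>
> Again purely as a sketch (module names are not final), that would mean:
>
>     # both directions live under the Google provider (hypothetical paths):
>     from airflow.providers.google.gcp.operators.gcs_to_sftp import GCSToSFTPOperator
>     from airflow.providers.google.gcp.operators.sftp_to_gcs import SFTPToGCSOperator
>     # while the SFTP hook stays in the provider-neutral "core" package:
>     from airflow.hooks.sftp_hook import SFTPHook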
>
> *3) Grouping non-provider operators/hooks according to their purpose.*
>
> I think it is also the right time to move the other operators/hooks into
> different groups within core. We already have some reasonable and nice
> groups proposed in the new documentation by Kamil
> (https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html),
> and it only makes sense to move the code the same way now (Fundamentals,
> ASF: Apache Software Foundation, Azure: Microsoft Azure, AWS: Amazon Web
> Services, GCP: Google Cloud Platform, Service integrations, Software
> integrations, Protocol integrations). I think it would make sense to use
> the same approach in the code: we could have
> fundamentals/asf/azure(microsoft/azure?)/aws(amazon/aws?)/google/services/software/protocols
> packages.
>
> There will probably be a few exceptions, but we can handle those on a
> case-by-case basis.
>
> J.
>
> On Fri, Oct 11, 2019 at 3:11 PM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
> Hello everyone. I updated AIP-21 and the examples.
>
>
> Point D. of AIP-21 is now as follows:
>
>
>
> *D.* Group operators/sensors/hooks in
> *airflow/providers/<PROVIDER>*/operators (and sensors, hooks).
>
> Each provider can define its own internal structure of that package. For
> example, in the case of the "google" provider, the packages will be further
> grouped into "gcp", "gsuite" and "core" sub-packages.
>
> In the case of transfer operators where two providers are involved, the
> transfer operators will be moved to the "source" of the transfer. When
> there is only one provider, as the target, and the source is a database or
> another non-provider source, the operator is put into the target provider.
>
> Non-cloud-provider ones are moved to airflow/operators (sensors, hooks).
> *Drop the prefix.*
>
> Examples:
>
> AWS operator:
>
>   - airflow/contrib/operators/sns_publish_operator.py
>     becomes airflow/providers/aws/operators/sns_publish_operator.py
>
> Google GCP operator:
>
>   - airflow/contrib/operators/dataproc_operator.py
>     becomes airflow/providers/google/gcp/operators/dataproc_operator.py
>
> Previously GCP-prefixed operator:
>
>   - airflow/contrib/operators/gcp_bigtable_operator.py
>     becomes airflow/providers/google/gcp/operators/bigtable_operator.py
>
> Transfer from GCP:
>
>   - airflow/contrib/operators/gcs_to_s3_operator.py
>     becomes airflow/providers/google/gcp/operators/gcs_to_s3_operator.py
>
> MySQL to GCS:
>
>   - airflow/contrib/operators/mysql_to_gcs_operator.py
>     becomes airflow/providers/google/gcp/operators/mysql_to_gcs_operator.py
>
> SSH operator:
>
>   - airflow/contrib/operators/ssh_operator.py
>     becomes airflow/operators/ssh_operator.py
>
>
> On Fri, Oct 4, 2019 at 6:22 PM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
> Yeah. I think the important point is that the latest doc changes by Kamil
> index all available operators and hooks nicely and make them easy to find.
>
> That also includes (as of today) an automated CI check that newly added
> operators and hooks are also added to the documentation:
>
> https://github.com/apache/airflow/commit/104a151d6a19b1ba1281cb00c66a2c3409e1bb13
>
> J.
>
> On Fri, Oct 4, 2019 at 5:21 PM Chris Palmer <ch...@crpalmer.com> wrote:
>
> It's not obvious to me why an S3ToMsSQLOperator in the aws package is
> "silly". Why do you say it made sense to create a MsSqlFromS3Operator?
>
> Basically all of these operators could be thought of as "move data from A
> to B" or "move data to B from A". I think what feels natural to each
> individual will depend on what their frame of reference is, and where their
> main focus is. If you are largely focused on MsSql then I can understand
> that it's natural to think "What MsSql operators are there?" and to not see
> S3ToMsSqlOperator as one of those MsSql operators. That's exactly the point
> I made with my earlier response; I was so focused on BigQuery that I didn't
> think to look under the Cloud Storage documentation for the
> GoogleCloudStorageToBigQueryOperator.
>
> I think it is too hard to draw a very distinct line between what is just
> "storage" and what is more. There are going to be fuzzy edge cases, so
> picking a single convention is going to be much less hassle in my view. As
> long as that convention is well documented, and the documentation is
> improved so that it's easier to find all operators that relate to BigQuery
> or MsSql etc. in one place (as is being done by Kamil), then that is the
> best we can do.
>
> Chris
>
>
>
> On Fri, Oct 4, 2019 at 10:55 AM Daniel Standish <dpstand...@gmail.com>
> wrote:
>
> One case popped up for us recently, where it made sense to make an
> MsSql*From*S3Operator.
>
> I think using "source" makes sense in general, but in this case calling
> this an S3ToMsSqlOperator and putting it under AWS seems silly, even though
> you could say S3 is the "source" here.
>
> I think in most of these cases we say "let's use source" because
> source is
> where the actual work is done and destination is just storage.
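>
> In our MsSqlFromS3 case it is the other way around - the destination is
> where the work happens. Roughly something like this (a simplified sketch,
> not our actual code):
>
>     # simplified sketch - error handling, templating etc. omitted
>     from airflow.hooks.S3_hook import S3Hook
>     from airflow.hooks.mssql_hook import MsSqlHook
>     from airflow.models import BaseOperator
>     from airflow.utils.decorators import apply_defaults
>
>     class MsSqlFromS3Operator(BaseOperator):
>         """Reads a CSV file from S3 and inserts its rows into an MsSql table."""
>
>         @apply_defaults
>         def __init__(self, s3_bucket, s3_key, mssql_table,
>                      aws_conn_id='aws_default', mssql_conn_id='mssql_default',
>                      *args, **kwargs):
>             super(MsSqlFromS3Operator, self).__init__(*args, **kwargs)
>             self.s3_bucket = s3_bucket
>             self.s3_key = s3_key
>             self.mssql_table = mssql_table
>             self.aws_conn_id = aws_conn_id
>             self.mssql_conn_id = mssql_conn_id
>
>         def execute(self, context):
>             # S3 is only "storage" here - we just read the file...
>             content = S3Hook(aws_conn_id=self.aws_conn_id).read_key(
>                 self.s3_key, bucket_name=self.s3_bucket)
>             # naive CSV split, good enough for the sketch
>             rows = [line.split(',') for line in content.splitlines() if line]
>             # ...while the actual work (the load) happens on the MsSql side.
>             MsSqlHook(mssql_conn_id=self.mssql_conn_id).insert_rows(
>                 table=self.mssql_table, rows=rows)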
>
> Does a guideline saying "ignore storage" or "storage is secondary in
> object
> location" make sense?
>
>
>
> On Fri, Oct 4, 2019 at 6:42 AM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
> It looks like we have general consensus about putting transfer operators
> into the "source provider" package.
> That's great for me as well.
>
> Since I will be updating AIP-21 to reflect the "google" vs. "gcp" case, I
> will also update it to add this decision.
>
> If no one objects (Lazy Consensus
> <https://community.apache.org/committers/lazyConsensus.html>) until
> Monday, 7th of October, 3.20 CEST, we will update AIP-21 with the
> information that transfer operators should be placed in the "source"
> provider module.
>
> J.
>
> On Tue, Sep 24, 2019 at 1:34 PM Kamil Breguła <kamil.breg...@polidea.com>
> wrote:
>
> On Mon, Sep 23, 2019 at 7:42 PM Chris Palmer <ch...@crpalmer.com>
> wrote:
>
> On Mon, Sep 23, 2019 at 1:22 PM Kamil Breguła <kamil.breg...@polidea.com>
> wrote:
>
> On Mon, Sep 23, 2019 at 7:04 PM Chris Palmer <ch...@crpalmer.com> wrote:
>
> Is there a reason why we can't use symlinks to have copies of the files
> show up in both subpackages? So that `gcs_to_s3.py` would be under both
> `aws/operators/` and `gcp/operators`. I could imagine there may be
> technical reasons why this is a bad idea, but just thought I would ask.
>
> Symlinks are not supported by git.
>
>
> Why do you say that? This blog post
> <https://www.mokacoding.com/blog/symliks-in-git/> details how you can use
> them, and the caveats with regard to needing relative links rather than
> absolute ones. The example repo he links to at the end includes a symlink
> which worked fine for me when I cloned it. But maybe not relevant given the
> below:
>
> We still have to check whether Python packages can have symlinked modules,
> but I'm afraid of this mechanism. It is not a popular approach and may
> cause unexpected consequences.
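>
> If we do want to check, a minimal local experiment might look like the
> sketch below (it assumes a POSIX filesystem and a relative symlink; this is
> just an illustration, not something from the Airflow code base):
>
>     import importlib
>     import os
>     import sys
>     import tempfile
>
>     # build two tiny packages, with pkg_b/mod.py being a relative symlink
>     # to pkg_a/mod.py
>     tmp = tempfile.mkdtemp()
>     for pkg in ("pkg_a", "pkg_b"):
>         os.makedirs(os.path.join(tmp, pkg))
>         open(os.path.join(tmp, pkg, "__init__.py"), "w").close()
>     with open(os.path.join(tmp, "pkg_a", "mod.py"), "w") as f:
>         f.write("VALUE = 42\n")
>     os.symlink(os.path.join("..", "pkg_a", "mod.py"),
>                os.path.join(tmp, "pkg_b", "mod.py"))
>
>     # both import paths resolve to the same module source
>     sys.path.insert(0, tmp)
>     print(importlib.import_module("pkg_a.mod").VALUE)  # 42
>     print(importlib.import_module("pkg_b.mod").VALUE)  # 42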
>
>
> Likewise, someone who spends 99% of their time working in AWS and using all
> the operators in that subpackage might not think to look in the GCP package
> the first time they need a GCS to S3 operator. I'm admittedly terrible at
> documentation, but if duplicating the files via symlinks isn't an option,
> then is there an easy way we could duplicate the documentation for those
> operators so they are easily findable in both doc sections?
>
>
> Recently, I updated the documentation:
> https://airflow.readthedocs.io/en/latest/integration.html
> We have a list of all integrations for AWS, Azure and GCP. If an operator
> concerns two cloud providers, it is repeated in both places. That's good
> for documentation - the DRY rule is only valid for source code.
> I am working on documentation for the other operators.
> My work is part of this ticket:
> https://issues.apache.org/jira/browse/AIRFLOW-5431
>
>
> This updated documentation looks great, definitely heading in a direction
> that makes it easier and addresses my concerns. (Although it took me a
> while to realize those tables can be scrolled horizontally!)
>
> I'm working on a redesign of the documentation theme. It's part of AIP-11:
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-11+Create+a+Landing+Page+for+Apache+Airflow
>
> We are currently at the stage of collecting comments from the first phase -
> we sent materials to the community, but also conducted tests with real
> users:
>
> https://lists.apache.org/thread.html/6fa1cdceb97ed17752978a8d4202bf1ff1a86c6b50bbc9d09f694166@%3Cdev.airflow.apache.org%3E

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129
