On a side note, regarding "Some operators for Google services that are in
Airflow 1.10 have bugs that make it difficult or impossible to use them":
which operators are these? We should definitely fix any bugs that are in
Airflow 1.10.10.

We should definitely release backport packages, but we should also fix the
bugs for those operators in the Airflow 1.10.* code.

Regards,
Kaxil

On Mon, Apr 20, 2020 at 2:01 PM Ash Berlin-Taylor <[email protected]> wrote:

> You can prove just about anything with statistics. ;) There may be just
> two packages under `http`, but it is likely much more frequently used
> than some of the products in the Google suite.
>
> More seriously: I feel very unhappy about the idea of just releasing
> the backport of the Google operators without all the rest because of how
> it looks - Apache projects are meant to be independent of any
> organisation, after all: https://www.apache.org/theapacheway/index.html
>
> Two options spring to mind:
>
> 1. Google donates some time to write system tests for the rest of the
>    operators.
> 2. We release them all with just unit tests.
>
> After all, option 2 is _exactly what we do now for releases_. Either
> we're happy the unit tests cover things already, or we shouldn't be
> making any releases. (Of anything, including the main airflow package.)
>
> On a slightly different subject: have we thought about how/what we are
> going to version these packages?
>
> Are we going for (say) apache-airflow-provider-google==2.0.0, or perhaps
> apache-airflow-provider-google==1.99 to mirror what grub2 did for a while?
>
> And an important question I think we need to answer before we publish
> these: What happens to these packages once Airflow 2.0 is out? (Mostly
> just how do we avoid any installation problems for our users in the
> future. Whether these packages live on or not can be addressed in AIP-8?)
>
> -ash
>
>
> On Apr 20 2020, at 12:29 pm, Jarek Potiuk <[email protected]>
> wrote:
>
> > Great stats Kamil :). I had not realized there is such a big imbalance
> > when it comes to the number of operators :).
> >
> > I fully agree 66%+ sounds like great value.
> > And the stats tell me that
> > maybe we are not that far away from testing everything :)
> >
> > J.
> >
> > On Mon, Apr 20, 2020 at 1:01 PM Kamil Breguła
> > <[email protected]> wrote:
> >>
> >> Hello,
> >>
> >> Thanks, Jarek, for dealing with this topic. It is very important for
> >> our users. Many users want to use new operators, but this is not
> >> possible.
> >>
> >> In my opinion, we should look not only at the package names; their
> >> content is more important. We should base our decisions on hard data.
> >> For this reason, I have prepared some statistics. I counted how many
> >> operators are in each package.
> >>
> >> 298 google
> >> 49 amazon
> >> 27 apache
> >> 13 microsoft
> >> 6 yandex
> >> 6 qubole
> >> 4 mysql
> >> 3 slack
> >> 3 redis
> >> 3 jira
> >> 3 cncf
> >> 2 snowflake
> >> 2 sftp
> >> 2 salesforce
> >> 2 oracle
> >> 2 http
> >> 2 ftp
> >> 2 docker
> >> 2 databricks
> >> 1 vertica
> >> 1 ssh
> >> 1 sqlite
> >> 1 singularity
> >> 1 segment
> >> 1 postgres
> >> 1 papermill
> >> 1 opsgenie
> >> 1 mongo
> >> 1 jenkins
> >> 1 jdbc
> >> 1 imap
> >> 1 grpc
> >> 1 exasol
> >> 1 email
> >> 1 discord
> >> 1 dingding
> >> 1 datadog
> >> 1 celery
> >>
> >> So we have
> >> 298 operators in the google package (66% of total)
> >> 152 operators in other packages
> >>
> >> Here is a list of all operators in Airflow master:
> >> https://pastebin.com/GyARtGRC
> >> To generate the statistics I used the following commands:
> >> cat list-all.txt | grep providers | cut -d "." -f 3 | sort -n | uniq -c | sort -n -r
> >> cat list-all.txt | grep providers | cut -d "." -f 3 | sort -n | uniq -c | sort -n -r | grep google | awk '{sum += $1} END {print sum}'
> >> cat list-all.txt | grep providers | cut -d "." -f 3 | sort -n | uniq -c | sort -n -r | grep -v google | awk '{sum += $1} END {print sum}'
> >>
> >> Now we can ask another question - should we release packages with 66+%
> >> of the operators? If not, what percentage would be appropriate?
> >>
> >> In my opinion, we should release tested packages as soon as possible.
> >> This allows users to become better acquainted with this idea and, in
> >> the long run, encourages more people to test other services as well.
> >>
> >> Some operators for Google services that are in Airflow 1.10 have bugs
> >> that make it difficult or impossible to use them. Many operators have
> >> also never been released in any Airflow 1.10 release. Many users who
> >> want to use Airflow 2.0 operators write to me, and I don't have good
> >> news for them. If I can't solve all the problems, then I would like to
> >> at least solve the problem for a few people rather than stand still.
> >> Users expect that they will be able to use these operators now, so if
> >> there are no technical obstacles, we should do it as soon as possible.
> >>
> >> Best regards,
> >> Kamil
> >>
> >> On Mon, Apr 20, 2020 at 10:06 AM Jarek Potiuk
> >> <[email protected]> wrote:
> >> >
> >> > I would like to focus this week on releasing backport packages. And I
> >> > would like to ask you for opinions on what should be the first "bunch
> >> > of packages" to release.
> >> >
> >> > The current status snapshot is here:
> >> > https://cwiki.apache.org/confluence/display/AIRFLOW/Backported+providers+packages+for+Airflow+1.10.*+series
> >> >
> >> > We have a project in GitHub:
> >> > https://github.com/apache/airflow/projects/2 where I keep the status
> >> > of the packages, and if you drill down to the issues you will see that
> >> > we have very well-defined criteria for each of the packages to be
> >> > "ready-to-release".
> >> >
> >> > I think adding system tests and actual testing is a slow process. We
> >> > completed it for the "google", "Postgres" and "MySQL" packages and I am
> >> > planning to complete it for "HTTP" - possibly a few simpler ones like
> >> > "sftp" and "ssh" - myself this week.
> >> > We also need to re-test it for 1.10.10,
> >> > but since we have semi-automated system tests, it will be easy and I
> >> > might even be able to automate it with GitHub Actions.
> >> >
> >> > However, the two important ones, "Microsoft" and "Amazon", are still
> >> > quite far from completion (or even from starting, for "Microsoft").
> >> >
> >> > I might try to engage more people to do the testing, but I think there
> >> > might also be value in releasing some first packages so that people
> >> > start using them, and maybe this will then be a bigger incentive to do
> >> > more testing and implement system tests for other packages.
> >> >
> >> > I am thinking about two release scenarios:
> >> >
> >> > 1) Google + postgres + mysql + http + ssh + sftp
> >> >
> >> > 2) Same as above, but we wait for "amazon" and "microsoft" to complete
> >> >
> >> > What do you think - should we release the first bunch of operators
> >> > now? I personally think we should do that.
> >> >
> >> > J.
> >> >
> >> > --
> >> > Jarek Potiuk
> >> > Polidea | Principal Software Engineer
> >> >
> >> > M: +48 660 796 129
> >
> > --
> >
> > Jarek Potiuk
> > Polidea | Principal Software Engineer
> >
> > M: +48 660 796 129
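
For readers who want to reproduce the per-package counts quoted in Kamil's message without the shell pipeline, here is a minimal Python sketch of roughly the same idea (keep lines containing "providers", take the third dot-separated field, count). It assumes each line of list-all.txt is a dotted path of the form airflow.providers.<package>...., which is inferred from the `cut -d "." -f 3` call and not confirmed by the thread. The entries in SAMPLE are hypothetical placeholders, not real operator names, and the function names are made up for illustration.

```python
from collections import Counter

# Hypothetical sample lines; the real list-all.txt is not reproduced in the thread.
SAMPLE = [
    "airflow.providers.google.cloud.operators.example_a.ExampleAOperator",
    "airflow.providers.google.cloud.operators.example_b.ExampleBOperator",
    "airflow.providers.amazon.aws.operators.example_c.ExampleCOperator",
    "airflow.operators.example_d.ExampleDOperator",  # no "providers": skipped
]


def count_by_provider(lines):
    """Count entries per provider package (third dot-separated field)."""
    counts = Counter()
    for line in lines:
        if "providers" not in line:  # same filter as `grep providers`
            continue
        fields = line.strip().split(".")
        if len(fields) >= 3:  # same field as `cut -d "." -f 3`
            counts[fields[2]] += 1
    return counts


def share(counts, provider):
    """Return the percentage of all counted entries that belong to provider."""
    total = sum(counts.values())
    return 100.0 * counts[provider] / total if total else 0.0


if __name__ == "__main__":
    # Print packages from most to least frequent, like `sort -n -r`.
    for name, n in count_by_provider(SAMPLE).most_common():
        print(n, name)
    # The totals quoted in the thread: 298 google, 152 in all other packages.
    quoted = Counter({"google": 298, "others": 152})
    print(round(share(quoted, "google"), 1))
```

With the totals quoted in the thread (298 and 152), the last line works out to about 66.2, which matches the "66%" figure under discussion.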
