Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-05 Thread Daniel Imberman
Worth noting that git has the ability to cherry-pick only specific directories. 
If we keep all of helm + tests in one directory, docker + tests in another, and 
core + tests in a third directory it would be pretty simple to automate 
splitting them.

https://stackoverflow.com/questions/19821749/git-cherry-pick-or-merge-specific-directory-from-another-branch
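
To make the "one directory per component" idea concrete, here is a minimal sketch of how a bot might decide which component each changed path belongs to. The directory names (`chart`, `docker`, `airflow`) and the function name are assumptions for illustration only, not the actual repository layout.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical layout: one top-level directory per component (assumption).
COMPONENT_DIRS = {"chart": "helm", "docker": "docker", "airflow": "core"}


def group_paths_by_component(changed_paths):
    """Group repository-relative paths by the component that owns them."""
    groups = defaultdict(list)
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        top_level = parts[0] if parts else ""
        # Anything outside the known directories is reported separately.
        component = COMPONENT_DIRS.get(top_level, "unassigned")
        groups[component].append(path)
    return dict(groups)


if __name__ == "__main__":
    commit_paths = ["chart/values.yaml", "docker/Dockerfile", "README.md"]
    for component, paths in sorted(group_paths_by_component(commit_paths).items()):
        print(component, paths)
```

A split is only simple to automate if a commit's paths map cleanly to one component; paths that land in "unassigned" are the ones a bot would have to skip or flag.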

On Sun, Jul 5, 2020 at 9:57 AM, Daniel Imberman  
wrote:
I can’t agree with this enough :). I think writing a few bots to separate out 
sections will be MUCH easier in the long run than maintaining multiple repos. 
It will also spare new contributors the difficulty of setting up a proper dev
environment.
On Sun, Jul 5, 2020 at 9:53 AM, Jarek Potiuk  wrote:
Yeah. I think that the "monorepo" is the only way for now - until (or if)
we reach the size (and maturity) that different teams take care of the
different projects. Which might not even happen.

But I would still love to try the separate repos for publishing/releasing
(maybe not immediately, but it is a nice concept). I think it should be
rather easy (I will try it on my own repo first). Also, I think it has
another advantage - those separate repos might actually run other kinds of
tests - for example, to test whether "everything" needed to release is in
that repo (for example, to build the helm chart) and whether there is no
accidental use of stuff from outside of those dirs.

I already thought about how to do it - it should be rather easy. Of course
- like most of the time - there is a ready-to-use git command doing it for
us. We simply need a bot running for that repo executing a variant of this
command:
https://docs.github.com/en/github/using-git/splitting-a-subfolder-out-into-a-new-repository
(it should only take the commits made since the commit merged last time).
So the level of automation here is rather minimal.

And if we have those repos and at some point we decide to split
after all - we will already have repos with all the history as a starting
point.

J.



Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-05 Thread Jarek Potiuk
Yeah. I think that the "monorepo" is the only way for now - until (or if)
we reach the size (and maturity) that different teams take care of the
different projects. Which might not even happen.

But I would still love to try the separate repos for publishing/releasing
(maybe not immediately, but it is a nice concept). I think it should be
rather easy (I will try it on my own repo first). Also, I think it has
another advantage - those separate repos might actually run other kinds of
tests - for example, to test whether "everything" needed to release is in
that repo (for example, to build the helm chart) and whether there is no
accidental use of stuff from outside of those dirs.

I already thought about how to do it - it should be rather easy. Of course
- like most of the time - there is a ready-to-use git command doing it for
us. We simply need a bot running for that repo executing a variant of this
command:
https://docs.github.com/en/github/using-git/splitting-a-subfolder-out-into-a-new-repository
(it should only take the commits made since the commit merged last time).
So the level of automation here is rather minimal.
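
As a rough illustration of what such a bot could run, here is a minimal sketch that only builds the command line for a subdirectory split. It uses `git filter-branch` with its `--prune-empty` and `--subdirectory-filter` options, which is one ready-made way to do this; the function name, the `chart` directory and the `master` default are assumptions, and the "only commits since the last sync" part is deliberately left out.

```python
import shlex


def build_split_command(subdirectory, branch="master"):
    """Return the argv for rewriting `branch` so it contains only `subdirectory`."""
    if not subdirectory or subdirectory.startswith("/"):
        raise ValueError("subdirectory must be a non-empty repository-relative path")
    return [
        "git",
        "filter-branch",
        "--prune-empty",          # drop commits that do not touch the subdirectory
        "--subdirectory-filter",  # make the subdirectory the new repository root
        subdirectory.rstrip("/"),
        branch,
    ]


if __name__ == "__main__":
    # "chart" is a hypothetical directory name used only for this example.
    print(shlex.join(build_split_command("chart")))
```

Because this rewrites history in place, a bot would presumably run it in a throwaway clone and push the result to the split-out repo, never in the main working copy.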

And if we have those repos and at some point we decide to split
after all - we will already have repos with all the history as a starting
point.

J.



Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-05 Thread Kaxil Naik
Hmm.. I agree the git-sync would have been a difficult one to solve if we
had separate repositories.

Well, in that case, the mono repo approach (like we have now) indeed makes
more sense.

Regarding the Kubernetes approach, I feel the ones in staging (
https://github.com/kubernetes/kubernetes/tree/master/staging) are part of
the actual product itself, but in our case we are discussing the Helm
chart and Dockerfile, which are not actually part of the product. And we
will need a good deal of automation if we go down that route.
I think the plain mono-repo approach is better than that one.

Regards,
Kaxil



Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-05 Thread Jarek Potiuk
And one more perfect illustration of what I am talking about.

A very good thing just happened. I was running the PR while writing the
email (long time as you might imagine) and the new K8S tests with 1.10.11
just failed. https://github.com/apache/airflow/pull/9663

If we had released the helm chart before, we would have had a clear (small)
incompatibility here. And by seeing the test fail we can decide
what to do:

1) fix it differently
2) document it as a breaking Helm change, "1.10.12+ image" and make the test
work in both cases
3) revert ...

But at least we have an early warning that something is wrong. This is the
clear value of running the tests at every commit.

J.


Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-05 Thread Jarek Potiuk
I just have another example of a case where splitting the repos and using
only "released versions" across repositories might be a complete overkill
when it comes to development complexity.

We have this change from Aneesh:
https://github.com/apache/airflow/pull/9371 about
adding a git-sync option to the helm chart.

That's a new feature, but we would like to test both 1.10 and the master
version of KubernetesExecutor with that. It should work for both of them -
there is no coupling/dependency in the "airflow" code for it.

However, there is a strong coupling in the tests. We have the
"kubernetes_tests" running tests using all three: chart, production docker,
and Airflow. Those tests will likely have to be adapted to work with the
new git-sync option. They were disabled previously as we had problems with
them before the helm chart was used for tests, but we can turn them back on
now that git-sync is added to the helm chart. Those tests are part of the
airflow test suite and we discussed with Daniel that they should stay there
- those tests are importing airflow code, and they are using the latest
example dags, which are also in the airflow code.

So we have two ways we can develop this:
A) monorepo (current)
B) separate repos.

Just as a reminder - the goal is that our change is tested against:

1) Released Airflow version (say 1.10.11).
2) Development airflow version (master - soon possibly development)
3) Development docker image built with either "development" or "1.10.11"
(we can release the Docker image for 1.10.11 independently from the current
development HEAD). The docker image is supposed to work with any version of
airflow.

In the case of A) Monorepo we have all that as a given.

I just sent this really small PR that should do the job:
https://github.com/apache/airflow/pull/9663. It takes the latest
"development" docker image and the "development" chart, and bakes in the
latest "example dags" from the "development" branch. The image uses either
the "development" or the released (from PyPI) "1.10.11" Airflow version -
and the PR runs the "development" tests against it. This is exactly what we
want. If we add new features to the helm chart, the Kubernetes tests will
have to be updated to include that - and this will happen in the airflow
"development" branch. The REALLY good thing about it - since we are running
those tests in the CI build of the airflow development branch - we prevent
anyone from making breaking changes. It is a given that both the
"development" version of airflow and the "1.10.11" version of airflow will
continue to work with the image and chart.
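
The combinations described above can be written down as a small test matrix. This is only a sketch of the idea, not how the CI in the PR is actually configured; the variable and function names are made up for illustration, and only the version labels come from the text.

```python
from itertools import product

# Labels taken from the description above; the structure itself is an assumption.
AIRFLOW_SOURCES = ["development", "1.10.11"]  # Airflow installed in the image
CHART_VERSIONS = ["development"]              # chart always comes from the branch
TEST_SUITES = ["development"]                 # tests always come from the branch


def build_test_matrix():
    """Return one entry per (airflow, chart, tests) combination to run in CI."""
    return [
        {"airflow": airflow, "chart": chart, "tests": tests}
        for airflow, chart, tests in product(AIRFLOW_SOURCES, CHART_VERSIONS, TEST_SUITES)
    ]


if __name__ == "__main__":
    for entry in build_test_matrix():
        print(entry)
```

In a monorepo every entry except the released Airflow version resolves to the same commit, which is why the matrix stays this small.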


In the case of B) where we split the repos:

We have to decide where to keep the "kubernetes_tests" - should they be in
"Airflow" or in "Helm"? They are testing BOTH so we can choose either way.
Together with Daniel we plan to expand those tests to cover all the
different options we have in the Chart - testing all of it - Kubernetes
Executor, Celery Executor running on Kubernetes, MySQL (once we add it),
etc. So we want to make sure we have a matrix of tests covering a
number of deployment options. Those tests do not exist yet, and they will
have to be written. In principle - they can be moved to the "Helm"
repository. That's where they conceptually belong. However - there is a
huge value in running the tests in airflow "development" - the value is
that no-one will be able to break the "development" airflow, because those
tests are run with every PR. I think we have no choice but to always run
those tests in development. Otherwise, people maintaining the helm chart
will have to fix the problems introduced by people changing Airflow code. I
think it is a pretty bad idea to allow that. So if we move those tests to
the Helm Chart repo we have to figure out how to run those "kubernetes"
tests in CI for every build. This is quite possible - by getting the latest
master from the helm chart and running the build, but it has several
problems:

1) The test code for CI will have to continue to stay in Airflow (to run CI
builds) - this means that we already have coupling, and some code related to
the execution of the helm tests has to be in Airflow anyway.

2) Bigger problem. What happens if as an "Airflow developer" you DO introduce
a change that breaks the helm chart? You will see a CI error and you
will not know what to do. Do you involve people who maintain the helm chart
and wait for them? I think not. You should be able to reproduce the problem
locally and fix it yourself (maybe with the help of others - but you should
be able to fix your own commit). We would have to teach people how to bring
in the docker image and helm chart code from the latest version and run the
tests. We could do it automatically with Breeze (similarly to what we do
with other integrations - where we bring in Kerberos, Mongo, and a multitude
of others) without them even knowing it, but this might be fairly complex
and prone to errors. In Monorepo - we already have a simple way of
reproducing and running the tests locally

Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-03 Thread Ash Berlin-Taylor
Monorepo FTW.  

Yes, it gets a little bit messier around release, but the approach of
automatically extracting out the commits (or parts of commits) to a
separate repo for releasing may be the solution to that problem.


-ash

On Jul 3 2020, at 7:51 pm, Kaxil Naik  wrote:

> I will take a look at the Kubernetes approach and get back to this thread.
>  
>> We had a discussion with Daniel yesterday and we are both concerned about
>> all the overhead for people like us who work on all three "entities" at
>> the same time. Even just explaining how to work with Pull Requests and in
>> what sequence those PRs would have to be opened and merged in case of
>> changes that are spanning across several "entities" - was a challenge. I
>> was unable to clearly explain the sequence and way of reviewing/merging
>> the PRs that will have to be made if we have submodules. This is a bad
>> sign as I was using submodules in the past and know how they work but I
>> was unable to explain it clearly.
>  
>  
> We don't even need submodules tbh. We can just use a Bash script that
> pulls a pinned Helm Chart version.
> We only need the Helm chart to run integration tests for k8s (at least for
> now). We already use tons of Bash scripts.
>  
> One of the important benefits of separation is that changes in one
> component should not need changes in another component, at least not
> immediately.
>  
> Changes in the Helm chart and Dockerfile should never need changes in
> Airflow. Changes in Airflow should only ever need a change in the
> Dockerfile and Helm Chart after a new version is released.
>  
> I just had a talk with Daniel too and still didn't find a good enough
> reason to have them in the same repo.
>  
> I will definitely look at the Kubernetes approach (maybe it is better) and
> get back to this thread. But as of now I don't see any major PROs
> for having them in the same repo.
>  
> Regards,
> Kaxil
>  
>  
>  
> On Fri, Jul 3, 2020 at 5:00 PM Jarek Potiuk 
> wrote:
>  
>> I think Ry's point is an important one - I thought about writing a longer
>> post but I looked at the Kubernetes structure and I really like it so just
>> wanted to comment on this last one.
>>  
>> Seems that it is simply one "authoritative" (or source of truth) repo where
>> everything is developed in monorepo fashion but then there is a bot
>> that moves every commit related to subdirectories to those "split-out"
>> repos. There are never direct commits of people or PRs in the "split-out"
>> repositories. This is very similar to my original proposal to have
>> dedicated repos used for releases - but with an automated way of publishing
>> the commits to the "separated" repos at the moment they are merged to
>> master in the main repo. I love it.
>>  
>> I think it's a really good and "pragmatic" solution. The code is
>> available in separate repos, including the history of commits related to
>> each "entity"
>> (so only chart-related commits in chart repo). Issues for particular
>> "entities" are in those separate repos as well (something that Kaxil
>> mentioned). Users (not developers!) who are interested only in Dockerfile
>> or Helm Chart have separate repos they can look at - with only relevant
>> changes and history of releases for that particular entity. They can raise
>> issues there (and in GitHub, we can easily refer to those issues from the
>> main "airflow" repo). All the discussions from "user issues" are kept in
>> the relevant repositories. Still - comments about development changes (and
>> related issues) might still be kept in the main "airflow" repo - next to
>> other "development" changes.
>>  
>> We can run separate releases from those linked repositories and even
>> publish sources directly from those repositories rather than from the main
>> one. At the same time - we avoid all the hassle of submodules.
>>  
>> We had a discussion with Daniel yesterday and we are both concerned about
>> all the overhead for people like us who work on all three "entities" at
>> the same time. Even just explaining how to work with Pull Requests and in
>> what sequence those PRs would have to be opened and merged in case of
>> changes that are spanning across several "entities" - was a challenge. I
>> was unable to clearly explain the sequence and way of reviewing/merging
>> the PRs that will have to be made if we have submodules. This is a bad
>> sign as I was using submodules in the past and know how they work but I
>> was unable to explain it clearly.
>>  
>> I really, really like the Kubernetes approach - seems that it's one of the
>> cases where we can "eat our cake and have it too".
>>  
>> J.
>>  
>>  
>> On Thu, Jul 2, 2020 at 5:59 PM Ry Walker  wrote:
>>  
>> > One reason to have a monorepo is for project branding, and end user
>> > experience. But for component development experience, it's nice to
>> > have a small, dedicated repo.
>> >
>> > I think the git submodule approach is technically sound, but is at odds
>> > with making the project easy to 

Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-03 Thread Kaxil Naik
I will take a look at the Kubernetes approach and get back to this thread.

> We had a discussion with Daniel yesterday and we are both concerned about
> all the overhead for people like us who work on all three "entities" at the
> same time. Even just explaining how to work with Pull Requests and in what
> sequence those PRs would have to be opened and merged in case of changes
> that are spanning across several "entities" - was a challenge. I was unable
> to clearly explain the sequence and way of reviewing/merging the PRs that
> will have to be made if we have submodules. This is a bad sign as I was
> using submodules in the past and know how it works but I was unable to
> explain it clearly.


We don't even need submodules tbh. We can just use a Bash script that pulls a
pinned Helm Chart version.
We only need the Helm chart to run integration tests for k8s (at least for
now). We already use tons of Bash scripts.
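
For illustration, a script that pulls a pinned chart version could boil down to a shallow clone of one tag. The sketch below only builds that command; the repository URL, tag and destination are placeholders I made up, since no separate chart repo exists yet.

```python
import shlex


def build_pinned_clone_command(repo_url, pinned_ref, destination):
    """Return the argv for a shallow clone of exactly one tag or branch."""
    return [
        "git", "clone",
        "--depth", "1",          # only the pinned revision, no history
        "--branch", pinned_ref,  # accepts a tag as well as a branch name
        repo_url,
        destination,
    ]


if __name__ == "__main__":
    # All three values are hypothetical placeholders.
    command = build_pinned_clone_command(
        "https://example.com/airflow-helm-chart.git", "v1.0.0", "build/chart"
    )
    print(shlex.join(command))
```

Bumping the pinned tag then becomes an explicit, reviewable change in Airflow rather than something that happens on every chart commit.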

One of the important benefits of separation is that changes in one component
should not need changes in another component, at least not immediately.

Changes in the Helm chart and Dockerfile should never need changes in
Airflow. Changes in Airflow should only ever need a change in the Dockerfile
and Helm Chart after a new version is released.

I just had a talk with Daniel too and still didn't find a good enough
reason to have them in the same repo.

I will definitely look at the Kubernetes approach (maybe it is better) and
get back to this thread. But as of now I don't see any major pros
for having them in the same repo.

Regards,
Kaxil



On Fri, Jul 3, 2020 at 5:00 PM Jarek Potiuk 
wrote:

> I think Ry's point is an important one - I thought about writing a longer
> post but I looked at the Kubernetes structure and I really like it so just
> wanted to comment on this last one.
>
> Seems that it is simply one "authoritative" (or source of truth) repo where
> everything is developed in monorepo fashion but then there is a bot
> that moves every commit related to subdirectories to those "split-out"
> repos. There are never direct commits of people or PRs in the "split-out"
> repositories. This is very similar to my original proposal to have
> dedicated repos used for releases - but with an automated way of publishing
> the commits to the "separated" repos at the moment they are merged to
> master in the main repo. I love it.
>
> I think it's a really good and "pragmatic" solution. The code is available in
> separate repos, including the history of commits related to each "entity"
> (so only chart-related commits in chart repo). Issues for particular
> "entities" are in those separate repos as well (something that Kaxil
> mentioned). Users (not developers!) who are interested only in Dockerfile
> or Helm Chart have separate repos they can look at - with only relevant
> changes and history of releases for that particular entity. They can raise
> issues there (and in GitHub, we can easily refer to those issues from the
> main "airflow" repo). All the discussion from "user issues" are kept in the
> relevant repositories. Still - comments about development changes (and
> related issues) might still be kept in the main "airflow" repo - next to
> other "development" changes.
>
> We can run separate releases from those linked repositories and even
> publish sources directly from those repositories rather than from the main
> one. At the same time - we avoid all the hassle of submodules.
>
> We had a discussion with Daniel yesterday and we are both concerned about
> all the overhead for people like us who work on all three "entities" at the
> same time. Even just explaining how to work with Pull Requests and in what
> sequence those PRs would have to be opened and merged in case of changes
> that are spanning across several "entities" - was a challenge. I was unable
> to clearly explain the sequence and way of reviewing/merging the PRs that
> will have to be made if we have submodules. This is a bad sign as I was
> using submodules in the past and know how it works but I was unable to
> explain it clearly.
>
> I really, really like Kubernetes approach - seems that it's one of the
> cases where we can "eat cake and have it too".
>
> J.
>
>
> On Thu, Jul 2, 2020 at 5:59 PM Ry Walker  wrote:
>
> > One reason to have a monorepo is for project branding, and end user
> > experience. But for component development experience, it's nice to have a
> > small, dedicated repo.
> >
> > I think the git submodule approach is technically sound, but is at odds
> > with making the project easy to consume/understand from the end user
> > perspective, especially if we expand the use of subprojects. And the main
> > Airflow commit graph would appear to be slowing down which is bad for
> > Airflow brand perception.
> >
> > Kubernetes has many sub-repos that are integrated into the main repo -
> > which I think could be the best of both worlds:
> > Example: https://github.com/kubernetes/kubernetes/tree/master/staging
> >
> > I haven't dug in very 

Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-02 Thread Ry Walker
One reason to have a monorepo is for project branding, and end user
experience. But for component development experience, it's nice to have a
small, dedicated repo.

I think the git submodule approach is technically sound, but is at odds
with making the project easy to consume/understand from the end user
perspective, especially if we expand the use of subprojects. And the main
Airflow commit graph would appear to be slowing down which is bad for
Airflow brand perception.

Kubernetes has many sub-repos that are integrated into the main repo -
which I think could be the best of both worlds:
Example: https://github.com/kubernetes/kubernetes/tree/master/staging

I haven't dug in very deeply, and I won't pretend to understand how
challenging it may be to maintain this structure, but I'd support breaking
more components out of the main Airflow repo for dev purposes (for example,
in the future, it'd be nice to have airflow-cli, airflow-api,
airflow-scheduler, individual provider repos that are cleanly separated) as
long as we bring the commits/contributions back into the monorepo with
automation.

Maybe we could dive a little deeper into how K8s is operating, before going
with submodules?

-Ry




On Thu, Jul 2, 2020 at 11:24 AM Kaxil Naik  wrote:

> Let's come to a consensus first before we do anything :-)
>
> Is everyone happy with separate repo approach? Let's wait for 72 hours to
> hear from all and then have a plan on how we do it? WDYT?
>
> But indeed the git submodules approach sounds good. We do it for *Airflow
> Site* (
> https://github.com/apache/airflow-site/tree/master/landing-pages/site/themes)
> too.
>
> Regards,
> Kaxil
>
> On Thu, Jul 2, 2020 at 4:15 PM Jarek Potiuk 
> wrote:
>
> > Absolutely - I am happy to add "best practices" and short "howto do stuff
> > with git submodules"  - and this knowledge will only be needed for
> > interacting with prod image/helmchart/running kubernetes tests. For all
> the
> > other purposes it should be "business as usual".
> >
> > On Thu, Jul 2, 2020 at 4:53 PM Daniel Imberman <
> daniel.imber...@gmail.com>
> > wrote:
> >
> > > I think git submodules sounds like a great idea. We would need to write
> > > this into the CONTRIBUTING.md to let people know how to do it but It’s
> a
> > > “teach once” situation.
> > >
> > > On Thu, Jul 2, 2020 at 2:44 AM, Tomasz Urbaszek 
> > > wrote:
> > > I support the idea of separate repos. The git submodules mentioned by
> > > Jarek sounds like an interesting solution. It may add some complexity
> > > for new contributors but it's not rocket science. If we agree on using
> > > this we should add small how-to in contributing.rst I think (i.e. do I
> > > have to have fork of each repo?).
> > >
> > > As stressed previously if we go this route we should make sure we have
> > > nice testing of all those three components. Regarding the versioning,
> > > I have no strong opinion but I fully support using separate issues for
> > > airflow, docker, and helm.
> > >
> > > Tomek
> > >
> > >
> > > On Thu, Jul 2, 2020 at 9:26 AM Jarek Potiuk 
> > > wrote:
> > > >
> > > > On Thu, Jul 2, 2020 at 3:16 AM Daniel Imberman <
> > > daniel.imber...@gmail.com>
> > > > wrote:
> > > >
> > > > I’m fine with keeping it as three separate repos but merging testing
> > > > > somehow (e.g. the source code chart would pull the helm/docker
> chart
> > > into
> > > > > .build) but we need to do it in a way that doesn’t make testing too
> > > > > difficult.
> > > > >
> > > > > So for example: How do I test/integration test a change that
> > involves a
> > > > > change to all three and has to be done at the same time? Perhaps a
> > > user can
> > > > > “register” a branch of helm and docker when they start up breeze?
> Or
> > > > > perhaps we create a “parent” integration test that uses the three
> > > together?
> > > > >
> > > >
> > > > Yes, those are exactly my concerns when splitting the repos.
> > > >
> > > > I think testing for development should remain in the "airflow" repo.
> It
> > > is
> > > > the "central one" in fact. I slept it over and I think using
> "released"
> > > > versions for development testing will suffer from this "we need a
> > change
> > > in
> > > > all three of those".
> > > >
> > > > But we have an easy solution I think.
> > > >
> > > > I think that simply setting submodules properly should do to the job:
> > > > https://git-scm.com/book/en/v2/Git-Tools-Submodules. They seem to be
> > > > perfect for our case.
> > > >
> > > > For those who have not used it - in short - submodules work in the
> way
> > > that
> > > > they register the "linked repos" and store related "hash" of the
> commit
> > > > from that linked repo. For example, the "chart" folder will be a link
> > to
> > > > "apache/airflow-helm-chart". We can also move the prod Dockerfile to
> a
> > > > subfolder and link it to the separate repo. Git submodule 

Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-02 Thread Kaxil Naik
Let's come to a consensus first before we do anything :-)

Is everyone happy with the separate repo approach? Let's wait for 72 hours to
hear from all and then have a plan on how we do it? WDYT?

But indeed the git submodules approach sounds good. We do it for *Airflow
Site* (
https://github.com/apache/airflow-site/tree/master/landing-pages/site/themes)
too.

Regards,
Kaxil

On Thu, Jul 2, 2020 at 4:15 PM Jarek Potiuk 
wrote:

> Absolutely - I am happy to add "best practices" and short "howto do stuff
> with git submodules"  - and this knowledge will only be needed for
> interacting with prod image/helmchart/running kubernetes tests. For all the
> other purposes it should be "business as usual".
>
> On Thu, Jul 2, 2020 at 4:53 PM Daniel Imberman 
> wrote:
>
> > I think git submodules sounds like a great idea. We would need to write
> > this into the CONTRIBUTING.md to let people know how to do it but It’s a
> > “teach once” situation.
> >
> > On Thu, Jul 2, 2020 at 2:44 AM, Tomasz Urbaszek 
> > wrote:
> > I support the idea of separate repos. The git submodules mentioned by
> > Jarek sounds like an interesting solution. It may add some complexity
> > for new contributors but it's not rocket science. If we agree on using
> > this we should add small how-to in contributing.rst I think (i.e. do I
> > have to have fork of each repo?).
> >
> > As stressed previously if we go this route we should make sure we have
> > nice testing of all those three components. Regarding the versioning,
> > I have no strong opinion but I fully support using separate issues for
> > airflow, docker, and helm.
> >
> > Tomek
> >
> >
> > On Thu, Jul 2, 2020 at 9:26 AM Jarek Potiuk 
> > wrote:
> > >
> > > On Thu, Jul 2, 2020 at 3:16 AM Daniel Imberman <
> > daniel.imber...@gmail.com>
> > > wrote:
> > >
> > > I’m fine with keeping it as three separate repos but merging testing
> > > > somehow (e.g. the source code chart would pull the helm/docker chart
> > into
> > > > .build) but we need to do it in a way that doesn’t make testing too
> > > > difficult.
> > > >
> > > > So for example: How do I test/integration test a change that
> involves a
> > > > change to all three and has to be done at the same time? Perhaps a
> > user can
> > > > “register” a branch of helm and docker when they start up breeze? Or
> > > > perhaps we create a “parent” integration test that uses the three
> > together?
> > > >
> > >
> > > Yes, those are exactly my concerns when splitting the repos.
> > >
> > > I think testing for development should remain in the "airflow" repo. It
> > is
> > > the "central one" in fact. I slept it over and I think using "released"
> > > versions for development testing will suffer from this "we need a
> change
> > in
> > > all three of those".
> > >
> > > But we have an easy solution I think.
> > >
> > > I think that simply setting submodules properly should do to the job:
> > > https://git-scm.com/book/en/v2/Git-Tools-Submodules. They seem to be
> > > perfect for our case.
> > >
> > > For those who have not used it - in short - submodules work in the way
> > that
> > > they register the "linked repos" and store related "hash" of the commit
> > > from that linked repo. For example, the "chart" folder will be a link
> to
> > > "apache/airflow-helm-chart". We can also move the prod Dockerfile to a
> > > subfolder and link it to the separate repo. Git submodule has a
> > > built-in mechanism to a) update to the latest version of the repo, b)
> > > commit your changes to the linked repo from there which is all we
> need. I
> > > used those few times - I never liked submodules for sharing "library"
> > code,
> > > but for sharing helm/Docker It seems perfect.
> > >
> > > From the "regular" developer point of view - you do not need to
> > get/update
> > > submodules if you do not need to use them - so for all the development
> > > purposes if you only change the "airflow" code, you would not even need
> > to
> > > sync chart or Dockerfile. You do "git checkout" as usual and it should
> > > work. So basically - no change for "regular" airflow development.
> > >
> > > However, if you do need to work on helm + Docker + code, then you
> simply
> > to
> > > "git submodule update", go to the linked "helm" or "docker" folder,
> > > checkout the "master" version and you start making changes. The only
> > thing
> > > to remember when you want to push your changes is to do `git push
> > > > --recurse-submodules=check` and it will make sure that all the repos
> > are
> > > updated, It is a bit involved, but latest git version have a very good
> > > support and it must only be used by people who work on airflow +
> docker +
> > > helm - all the others are unaffected.
> > >
> > > From the CI perspective also nothing changes - when we checkout the
> code
> > we
> > > will include submodules and our test harness will be largely unchanged.
> > > 

Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-02 Thread Jarek Potiuk
Absolutely - I am happy to add "best practices" and a short "how to do stuff
with git submodules" - and this knowledge will only be needed for
interacting with the prod image/helm chart/running Kubernetes tests. For all
other purposes it should be "business as usual".

On Thu, Jul 2, 2020 at 4:53 PM Daniel Imberman 
wrote:

> I think git submodules sounds like a great idea. We would need to write
> this into the CONTRIBUTING.md to let people know how to do it but It’s a
> “teach once” situation.
>
> On Thu, Jul 2, 2020 at 2:44 AM, Tomasz Urbaszek 
> wrote:
> I support the idea of separate repos. The git submodules mentioned by
> Jarek sounds like an interesting solution. It may add some complexity
> for new contributors but it's not rocket science. If we agree on using
> this we should add small how-to in contributing.rst I think (i.e. do I
> have to have fork of each repo?).
>
> As stressed previously if we go this route we should make sure we have
> nice testing of all those three components. Regarding the versioning,
> I have no strong opinion but I fully support using separate issues for
> airflow, docker, and helm.
>
> Tomek
>
>
> On Thu, Jul 2, 2020 at 9:26 AM Jarek Potiuk 
> wrote:
> >
> > On Thu, Jul 2, 2020 at 3:16 AM Daniel Imberman <
> daniel.imber...@gmail.com>
> > wrote:
> >
> > I’m fine with keeping it as three separate repos but merging testing
> > > somehow (e.g. the source code chart would pull the helm/docker chart
> into
> > > .build) but we need to do it in a way that doesn’t make testing too
> > > difficult.
> > >
> > > So for example: How do I test/integration test a change that involves a
> > > change to all three and has to be done at the same time? Perhaps a
> user can
> > > “register” a branch of helm and docker when they start up breeze? Or
> > > perhaps we create a “parent” integration test that uses the three
> together?
> > >
> >
> > Yes, those are exactly my concerns when splitting the repos.
> >
> > I think testing for development should remain in the "airflow" repo. It
> is
> > the "central one" in fact. I slept it over and I think using "released"
> > versions for development testing will suffer from this "we need a change
> in
> > all three of those".
> >
> > But we have an easy solution I think.
> >
> > I think that simply setting submodules properly should do to the job:
> > https://git-scm.com/book/en/v2/Git-Tools-Submodules. They seem to be
> > perfect for our case.
> >
> > For those who have not used it - in short - submodules work in the way
> that
> > they register the "linked repos" and store related "hash" of the commit
> > from that linked repo. For example, the "chart" folder will be a link to
> > "apache/airflow-helm-chart". We can also move the prod Dockerfile to a
> > subfolder and link it to the separate repo. Git submodule has a
> > built-in mechanism to a) update to the latest version of the repo, b)
> > commit your changes to the linked repo from there which is all we need. I
> > used those few times - I never liked submodules for sharing "library"
> code,
> > but for sharing helm/Docker It seems perfect.
> >
> > From the "regular" developer point of view - you do not need to
> get/update
> > submodules if you do not need to use them - so for all the development
> > purposes if you only change the "airflow" code, you would not even need
> to
> > sync chart or Dockerfile. You do "git checkout" as usual and it should
> > work. So basically - no change for "regular" airflow development.
> >
> > However, if you do need to work on helm + Docker + code, then you simply
> to
> > "git submodule update", go to the linked "helm" or "docker" folder,
> > checkout the "master" version and you start making changes. The only
> thing
> > to remember when you want to push your changes is to do `git push
> > --recurse-submodules=check` and it will make sure that all the repos
> are
> > updated, It is a bit involved, but latest git version have a very good
> > support and it must only be used by people who work on airflow + docker +
> > helm - all the others are unaffected.
> >
> > From the CI perspective also nothing changes - when we checkout the code
> we
> > will include submodules and our test harness will be largely unchanged.
> > Submodule provides us with the right mechanism for cross dependency even
> if
> > we use branches.
> >
> > If everyone will be ok with that - I am happy to set it up, With
> submodules
> > - we can switch to separate repos even without releasing helm and Prod
> > chart "officially".
> >
> > J.
> >
> >
> >
> > >
> > > On Wed, Jul 1, 2020 at 3:20 PM, Jarek Potiuk  >
> > > wrote:
> > > Sure. We can work with such an approach. There will be some
> dependencies
> > > that we might find are problematic, but If we all see that it's
> > > 

Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-02 Thread Daniel Imberman
I think git submodules sound like a great idea. We would need to write this
into CONTRIBUTING.md to let people know how to do it, but it's a "teach
once" situation.

On Thu, Jul 2, 2020 at 2:44 AM, Tomasz Urbaszek  wrote:
I support the idea of separate repos. The git submodules mentioned by
Jarek sounds like an interesting solution. It may add some complexity
for new contributors but it's not rocket science. If we agree on using
this we should add small how-to in contributing.rst I think (i.e. do I
have to have fork of each repo?).

As stressed previously if we go this route we should make sure we have
nice testing of all those three components. Regarding the versioning,
I have no strong opinion but I fully support using separate issues for
airflow, docker, and helm.

Tomek


On Thu, Jul 2, 2020 at 9:26 AM Jarek Potiuk  wrote:
>
> On Thu, Jul 2, 2020 at 3:16 AM Daniel Imberman 
> wrote:
>
> I’m fine with keeping it as three separate repos but merging testing
> > somehow (e.g. the source code chart would pull the helm/docker chart into
> > .build) but we need to do it in a way that doesn’t make testing too
> > difficult.
> >
> > So for example: How do I test/integration test a change that involves a
> > change to all three and has to be done at the same time? Perhaps a user can
> > “register” a branch of helm and docker when they start up breeze? Or
> > perhaps we create a “parent” integration test that uses the three together?
> >
>
> Yes, those are exactly my concerns when splitting the repos.
>
> I think testing for development should remain in the "airflow" repo. It is
> the "central one" in fact. I slept it over and I think using "released"
> versions for development testing will suffer from this "we need a change in
> all three of those".
>
> But we have an easy solution I think.
>
> I think that simply setting submodules properly should do to the job:
> https://git-scm.com/book/en/v2/Git-Tools-Submodules. They seem to be
> perfect for our case.
>
> For those who have not used it - in short - submodules work in the way that
> they register the "linked repos" and store related "hash" of the commit
> from that linked repo. For example, the "chart" folder will be a link to
> "apache/airflow-helm-chart". We can also move the prod Dockerfile to a
> subfolder and link it to the separate repo. Git submodule has a
> built-in mechanism to a) update to the latest version of the repo, b)
> commit your changes to the linked repo from there which is all we need. I
> used those few times - I never liked submodules for sharing "library" code,
> but for sharing helm/Docker It seems perfect.
>
> From the "regular" developer point of view - you do not need to get/update
> submodules if you do not need to use them - so for all the development
> purposes if you only change the "airflow" code, you would not even need to
> sync chart or Dockerfile. You do "git checkout" as usual and it should
> work. So basically - no change for "regular" airflow development.
>
> However, if you do need to work on helm + Docker + code, then you simply to
> "git submodule update", go to the linked "helm" or "docker" folder,
> checkout the "master" version and you start making changes. The only thing
> to remember when you want to push your changes is to do `git push
> --recurse-submodules=check` and it will make sure that all the repos are
> updated, It is a bit involved, but latest git version have a very good
> support and it must only be used by people who work on airflow + docker +
> helm - all the others are unaffected.
>
> From the CI perspective also nothing changes - when we checkout the code we
> will include submodules and our test harness will be largely unchanged.
> Submodule provides us with the right mechanism for cross dependency even if
> we use branches.
>
> If everyone will be ok with that - I am happy to set it up, With submodules
> - we can switch to separate repos even without releasing helm and Prod
> chart "officially".
>
> J.
>
>
>
> >
> > On Wed, Jul 1, 2020 at 3:20 PM, Jarek Potiuk 
> > wrote:
> > Sure. We can work with such an approach. There will be some dependencies
> > that we might find are problematic, but If we all see that it's
> > worth trying, there is a clear benefit that it makes for a "clean"
> > split between those different "entities". And possibly once we release
> > first versions of both image and chart, such problems will be rare and easy
> > to fix.
> >
> > I personally think such split is inevitable eventually, it's just a matter
> > when to do it. If we decide to make this happen soon - I am more than happy
> > to work on making the split reality.
> >
> > One prerequisite to that is that all those - Helm Chart, Prod Image and
> > Airflow are released in stable versions separately "officially" 

Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-02 Thread Tomasz Urbaszek
I support the idea of separate repos. The git submodules mentioned by
Jarek sound like an interesting solution. It may add some complexity
for new contributors but it's not rocket science. If we agree on using
this we should add a small how-to in contributing.rst I think (e.g. do I
have to have a fork of each repo?).

As stressed previously if we go this route we should make sure we have
nice testing of all those three components. Regarding the versioning,
I have no strong opinion but I fully support using separate issues for
airflow, docker, and helm.

Tomek


On Thu, Jul 2, 2020 at 9:26 AM Jarek Potiuk  wrote:
>
> On Thu, Jul 2, 2020 at 3:16 AM Daniel Imberman 
> wrote:
>
> I’m fine with keeping it as three separate repos but merging testing
> > somehow (e.g. the source code chart would pull the helm/docker chart into
> > .build) but we need to do it in a way that doesn’t make testing too
> > difficult.
> >
> > So for example: How do I test/integration test a change that involves a
> > change to all three and has to be done at the same time? Perhaps a user can
> > “register” a branch of helm and docker when they start up breeze? Or
> > perhaps we create a “parent” integration test that uses the three together?
> >
>
> Yes, those are exactly my concerns when splitting the repos.
>
> I think testing for development should remain in the "airflow" repo. It is
> the "central one" in fact. I slept it over and I think using "released"
> versions for development testing will suffer from this "we need a change in
> all three of those".
>
> But we have an easy solution  I think.
>
> I think that simply setting submodules properly should do to the job:
> https://git-scm.com/book/en/v2/Git-Tools-Submodules. They seem to be
> perfect for our case.
>
> For those who have not used it - in short - submodules work in the way that
> they register the "linked repos" and store related "hash" of the commit
> from that linked repo. For example, the "chart" folder will be a link to
> "apache/airflow-helm-chart". We can also move the prod Dockerfile to a
> subfolder and link it to the separate repo. Git submodule has a
> built-in mechanism to a) update to the latest version of the repo, b)
> commit your changes to the linked repo from there which is all we need. I
> used those few times - I never liked submodules for sharing "library" code,
> but for sharing helm/Docker It seems perfect.
>
> From the "regular" developer point of view - you do not need to get/update
> submodules if you do not need to use them - so for all the development
> purposes if you only change the "airflow" code, you would not even need to
> sync chart or Dockerfile. You do "git checkout" as usual and it should
> work. So basically - no change for "regular" airflow development.
>
> However, if you do need to work on helm + Docker + code, then you simply to
> "git submodule update", go to the linked "helm" or "docker" folder,
> checkout the "master" version and you start making changes. The only thing
> to remember when you want to push your changes is to do `git push
> --recurse-submodules=check` and it will make sure that all the repos are
> updated, It is a bit involved, but latest git version have a very good
> support and it must only be used by people who work on airflow + docker +
> helm - all the others are unaffected.
>
> From the CI perspective also nothing changes - when we checkout the code we
> will include submodules and our test harness will be largely unchanged.
> Submodule provides us with the right mechanism for cross dependency even if
> we use branches.
>
> If everyone will be ok with that - I am happy to set it up, With submodules
> - we can switch to separate repos even without releasing helm and Prod
> chart "officially".
>
> J.
>
>
>
> >
> > On Wed, Jul 1, 2020 at 3:20 PM, Jarek Potiuk 
> > wrote:
> > Sure. We can work with such an approach. There will be some dependencies
> > that we might find are problematic, but If we all see that it's
> > worth trying, there is a clear benefit that it makes for a "clean"
> > split between those different "entities". And possibly once we release
> > first versions of both image and chart, such problems will be rare and easy
> > to fix.
> >
> > I personally think such split is inevitable eventually, it's just a matter
> > when to do it. If we decide to make this happen soon - I am more than happy
> > to work on making the split reality.
> >
> > One prerequisite to that is that all those - Helm Chart, Prod Image and
> > Airflow are released in stable versions separately "officially" - from the
> > current sources (otherwise there will be no way to test cross-repo).
> >
> > I think for that we will need to agree on the versioning scheme and cadence
> > for the Image and Helm Chart, then copy sources from airflow and release
> > them as "baseline" including setup the tests for all of those - then 

Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-02 Thread Jarek Potiuk
On Thu, Jul 2, 2020 at 3:16 AM Daniel Imberman 
wrote:

> I’m fine with keeping it as three separate repos but merging testing
> somehow (e.g. the source code chart would pull the helm/docker chart into
> .build) but we need to do it in a way that doesn’t make testing too
> difficult.
>
> So for example: How do I test/integration test a change that involves a
> change to all three and has to be done at the same time? Perhaps a user can
> “register” a branch of helm and docker when they start up breeze? Or
> perhaps we create a “parent” integration test that uses the three together?
>

Yes, those are exactly my concerns when splitting the repos.

I think testing for development should remain in the "airflow" repo. It is
the "central one" in fact. I slept on it and I think using "released"
versions for development testing will suffer from this "we need a change in
all three of those" problem.

But we have an easy solution, I think.

I think that simply setting up submodules properly should do the job:
https://git-scm.com/book/en/v2/Git-Tools-Submodules. They seem to be
perfect for our case.

For those who have not used them - in short - submodules work by
registering the "linked repos" and storing the related "hash" of the commit
from each linked repo. For example, the "chart" folder will be a link to
"apache/airflow-helm-chart". We can also move the prod Dockerfile to a
subfolder and link it to the separate repo. Git submodule has a
built-in mechanism to a) update to the latest version of the repo, b)
commit your changes to the linked repo from there, which is all we need. I
have used those a few times - I never liked submodules for sharing "library"
code, but for sharing helm/Docker it seems perfect.
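
To make the "register the linked repo" part concrete, here is a small sketch
of the entry that `git submodule add` records in the `.gitmodules` file. The
GitHub URL is an assumption built from the "apache/airflow-helm-chart" name
above (that repo is only proposed here), and the helper name is made up for
illustration. The pinned commit hash itself is not in this file - git stores
it in the parent repository's tree as a special "gitlink" entry.

```python
def gitmodules_entry(name, path, url):
    """Render one submodule section in the format used by .gitmodules."""
    return f'[submodule "{name}"]\n\tpath = {path}\n\turl = {url}\n'


if __name__ == "__main__":
    # Hypothetical values: the "chart" folder linked to the proposed repo.
    print(gitmodules_entry(
        name="chart",
        path="chart",
        url="https://github.com/apache/airflow-helm-chart",  # assumed URL
    ))
```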

From the "regular" developer point of view - you do not need to get/update
submodules if you do not need to use them - so for all the development
purposes if you only change the "airflow" code, you would not even need to
sync chart or Dockerfile. You do "git checkout" as usual and it should
work. So basically - no change for "regular" airflow development.

However, if you do need to work on helm + Docker + code, then you simply do
"git submodule update", go to the linked "helm" or "docker" folder,
check out the "master" version and start making changes. The only thing
to remember when you want to push your changes is to do `git push
--recurse-submodules=check`, which verifies that the submodule commits you
refer to have already been pushed and aborts the push otherwise. It is a bit
involved, but recent git versions have very good support and it only needs
to be used by people who work on airflow + docker + helm - all the others
are unaffected.
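
The sequence above could be sketched roughly as follows. This is only an
illustration that builds the git commands as lists, without running them.
The helper name and the "chart" path are assumptions, and `--init` is added
because a fresh clone needs it before the first update.

```python
import shlex


def submodule_workflow(submodule_path, branch="master"):
    """Return the git commands for changing a submodule and pushing safely."""
    return [
        # Fetch the submodule at the commit pinned by the parent repository.
        ["git", "submodule", "update", "--init", submodule_path],
        # Inside the submodule, switch from the detached commit to a branch.
        ["git", "-C", submodule_path, "checkout", branch],
        # (Edit, commit and push inside the submodule happen here.)
        # Back in the parent repo, record the new submodule commit.
        ["git", "add", submodule_path],
        ["git", "commit", "-m", f"Update {submodule_path} submodule"],
        # Abort the push if the recorded submodule commit is on no remote.
        ["git", "push", "--recurse-submodules=check"],
    ]


if __name__ == "__main__":
    for step in submodule_workflow("chart"):  # "chart" is a hypothetical path
        print(shlex.join(step))
```

Using `--recurse-submodules=on-demand` instead of `check` would make git push
the changed submodules itself rather than just refusing the push.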

From the CI perspective also nothing changes - when we check out the code we
will include submodules and our test harness will be largely unchanged.
Submodule provides us with the right mechanism for cross dependency even if
we use branches.

If everyone is OK with that - I am happy to set it up. With submodules
- we can switch to separate repos even without releasing helm and Prod
chart "officially".

J.




Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-01 Thread Daniel Imberman
I’m fine with keeping it as three separate repos but merging testing somehow 
(e.g. the source code chart would pull the helm/docker chart into .build) but 
we need to do it in a way that doesn’t make testing too difficult.

So for example: How do I test/integration test a change that involves a change 
to all three and has to be done at the same time? Perhaps a user can “register” 
a branch of helm and docker when they start up breeze? Or perhaps we create a 
“parent” integration test that uses the three together?


Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-01 Thread Jarek Potiuk
Sure. We can work with such an approach. There will be some dependencies
that we might find problematic, but if we all see that it's
worth trying, there is a clear benefit in that it makes for a "clean"
split between those different "entities". And possibly once we release
the first versions of both image and chart, such problems will be rare and easy
to fix.

I personally think such a split is inevitable eventually, it's just a matter
of when to do it. If we decide to make this happen soon - I am more than happy
to work on making the split a reality.

One prerequisite to that is that all those - Helm Chart, Prod Image and
Airflow are released in stable versions separately "officially" - from the
current sources (otherwise there will be no way to test cross-repo).

I think for that we will need to agree on the versioning scheme and cadence
for the Image and Helm Chart, then copy sources from airflow and release
them as "baseline", including setting up the tests for all of those - then we
can remove both Helm and Dockerfile from the airflow repo. Happy to help
with that if that's the direction we choose as a community. It is important
though that we keep the cross-repo testing working. We have it working as
of yesterday, so now the matter is - whatever we do, we keep it running and
have the development environment support easy development and testing of
any of the three (including CI testing cross-repos). That's the only
really important thing to me - the rest is more of a technicality of how we
link the repos, but the principle remains.

Do we have an idea for the versioning scheme that we would like to use for
the Helm Chart and prod image?

Should we make it CalVer or SemVer (or some other scheme)? And how should we
treat the combinations with Airflow?

My thoughts (but I have no strong opinions, in case someone proposes a more
sensible versioning scheme):

1) Airflow code - we continue the release scheme we have (with deciding on
the 2.* scheme for the release). I expect in the future we might decide on
doing branches or patches, so for 2.* I'd opt for going with the full SemVer
approach and patches released from branches.

2) I believe that Helm Chart can be versioned with its own version (then
you specify the image version as helm parameter). For the Helm Chart I
think CalVer might be OK as I do not expect any branching/patches in the
future - I'd expect that there will be a single stream of releases.

3) Dockerfile (+ related files such as .dockerignore, empty dir,
entrypoints etc). I do not imagine a lot of branching for those - we
should be able to release a new version of a Dockerfile (+ related files)
working with nearly any earlier Airflow release, so CalVer seems like a
good choice.

4) Image versioning becomes a bit more complex because the image tag is
always a combination of:
* Dockerfile (+ related files) version
* Airflow Version
* Python Version

An example versioning I can imagine:

*Airflow*: 1.10.11, 1.10.12, 2.0.0, 2.1.0, 2.1.1 - patch level (if we
decide to have patches).
*Dockerfile*: 2020.07.12, 2020.08.20 ... -> depending on when we release them
*Helm Chart*: 2020.07.10, 2020.08.09 ... Each Helm Chart has a minimum
version of both Dockerfile and Airflow it works with.

*Example Docker Image tags:*
apache/airflow:dockerfile2020.07.10-airflow1.10.10-python3.6
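
To illustrate how such a combined tag could be built, parsed and checked against a chart's minimum versions, here is a minimal sketch. The tag layout is inferred from the single example above, and all the names (`ImageTag`, `parse_tag`, `chart_supports`) are made up for illustration; pre-release suffixes such as "rc1" are not handled.

```python
import re
from dataclasses import dataclass

# Assumed layout: dockerfile<CalVer>-airflow<X.Y.Z>-python<X.Y>
_TAG_RE = re.compile(
    r"^dockerfile(?P<dockerfile>\d{4}\.\d{2}\.\d{2})"
    r"-airflow(?P<airflow>\d+\.\d+\.\d+)"
    r"-python(?P<python>\d+\.\d+)$"
)


@dataclass(frozen=True)
class ImageTag:
    dockerfile: str  # CalVer of the Dockerfile (+ related files)
    airflow: str     # Airflow version installed in the image
    python: str      # Python version of the base image

    def __str__(self) -> str:
        return f"dockerfile{self.dockerfile}-airflow{self.airflow}-python{self.python}"


def parse_tag(tag: str) -> ImageTag:
    """Split a combined tag back into its three versions."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognised image tag: {tag!r}")
    return ImageTag(**match.groupdict())


def as_tuple(version: str) -> tuple:
    """Compare versions numerically, so that 1.10.10 sorts after 1.10.9."""
    return tuple(int(part) for part in version.split("."))


def chart_supports(tag: ImageTag, min_dockerfile: str, min_airflow: str) -> bool:
    """True if the image meets the minimum versions a Helm Chart declares."""
    return (
        as_tuple(tag.dockerfile) >= as_tuple(min_dockerfile)
        and as_tuple(tag.airflow) >= as_tuple(min_airflow)
    )


if __name__ == "__main__":
    tag = parse_tag("dockerfile2020.07.10-airflow1.10.10-python3.6")
    print(f"apache/airflow:{tag}", chart_supports(tag, "2020.07.01", "1.10.9"))
```

Comparing the versions as integer tuples rather than strings matters here: as plain strings "1.10.10" would sort before "1.10.9".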

WDYT?

J,


On Wed, Jul 1, 2020 at 11:12 PM Kaxil Naik  wrote:

> I think we should have "separate repos for development" too.
>
> 3 Repos in total:
>
> 1) apache/airflow
> 2) apache/airflow-docker-image
> 3) apache/airflow-helm-chart
>
>
> (1) *apache/airflow* should use a pinned stable version of Airflow Helm
> chart to run Kubernetes tests
> (2) *apache/airflow* already has *Dockerfile.ci* file which it can use to
> run airflow tests on docker images.
> (3) *apache/airflow-docker-image* should use the latest available stable
> version of airflow
> (4) *apache/airflow-helm-chart* should use the latest available stable
> version of airflow
>
> > Having such split also makes some updates more difficult - for example if
> > we add new "extra" to Airflow that will require to install "apt" dependency
> > in Dockerfile, we will have to split it into first adding the dependency to
> > Dockerfile, and once it is merged, we can add the extra to airflow with
> > setup.py.
>
>
> Adding a new extra to setup.py would not (and should not) impact the
> development of *apache/airflow-docker-image*
> Once an RC is cut for apache/airflow or after a new version is released for
> apache/airflow, we can work on supporting the new airflow version in the
> Production Docker Image.
> While doing that we can add all the libraries that are needed by the new
> Airflow Version and we will have a clean commit history and changelog for
> Docker image.
>
> We definitely do not need to work in parallel on both the repos. By doing
> development in a separate repo we keep consistent "source" files and we can
> release 

Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-01 Thread Jarek Potiuk
I do not think it's only the question of Mono/Multi repos. While I clearly
see the benefit of separate repos I also see some drawbacks.

And if it bothers others, I am happy to follow the majority. If we think
that a bit more complexity in testing justifies separating those three
completely and having a more "clean" split - it's also workable but IMHO
introduces certain complexity in development.

However, I think this is not a 0/1 choice - a kind of Hybrid approach might,
in my opinion, be the best of both worlds - development and releases.

Let me explain what I mean by "Hybrid":

I think we definitely should have separate repositories to release those
artifacts and I think there is no doubt about it:

* airflow (apache/airflow)
* prod docker image (apache/airflow-docker)
* helm chart (apache/airflow-helm)
* api clients (we already have separate repos for those)
(apache/airflow-client-*)

I think the only question is where we develop all those (develop !=
release). There are certain benefits of having a single "master" (let's
call it "development" further) for all those artifacts. Currently the
"development" version for all of those is in one repo - and while
developing one depends on the other, we also test all of those together and
this means that "current best" set of airflow sources (including
dependencies in setup.py), Dockerfile and Helm chart work. This means for
example that you will not be able to break the Helm Chart by changing
anything that the helm chart depends on in airflow. For example if you
change "airflow webserver" into "airflow server" the current helm chart
will break. Similarly if you change entrypoint.sh in Docker image in a way
that is not compatible with Helm chart, we will not let that happen - the
CI tests will break if either of those changes in an incompatible way. And
we can have dependencies in any direction between those three. When we see
a commit break either of the three - we can make a decision about what to
do - either accept and document the incompatibility or fix it.

Of course keeping that property (testing it all together) is also possible
if they are in completely separate repos. There are several
cross-dependencies - Docker image building depends on dependencies in
setup.py for example, you cannot build Docker image from only Dockerfile
without the sources of airflow nor build and test helm charts without the
image (and sources - because that's where the current kubernetes tests
are). If we want to continue doing it for both Helm and Dockerfile, we
would have to basically check out the latest sources of Airflow and run the
CI tests before merging any Docker or Helm Chart changes and the opposite -
we will have to download Dockerfile/Helm chart and build image/install Helm
chart when we are running CI tests for Airflow. This is possible and we
could do it, but it adds complexity to the build/CI process.

Having such split also makes some updates more difficult - for example if
we add new "extra" to Airflow that will require to install "apt" dependency
in Dockerfile, we will have to split it into first adding the dependency to
Dockerfile, and once it is merged, we can add the extra to airflow with
setup.py. This makes it quite difficult to test it together though (the
Dockerfile change can only be tested fully after merging it to master). Not
to mention the complexity of managing different versions - your local
development Dockerfile version vs sources of Airflow for example. Imagine
switching between branches where you add two different apt dependencies to
the Dockerfile. There are more similar scenarios I can imagine - especially
for parallel changes in those repos.

It is of course doable to keep them separate, but it is quite a bit more
complex to set up (especially for a consistent development environment)
when you have separate repos, and preventing cross-breaking changes might be
more difficult.

I believe that the best way is to continue developing airflow + image +
chart in one repo - airflow, but release them from those separate repos.

The Airflow source release does not have to contain either the chart or the image.
And even if it contains sources for those, they are not the final
"artifacts" (installable image and installable helm chart).
Whenever we decide to release either of them - we test it in "development".
Then only when it is tested, we copy the sources to those separate repos
and release them.

With git - we can even do it very easily while preserving the history of
commits (been there, done that). And then we could release Helm and
Docker image separately based on the commits and tags in those separate
repositories.
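
One possible way to do such a history-preserving copy is `git subtree split`, sketched below. This is not necessarily the exact command meant above (`git filter-branch --subdirectory-filter` is another option), and the remote URL, branch names and helper name are assumptions for illustration. The helper only builds the command lines.

```python
import shlex


def split_commands(prefix, target_remote, target_branch="master"):
    """Commands that publish the history of one folder to a separate repo."""
    split_branch = "split-" + prefix.strip("/").replace("/", "-")
    return [
        # Create a branch containing only the commits that touched `prefix`,
        # rewritten so that `prefix` becomes the root of the tree.
        ["git", "subtree", "split", f"--prefix={prefix}", "-b", split_branch],
        # Push that synthetic history to the release repo.
        ["git", "push", target_remote, f"{split_branch}:{target_branch}"],
    ]


if __name__ == "__main__":
    # Hypothetical target repo; the thread only proposes the name.
    remote = "https://github.com/apache/airflow-helm-chart.git"
    for argv in split_commands("chart", remote):
        print(shlex.join(argv))
```

Tags for the Helm and Docker releases would then be created in the target repos, on top of the pushed history.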

I agree that separate repos is a more "clean" approach. But I think it is
less convenient for development consistency.

J,




Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-01 Thread Deng Xiaodong
Thanks Kaxil.

Both ideas (having separate repos and having separate voting) sound sensible
to me. I cannot really think of any significant drawbacks.


XD





Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-01 Thread Kaxil Naik
Forgot to mention, having them in separate repos also helps in better
managing each individual artifact.

Each repo would have separate GitHub Issues where we can track the issues
specific to the Helm chart or Dockerfile.

Regards,
Kaxil



Re: Separate Repo vs MonoRepo for Dockerfile & Helm Chart

2020-07-01 Thread Kaxil Naik
The PMC also needs to agree if we want separate VOTING for Docker Image and
Helm chart. I think we do.

Regards,
Kaxil

On Wed, Jul 1, 2020 at 8:06 PM Kaxil Naik  wrote:

> Hi all,
>
> What do you all think about having Dockerfile and Helm chart in the same
> "Airflow" Repo vs separate?
>
> I feel having a separate repo for Airflow Dockerfile and Helm chart has
> more benefits, like easy tracking of changes (via Changelog), being easier
> for new contributors, and a separate release cadence.
>
> Currently, the Dockerfile and Helm Chart are inside the same repo and when we
> release the changelog for a new Airflow version, it would include all changes
> (Airflow + Dockerfile + Helm chart), which I think is not that great.
>
> Also having them all inside a single repo means changes in Helm Chart and
> Dockerfile can block Airflow release. We could use a stable Helm Chart
> version and Dockerfile version to test Airflow so that they are not blockers
> to the release.
>
> Happy to hear the thoughts from the community.
>
> Regards,
> Kaxil
>