One reason to have a monorepo is project branding and end-user
experience. But for the component development experience, it's nice to have a
small, dedicated repo.

I think the git submodule approach is technically sound, but it is at odds
with making the project easy to consume/understand from the end-user
perspective, especially if we expand the use of subprojects. Also, the main
Airflow commit graph would appear to be slowing down, which is bad for
Airflow brand perception.

Kubernetes has many sub-repos that are integrated into the main repo -
which I think could be the best of both worlds:
Example: https://github.com/kubernetes/kubernetes/tree/master/staging

I haven't dug in very deeply, and I won't pretend to understand how
challenging it may be to maintain this structure, but I'd support breaking
more components out of the main Airflow repo for dev purposes (for example,
in the future, it'd be nice to have airflow-cli, airflow-api,
airflow-scheduler, individual provider repos that are cleanly separated) as
long as we bring the commits/contributions back into the monorepo with
automation.

Maybe we could dive a little deeper into how K8s is operating, before going
with submodules?

-Ry




On Thu, Jul 2, 2020 at 11:24 AM Kaxil Naik <kaxiln...@gmail.com> wrote:

> Let's come to a consensus first before we do anything :-)
>
> Is everyone happy with the separate-repo approach? Let's wait for 72 hours to
> hear from everyone and then make a plan for how we do it. WDYT?
>
> But indeed, the git submodules approach sounds good. We do it for the *Airflow
> Site* too
> (https://github.com/apache/airflow-site/tree/master/landing-pages/site/themes).
>
> Regards,
> Kaxil
>
> On Thu, Jul 2, 2020 at 4:15 PM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
> > Absolutely - I am happy to add "best practices" and a short "how to do stuff
> > with git submodules" - and this knowledge will only be needed for
> > interacting with the prod image/Helm chart/running Kubernetes tests. For all
> > other purposes it should be "business as usual".
> >
> > On Thu, Jul 2, 2020 at 4:53 PM Daniel Imberman <daniel.imber...@gmail.com>
> > wrote:
> >
> > > I think git submodules sounds like a great idea. We would need to write
> > > this into CONTRIBUTING.md to let people know how to do it, but it’s a
> > > “teach once” situation.
> > >
> > > via Newton Mail
> > > [https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.14.6&source=email_footer_2]
> > > On Thu, Jul 2, 2020 at 2:44 AM, Tomasz Urbaszek <turbas...@apache.org>
> > > wrote:
> > > > I support the idea of separate repos. The git submodules mentioned by
> > > > Jarek sound like an interesting solution. It may add some complexity
> > > > for new contributors but it's not rocket science. If we agree on using
> > > > this, I think we should add a small how-to in contributing.rst (e.g. do I
> > > > have to have a fork of each repo?).
> > >
> > > As stressed previously if we go this route we should make sure we have
> > > nice testing of all those three components. Regarding the versioning,
> > > I have no strong opinion but I fully support using separate issues for
> > > airflow, docker, and helm.
> > >
> > > Tomek
> > >
> > >
> > > On Thu, Jul 2, 2020 at 9:26 AM Jarek Potiuk <jarek.pot...@polidea.com>
> > > wrote:
> > > >
> > > > On Thu, Jul 2, 2020 at 3:16 AM Daniel Imberman <
> > > daniel.imber...@gmail.com>
> > > > wrote:
> > > >
> > > > > > I’m fine with keeping it as three separate repos but merging testing
> > > > > > somehow (e.g. the source code chart would pull the helm/docker chart
> > > > > > into .build), but we need to do it in a way that doesn’t make testing too
> > > > > > difficult.
> > > > >
> > > > > > So for example: How do I test/integration test a change that involves a
> > > > > > change to all three and has to be done at the same time? Perhaps a user
> > > > > > can “register” a branch of helm and docker when they start up breeze? Or
> > > > > > perhaps we create a “parent” integration test that uses the three
> > > > > > together?
> > > > >
> > > >
> > > > Yes, those are exactly my concerns when splitting the repos.
> > > >
> > > > > I think testing for development should remain in the "airflow" repo. It
> > > > > is the "central one" in fact. I slept on it and I think using "released"
> > > > > versions for development testing will suffer from this "we need a change
> > > > > in all three of those" problem.
> > > >
> > > > But we have an easy solution I think.
> > > >
> > > > > I think that simply setting up submodules properly should do the job:
> > > > > https://git-scm.com/book/en/v2/Git-Tools-Submodules. They seem to be
> > > > > perfect for our case.
> > > >
> > > > > For those who have not used them: in short, submodules work by
> > > > > registering the "linked repos" and storing the related "hash" of the
> > > > > commit from each linked repo. For example, the "chart" folder will be a
> > > > > link to "apache/airflow-helm-chart". We can also move the prod Dockerfile
> > > > > to a subfolder and link it to the separate repo. Git submodule has a
> > > > > built-in mechanism to a) update to the latest version of the repo, b)
> > > > > commit your changes to the linked repo from there, which is all we need.
> > > > > I have used those a few times - I never liked submodules for sharing
> > > > > "library" code, but for sharing helm/Docker it seems perfect.
> > > >
> > > > > From the "regular" developer point of view - you do not need to get/update
> > > > > submodules if you do not need to use them - so for all development
> > > > > purposes, if you only change the "airflow" code, you would not even need
> > > > > to sync the chart or Dockerfile. You do "git checkout" as usual and it
> > > > > should work. So basically - no change for "regular" airflow development.
> > > >
> > > > > However, if you do need to work on helm + Docker + code, then you simply
> > > > > run "git submodule update", go to the linked "helm" or "docker" folder,
> > > > > check out the "master" version and start making changes. The only thing
> > > > > to remember when you want to push your changes is to do
> > > > > `git push --recurse-submodules=check` and it will make sure that all the
> > > > > repos are updated. It is a bit involved, but recent git versions have very
> > > > > good support for it, and it only needs to be used by people who work on
> > > > > airflow + docker + helm - all the others are unaffected.
> > > >
> > > > > From the CI perspective also nothing changes - when we check out the code
> > > > > we will include submodules and our test harness will be largely unchanged.
> > > > > Submodules provide us with the right mechanism for cross-dependency even
> > > > > if we use branches.
> > > >
> > > > > If everyone is OK with that - I am happy to set it up. With submodules
> > > > > we can switch to separate repos even without releasing helm and Prod
> > > > > chart "officially".
> > > >
> > > > J.
> > > >
> > > >
> > > >
> > > > >
> > > > > > On Wed, Jul 1, 2020 at 3:20 PM, Jarek Potiuk <jarek.pot...@polidea.com>
> > > > > > wrote:
> > > > > > Sure. We can work with such an approach. There will be some dependencies
> > > > > > that we might find problematic, but if we all see that it's worth
> > > > > > trying, there is a clear benefit in that it makes for a "clean"
> > > > > > split between those different "entities". And possibly once we release
> > > > > > the first versions of both image and chart, such problems will be rare
> > > > > > and easy to fix.
> > > > >
> > > > > > I personally think such a split is inevitable eventually; it's just a
> > > > > > matter of when to do it. If we decide to make this happen soon - I am more
> > > > > > than happy to work on making the split a reality.
> > > > >
> > > > > > One prerequisite to that is that all those - Helm Chart, Prod Image and
> > > > > > Airflow - are released in stable versions separately "officially" from
> > > > > > the current sources (otherwise there will be no way to test cross-repo).
> > > > >
> > > > > > I think for that we will need to agree on the versioning scheme and
> > > > > > cadence for the Image and Helm Chart, then copy sources from airflow and
> > > > > > release them as a "baseline", including setting up the tests for all of
> > > > > > those - then we can remove both Helm and Dockerfile from the airflow
> > > > > > repo. Happy to help with that if that's the direction we choose as a
> > > > > > community. It is important though that we keep the cross-repo testing
> > > > > > working. We have it working as of yesterday, so now the matter is:
> > > > > > whatever we do, we keep it running and have the development environment
> > > > > > support easy development and testing of any of the three (including CI
> > > > > > testing cross-repos). That's the only really important thing to me - the
> > > > > > rest is more of a technicality of how we link the repos, but the
> > > > > > principle remains.
> > > > >
> > > > > > Do we have an idea for the versioning scheme that we would like to use
> > > > > > for the Helm Chart and prod image?
> > > > > >
> > > > > > Should we make it CalVer <https://calver.org/overview.html> or SemVer
> > > > > > <https://semver.org/> (or some other scheme)? And how should we treat
> > > > > > the combinations with Airflow?
> > > > >
> > > > > > My thoughts (but I have no strong opinions as long as someone proposes
> > > > > > more sensible versioning schemes):
> > > > > >
> > > > > > 1) Airflow code - we continue the release scheme we have (with deciding
> > > > > > on the 2.* scheme for the release). I expect in the future we might
> > > > > > decide on doing branches or patches, so for 2.* I'd opt for going with a
> > > > > > full SemVer approach and patches released from branches.
> > > > > >
> > > > > > 2) I believe that the Helm Chart can be versioned with its own version
> > > > > > (then you specify the image version as a helm parameter). For the Helm
> > > > > > Chart I think CalVer might be OK as I do not expect any branching/patches
> > > > > > in the future - I'd expect that there will be a single stream of releases.
> > > > > >
> > > > > > 3) Dockerfile (+ related files such as .dockerignore, empty dir,
> > > > > > entrypoints etc.). I do not imagine a lot of branching for those - we
> > > > > > should be able to release a new version of a Dockerfile (+ related files)
> > > > > > working with nearly any earlier Airflow release, so CalVer seems like a
> > > > > > good choice.
> > > > > >
> > > > > > 4) Image versioning becomes a bit more complex because the image tag is
> > > > > > always a combination of:
> > > > > > * Dockerfile (+ related files) version
> > > > > > * Airflow Version
> > > > > > * Python Version
> > > > >
> > > > > An example versioning I can imagine:
> > > > >
> > > > > > *Airflow*: 1.10.11, 1.10.12, 2.0.0, 2.1.0, 2.1.1 - patch level (if we
> > > > > > decide to have patches).
> > > > > > *Dockerfile*: 2020.07.12, 2020.08.20, ... -> depending on when we release
> > > > > > them
> > > > > > *Helm Chart*: 2020.07.10, 2020.08.09, ... Each Helm Chart has a minimum
> > > > > > version of both Dockerfile and Airflow it works with.
> > > > >
> > > > > *Example Docker Image tags:*
> > > > > > apache/airflow:dockerfile2020.07.10-airflow1.10.10-python3.6
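
A minimal sketch of how such a tag could be assembled, assuming the three versions are simply joined in the order shown in the example tag. The helper name `compose_image_tag` and its `repository` default are illustrative only and are not part of any existing Airflow tooling:

```python
def compose_image_tag(dockerfile_version, airflow_version, python_version,
                      repository="apache/airflow"):
    """Build an image tag from the three versions (hypothetical helper).

    The layout mirrors the example tag in this thread:
    <repository>:dockerfile<CalVer>-airflow<version>-python<version>
    """
    parts = {
        "dockerfile": dockerfile_version,
        "airflow": airflow_version,
        "python": python_version,
    }
    for name, value in parts.items():
        if not value:
            raise ValueError(f"missing {name} version")
    # Join "<name><version>" pairs with hyphens, in the order given above.
    suffix = "-".join(f"{name}{value}" for name, value in parts.items())
    return f"{repository}:{suffix}"


print(compose_image_tag("2020.07.10", "1.10.10", "3.6"))
```

Keeping the component names inside the tag makes it possible to read off which Dockerfile, Airflow and Python versions an image was built from, at the cost of fairly long tags.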
> > > > >
> > > > > WDYT?
> > > > >
> > > > > J,
> > > > >
> > > > >
> > > > > On Wed, Jul 1, 2020 at 11:12 PM Kaxil Naik <kaxiln...@gmail.com>
> > > wrote:
> > > > >
> > > > > > I think we should have "separate repos for development" too.
> > > > > >
> > > > > > 3 Repos in total:
> > > > > >
> > > > > > 1) apache/airflow
> > > > > > 2) apache/airflow-docker-image
> > > > > > 3) apache/airflow-helm-chart
> > > > > >
> > > > > >
> > > > > > > (1) *apache/airflow* should use a pinned stable version of the Airflow
> > > > > > > Helm chart to run Kubernetes tests
> > > > > > > (2) *apache/airflow* already has a *Dockerfile.ci* file which it can use
> > > > > > > to run airflow tests on docker images.
> > > > > > > (3) *apache/airflow-docker-image* should use the latest available stable
> > > > > > > version of airflow
> > > > > > > (4) *apache/airflow-helm-chart* should use the latest available stable
> > > > > > > version of airflow
> > > > > >
> > > > > > > > Having such a split also makes some updates more difficult - for
> > > > > > > > example, if we add a new "extra" to Airflow that requires installing an
> > > > > > > > "apt" dependency in the Dockerfile, we will have to split it into first
> > > > > > > > adding the dependency to the Dockerfile and, once it is merged, adding
> > > > > > > > the extra to airflow in setup.py.
> > > > > >
> > > > > >
> > > > > > > Adding a new extra to setup.py would not (and should not) impact the
> > > > > > > development of *apache/airflow-docker-image*.
> > > > > > > Once an RC is cut for apache/airflow or after a new version is released
> > > > > > > for apache/airflow, we can work on supporting the new airflow version in
> > > > > > > the Production Docker Image.
> > > > > > > While doing that we can add all the libraries that are needed by the new
> > > > > > > Airflow Version and we will have a clean commit history and changelog
> > > > > > > for the Docker image.
> > > > > >
> > > > > > > We definitely do not need to work in parallel on both repos. By doing
> > > > > > > development in a separate repo we keep consistent "source" files and we
> > > > > > > can release each artifact with a separate cadence. If someone discovers
> > > > > > > a bug in a newly released Docker image, we should easily be able to cut
> > > > > > > a new release with the patch without worrying about how development is
> > > > > > > going in the apache/airflow repo.
> > > > > >
> > > > > >
> > > > > > > *Apache Flink & Apache CouchDB* do it in a similar manner:
> > > > > > >
> > > > > > > https://github.com/apache/flink & https://github.com/apache/flink-docker
> > > > > > > https://github.com/apache/couchdb &
> > > > > > > https://github.com/apache/couchdb-docker
> > > > > >
> > > > > > Regards,
> > > > > > Kaxil
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Jul 1, 2020 at 9:50 PM Jarek Potiuk <
> > > jarek.pot...@polidea.com>
> > > > > > wrote:
> > > > > >
> > > > > > > > I do not think it's only the question of Mono/Multi repos. While I
> > > > > > > > clearly see the benefit of separate repos I also see some drawbacks.
> > > > > > > >
> > > > > > > > And if it bothers others, I am happy to follow the majority. If we
> > > > > > > > think that a bit more complexity in testing justifies separating those
> > > > > > > > three completely and having a more "clean" split - it's also workable
> > > > > > > > but IMHO introduces certain complexity in development.
> > > > > > >
> > > > > > > > However, I think this is not 0/1: a kind of Hybrid approach might, in
> > > > > > > > my opinion, be the best of both worlds - development and releases.
> > > > > > >
> > > > > > > Let me explain what I mean by "Hybrid":
> > > > > > >
> > > > > > > > I think we definitely should have separate repositories to release
> > > > > > > > those artifacts and I think there is no doubt about it:
> > > > > > > >
> > > > > > > > * airflow (apache/airflow)
> > > > > > > > * prod docker image (apache/airflow-docker)
> > > > > > > > * helm chart (apache/airflow-helm)
> > > > > > > > * api clients (we already have separate repos for those)
> > > > > > > >   (apache/airflow-client-*)
> > > > > > >
> > > > > > > > I think the only question is where we develop all those (develop !=
> > > > > > > > release). There are certain benefits of having a single "master" (let's
> > > > > > > > call it "development" further) for all those artifacts. Currently the
> > > > > > > > "development" version for all of those is in one repo - and while
> > > > > > > > developing one depends on the other, we also test all of those together,
> > > > > > > > and this means that the "current best" set of airflow sources (including
> > > > > > > > dependencies in setup.py), Dockerfile and Helm chart work together. This
> > > > > > > > means for example that you will not be able to break the Helm Chart by
> > > > > > > > changing anything that the helm chart depends on in airflow. For example
> > > > > > > > if you change "airflow webserver" into "airflow server" the current helm
> > > > > > > > chart will break. Similarly if you change entrypoint.sh in the Docker
> > > > > > > > image in a way that is not compatible with the Helm chart, we will not
> > > > > > > > let that happen - the CI tests will break if either of those changes in
> > > > > > > > an incompatible way. And we can have dependencies in any direction
> > > > > > > > between those three. When we see a commit break any of the three - we
> > > > > > > > can make a decision about what to do - either accept and document the
> > > > > > > > incompatibility or fix it.
> > > > > > >
> > > > > > > > Of course keeping that property (testing it all together) is also
> > > > > > > > possible if they are in completely separate repos. There are several
> > > > > > > > cross-dependencies - Docker image building depends on dependencies in
> > > > > > > > setup.py for example, you cannot build the Docker image from only the
> > > > > > > > Dockerfile without the sources of airflow, nor build and test helm
> > > > > > > > charts without the image (and sources - because that's where the current
> > > > > > > > kubernetes tests are). If we want to continue doing it for both Helm and
> > > > > > > > Dockerfile, we would have to basically check out the latest sources of
> > > > > > > > Airflow and run the CI tests before merging any Docker or Helm Chart
> > > > > > > > changes, and the opposite - we will have to download the Dockerfile/Helm
> > > > > > > > chart and build the image/install the Helm chart when we are running CI
> > > > > > > > tests for Airflow. This is possible and we could do it, but it adds
> > > > > > > > complexity to the build/CI process.
> > > > > > >
> > > > > > > > Having such a split also makes some updates more difficult - for
> > > > > > > > example, if we add a new "extra" to Airflow that requires installing an
> > > > > > > > "apt" dependency in the Dockerfile, we will have to split it into first
> > > > > > > > adding the dependency to the Dockerfile and, once it is merged, adding
> > > > > > > > the extra to airflow in setup.py. This makes it quite difficult to test
> > > > > > > > it together though (the Dockerfile change can only be tested fully after
> > > > > > > > merging it to master). Not to mention the complexity of managing
> > > > > > > > different versions - your local development Dockerfile version vs
> > > > > > > > sources of Airflow for example. Imagine switching between branches where
> > > > > > > > you add two different apt dependencies to the Dockerfile. There are more
> > > > > > > > similar scenarios I can imagine - especially for parallel changes in
> > > > > > > > those repos.
> > > > > > >
> > > > > > > > It is of course doable to keep them separate, but it is quite a bit
> > > > > > > > more complex to set up (especially for a consistent development
> > > > > > > > environment) when you have separate repos, and preventing cross-breaking
> > > > > > > > changes might be more difficult.
> > > > > > >
> > > > > > > > I believe that the best way is to continue developing airflow + image +
> > > > > > > > chart in one repo - airflow, but release them from those separate
> > > > > > > > repos.
> > > > > > >
> > > > > > > > The Airflow source release does not have to contain either the chart or
> > > > > > > > the image. And even if it contains sources for those, they are not the
> > > > > > > > final "artifacts" (installable image and installable helm chart).
> > > > > > > > Whenever we decide to release either of them - we test it in
> > > > > > > > "development". Then only when it is tested, we copy the sources to those
> > > > > > > > separate repos and release them.
> > > > > > >
> > > > > > > > With git - we can even do it very easily while preserving the history
> > > > > > > > of commits (been there, done that). And then we could release the Helm
> > > > > > > > chart and Docker image separately based on the commits and tags in those
> > > > > > > > separate repositories.
> > > > > > >
> > > > > > > > I agree that separate repos is a more "clean" approach. But I think it
> > > > > > > > is less convenient for development consistency.
> > > > > > >
> > > > > > > J,
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Wed, Jul 1, 2020 at 9:35 PM Kaxil Naik <kaxiln...@gmail.com>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > > Forgot to mention, having them in separate repos also helps in better
> > > > > > > > > managing each individual artifact.
> > > > > > > > >
> > > > > > > > > Each repo would have separate GitHub Issues where we can track the
> > > > > > > > > issues specific to the Helm chart or Dockerfile.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Kaxil
> > > > > > > >
> > > > > > > > On Wed, Jul 1, 2020 at 8:30 PM Kaxil Naik <
> kaxiln...@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > > The PMC also needs to agree if we want separate VOTING for the Docker
> > > > > > > > > > Image and Helm chart. I think we do.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Kaxil
> > > > > > > > >
> > > > > > > > > On Wed, Jul 1, 2020 at 8:06 PM Kaxil Naik <
> > kaxiln...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Hi all,
> > > > > > > > >>
> > > > > > > > > >> What do you all think about having Dockerfile and Helm chart in the
> > > > > > > > > >> same "Airflow" Repo vs separate?
> > > > > > > > > >>
> > > > > > > > > >> I feel having a separate repo for the Airflow Dockerfile and Helm chart
> > > > > > > > > >> has more benefits, like easy-to-track changes (via Changelog), being
> > > > > > > > > >> easier for new contributors, and a separate release cadence.
> > > > > > > > > >>
> > > > > > > > > >> Currently, the Dockerfile and Helm Chart are inside the same repo and
> > > > > > > > > >> when we release the changelog for a new Airflow version, it would
> > > > > > > > > >> include all changes (Airflow + Dockerfile + Helm chart), which I think
> > > > > > > > > >> is not that great.
> > > > > > > > > >>
> > > > > > > > > >> Also having them all inside a single repo means changes in the Helm
> > > > > > > > > >> Chart and Dockerfile can block an Airflow release. We could use a stable
> > > > > > > > > >> Helm Chart version and Dockerfile version to test Airflow so that they
> > > > > > > > > >> are blockers to release too.
> > > > > > > > >>
> > > > > > > > >> Happy to hear the thoughts from the community.
> > > > > > > > >>
> > > > > > > > >> Regards,
> > > > > > > > >> Kaxil
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Jarek Potiuk
> > > > > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > > > >
> > > > > > > M: +48 660 796 129 <+48660796129>
> > > > > > > [image: Polidea] <https://www.polidea.com/>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Jarek Potiuk
> > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > >
> > > > > M: +48 660 796 129 <+48660796129>
> > > > > [image: Polidea] <https://www.polidea.com/>
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129 <+48660796129>
> > > > [image: Polidea] <https://www.polidea.com/>
> >
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
>
