*tl;dr*

   1. The GitHub Action for building docs is running out of disk space. I
   think we should archive the oldest documentation for large packages to
   cloud storage.
   2. Contributing to and building Airflow docs is hard. We should migrate
   to a framework, preferably one that uses Markdown (although I acknowledge
   an rst -> md conversion would be a massive overhaul).

*Problem Summary*
I recently set out to implement what I thought would be a straightforward
feature: warn users when they are viewing documentation for non-current
versions of Airflow and link them to the current/stable version
<https://github.com/apache/airflow/pull/34639>. Jed pointed me to the
airflow-site <https://github.com/apache/airflow-site> repo, which contains
all of the archived docs (that is, documentation for non-current versions),
and from there, I ran into a brick wall.

I want to raise some concerns I've developed after trying to contribute a
couple of reasonably small docs updates:

   1. airflow-site
      1. Elad pointed out the problem posed by the sheer size of the
      archived docs
      <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943>
      (more on this later).
      2. The airflow-site repo is confusing and rather poorly documented:
         1. Hugo (the static site generator) exists, but appears to be
         used only for the landing pages.
         2. To view any documentation locally other than the landing
         pages, you need to run the site.sh script and then copy the
         output from one directory to another.
      3. All of the archived docs are raw HTML, which makes migrating to
      a static site generator a significant challenge, and that in turn
      makes it difficult to stop the archived docs from growing and
      growing. Perhaps this is the wheel Khaleesi was referring to
      <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
   2. airflow
      1. Building Airflow docs is a challenge. A build takes several
      minutes and doesn't support auto-rebuild, so the slightest issue can
      mean waiting through build after build until the changes are just
      so. I tried implementing sphinx-autobuild
      <https://github.com/executablebooks/sphinx-autobuild> to no avail.
      2. Sphinx/reStructuredText has a steep learning curve.

*The most acute issue: disk space*
The size of the archived docs is causing the docs build GitHub Action to
nearly run out of space. In the "Build site" Action from a couple of weeks
ago
<https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458>
(expand the "Build site" step, scroll all the way to the bottom, and expand
the `df -h` command), we can see the GitHub Actions runner is nearly out of
disk space:

df -h
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/root        84G   82G  2.1G  98% /


The available space is down to 1.8G in the most recent Action
<https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>.
If that trend holds, we have about two months before the Action runner runs
out of disk space. Here's a breakdown of the space consumed by the 10
largest package documentation directories:

du -h -d 1 docs-archive/ | sort -h -r
 14G  docs-archive/
4.0G  docs-archive//apache-airflow-providers-google
3.2G  docs-archive//apache-airflow
1.7G  docs-archive//apache-airflow-providers-amazon
560M  docs-archive//apache-airflow-providers-microsoft-azure
254M  docs-archive//apache-airflow-providers-cncf-kubernetes
192M  docs-archive//apache-airflow-providers-apache-hive
153M  docs-archive//apache-airflow-providers-snowflake
139M  docs-archive//apache-airflow-providers-databricks
104M  docs-archive//apache-airflow-providers-docker
101M  docs-archive//apache-airflow-providers-mysql
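
If the docs-archive layout is (as it appears) one subdirectory per released
version under each package, a version-sorted listing gives a quick sense of
how many old versions a cutoff would remove. A minimal sketch (package path
taken from the breakdown above; `sort -V` assumes GNU sort):

```shell
# List the per-version subdirectories of one package, oldest version first,
# to see what an archive cutoff would affect. Assumes the layout is
# docs-archive/<package>/<version>/.
find docs-archive/apache-airflow -mindepth 1 -maxdepth 1 -type d | sort -V
```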


*Proposed solution: archive old docs HTML for large packages to cloud
storage*
I'm wondering if it would be reasonable to truly archive the docs for the
older versions of these packages, perhaps keeping only the last 18 months
of versions in the repo. We could drop the HTML in a blob storage bucket,
with instructions for rebuilding the docs if they're absolutely needed.
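
Roughly, the archival step could look like the sketch below. This is only
illustrative: the package name, version, and bucket are placeholders, and
the upload command is just an example of what an S3-style copy might look
like.

```shell
# Hypothetical sketch: compress one old version's HTML, upload the tarball
# to cloud storage, then drop the directory from the repo. Package, version,
# and bucket names are all placeholders.
pkg="apache-airflow"
v="1.10.15"
tar -czf "${pkg}-docs-${v}.tar.gz" "docs-archive/${pkg}/${v}"
# Upload step, e.g.: aws s3 cp "${pkg}-docs-${v}.tar.gz" "s3://<archive-bucket>/${pkg}/"
rm -r "docs-archive/${pkg}/${v}"
```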

*Improving docs building moving forward*
There's an open Issue <https://github.com/apache/airflow-site/issues/719>
for migrating the docs to a framework, but it's not at all a
straightforward task for the archived docs. I think we should institute a
policy of archiving old documentation to cloud storage after X time, and
use a framework for building docs in a scalable and sustainable way going
forward. Maybe we could chat with the Iceberg folks about how they moved
from MkDocs to Hugo? <https://github.com/apache/iceberg/issues/3616>


Shoutout to Utkarsh for helping me through all this!
