I am not happy about removing "old" docs. There are still users on older
versions, but given the situation I am not sure what other options we have.
Maybe we should cut from a specific provider rather than from all of them?
Why does the Google provider consume 4 GB while Amazon consumes only 1.7
GB? Is there a specific part of the providers that occupies most of the
space? Maybe the auto-generated files for all the classes?
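A rough way to check (an untested sketch; the version directory and the
_api/ path are my guesses, not verified) would be something like:

# which archived versions of the Google provider docs are biggest?
du -h -d 1 docs-archive/apache-airflow-providers-google/ | sort -h -r | head

# within a single version, which subtrees dominate
# (e.g. the auto-generated _api/ class pages)?
du -h -d 1 docs-archive/apache-airflow-providers-google/10.10.0/ | sort -h -r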
On Thu, Oct 19, 2023 at 1:38 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Yes. Moving the old versions to somewhere that we can keep/archive static
> historical versions of those docs and publish them from there. What you
> proposed is exactly the solution I thought might be best as well.
>
> It would be a great task to contribute to the stability of our docs
> generation in the future.
>
> I don't think it's a matter of discussing in detail how to do it (18
> months is a good start, and you can parameterize it); it's a matter of
> someone committing to it and simply doing it :).
>
> So yes, I personally am all for it, and if I understand correctly that
> you are looking for agreement on doing it, big +1 from my side - happy to
> help with providing access to our S3 buckets.
>
> J.
>
> On Thu, Oct 19, 2023 at 5:39 AM Ryan Hatter
> <ryan.hat...@astronomer.io.invalid> wrote:
>
> > *tl;dr*
> >
> > 1. The GitHub Action for building docs is running out of space. I
> > think we should archive really old documentation for large packages to
> > cloud storage.
> > 2. Contributing to and building Airflow docs is hard. We should
> > migrate to a framework, preferably one that uses Markdown (although I
> > acknowledge rst -> md would be a massive overhaul).
> >
> > *Problem Summary*
> > I recently set out to implement what I thought would be a
> > straightforward feature: warn users when they are viewing
> > documentation for non-current versions of Airflow and link them to the
> > current/stable version <https://github.com/apache/airflow/pull/34639>.
> > Jed pointed me to the airflow-site
> > <https://github.com/apache/airflow-site> repo, which contains all of
> > the archived docs (that is, documentation for non-current versions),
> > and from there, I ran into a brick wall.
> >
> > I want to raise some concerns that I've developed after trying to
> > contribute what feel like a couple of reasonably small docs updates:
> >
> > 1. airflow-site
> >    1. Elad pointed out the problem posed by the sheer size of the
> >    archived docs
> >    <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943>
> >    (more on this later).
> >    2. The airflow-site repo is confusing and rather poorly documented.
> >       1. Hugo (a static site generator) exists, but appears to be used
> >       only for the landing pages.
> >       2. To view any documentation locally other than the landing
> >       pages, you have to run the site.sh script and then copy the
> >       output from one directory to another.
> >    3. All of the archived docs are raw HTML, which makes migrating to
> >    a static site generator a significant challenge, and that in turn
> >    makes it hard to stop the archived docs from growing indefinitely.
> >    Perhaps this is the wheel Khaleesi was referring to
> >    <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
> > 2. airflow
> >    1. Building Airflow docs is a challenge. A build takes several
> >    minutes and doesn't support auto-rebuild, so the slightest issue
> >    can mean waiting again and again until the changes are just so. I
> >    tried setting up sphinx-autobuild
> >    <https://github.com/executablebooks/sphinx-autobuild> to no avail
> >    (see the sketch just after this list).
> >    2. Sphinx/reStructuredText has a steep learning curve.
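> >
> > For reference, the invocation I'd expect to work (a sketch based on
> > sphinx-autobuild's standard usage, not something I actually got
> > running; the source and output paths are assumptions) is roughly:
> >
> > pip install sphinx-autobuild
> > # rebuild and serve the core docs on file changes; note that Airflow's
> > # real docs build goes through docs/build_docs.py, not plain Sphinx
> > sphinx-autobuild docs/apache-airflow/ docs/_build/html/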
> >
> > *The most acute issue: disk space*
> > The size of the archived docs is causing the docs-build GitHub Action
> > to nearly run out of space. In the "Build site" Action from a couple
> > of weeks ago
> > <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458>
> > (expand the build site step, scroll all the way to the bottom, and
> > expand the `df -h` command), we can see that the GitHub Actions runner
> > is nearly out of disk space:
> >
> > df -h
> > *Filesystem   Size  Used  Avail  Use%  Mounted on*
> > /dev/root     84G   82G   2.1G   98%   /
> >
> > The available space is down to 1.8G in the most recent Action
> > <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>.
> > If that trend holds, we have about two months before the Action runner
> > runs out of disk space. Here's a breakdown of the space consumed by
> > the 10 largest package documentation directories:
> >
> > du -h -d 1 docs-archive/ | sort -h -r
> > *14G*  docs-archive/
> > *4.0G* docs-archive//apache-airflow-providers-google
> > *3.2G* docs-archive//apache-airflow
> > *1.7G* docs-archive//apache-airflow-providers-amazon
> > *560M* docs-archive//apache-airflow-providers-microsoft-azure
> > *254M* docs-archive//apache-airflow-providers-cncf-kubernetes
> > *192M* docs-archive//apache-airflow-providers-apache-hive
> > *153M* docs-archive//apache-airflow-providers-snowflake
> > *139M* docs-archive//apache-airflow-providers-databricks
> > *104M* docs-archive//apache-airflow-providers-docker
> > *101M* docs-archive//apache-airflow-providers-mysql
> >
> > *Proposed solution: archive old docs HTML for large packages to cloud
> > storage*
> > I'm wondering if it would be reasonable to truly archive the docs for
> > some of the older versions of these packages. Perhaps everything older
> > than the last 18 months? Maybe we could drop the HTML in a blob
> > storage bucket, with instructions for building the docs if they're
> > absolutely needed? (A rough sketch is at the end of this email.)
> >
> > *Improving docs building moving forward*
> > There's an open Issue
> > <https://github.com/apache/airflow-site/issues/719> for migrating the
> > docs to a framework, but it's not at all a straightforward task for
> > the archived docs. I think we should institute a policy of archiving
> > old documentation to cloud storage after X time and use a framework
> > for building docs in a scalable and sustainable way moving forward.
> > Maybe we could chat with the Iceberg folks about how they moved from
> > mkdocs to Hugo? <https://github.com/apache/iceberg/issues/3616>
> >
> > Shoutout to Utkarsh for helping me through all this!
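> >
> > P.S. To make the archiving step concrete, it could look roughly like
> > this (an untested sketch; the bucket name and the version directory
> > are placeholders, not anything that exists today):
> >
> > # copy one old version's HTML to a (hypothetical) archive bucket
> > aws s3 sync docs-archive/apache-airflow-providers-google/1.0.0/ \
> >     s3://airflow-docs-archive/apache-airflow-providers-google/1.0.0/
> > # then remove it from the repo so the site build no longer carries it
> > git rm -r docs-archive/apache-airflow-providers-google/1.0.0/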