I am not happy about removing "old" docs.
There are still users on older versions, but given the situation I am not
sure what other option we have.

Maybe we should cut from a specific provider rather than from all of them?
Why does the Google provider consume 4 GB while Amazon consumes only 1.7 GB?
Is there a specific part of the providers that occupies most of the space?
Maybe the auto-generated files for all classes?
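
A quick way to check would be something like the sketch below (the _api
location for the auto-generated class pages is an assumption on my part):

  du -h -d 2 docs-archive/apache-airflow-providers-google | sort -h -r | head
  # sum just the auto-generated API reference pages across archived versions
  find docs-archive/apache-airflow-providers-google -type d -name '_api' \
      -exec du -sh {} + | sort -h -r | head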


On Thu, Oct 19, 2023 at 1:38 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Yes. Moving the old versions somewhere we can keep/archive static
> historical docs and publish them from there. What you proposed is exactly
> the solution I thought might be best as well.
>
> It would be a great task to contribute to the stability of our docs
> generation in the future.
>
> I don't think it's a matter of discussing in detail how to do it (18 months
> is a good start, and you can parameterize it); it's a matter of someone
> committing to it and simply doing it :).
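>
> For example, the cutoff could be parameterized along these lines (a rough
> sketch; it assumes a version's age can be read from the last git commit
> that touched its directory):
>
>   CUTOFF=$(date -d '18 months ago' +%s)  # GNU date
>   for d in docs-archive/apache-airflow-providers-google/*/; do
>     # timestamp of the last commit that touched this version's docs
>     last=$(git log -1 --format=%ct -- "$d")
>     [ "$last" -lt "$CUTOFF" ] && echo "candidate for archiving: $d"
>   done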
>
> So yes, I personally am all for it, and if I understand correctly that you
> are looking for agreement on doing it, big +1 from my side. Happy to help
> with providing access to our S3 buckets.
>
> J.
>
> On Thu, Oct 19, 2023 at 5:39 AM Ryan Hatter
> <ryan.hat...@astronomer.io.invalid> wrote:
>
> > *tl;dr*
> >
> >    1. The GitHub Action for building docs is running out of space. I think
> >    we should archive really old documentation for large packages to cloud
> >    storage.
> >    2. Contributing to and building Airflow docs is hard. We should migrate
> >    to a framework, preferably one that uses markdown (although I
> >    acknowledge rst -> md will be a massive overhaul).
> >
> > *Problem Summary*
> > I recently set out to implement what I thought would be a straightforward
> > feature: warn users when they are viewing documentation for non-current
> > versions of Airflow and link them to the current/stable version
> > <https://github.com/apache/airflow/pull/34639>. Jed pointed me to the
> > airflow-site <https://github.com/apache/airflow-site> repo, which contains
> > all of the archived docs (that is, documentation for non-current versions),
> > and from there, I ran into a brick wall.
> >
> > I want to raise some concerns that I've developed after trying to
> > contribute what feel like a couple of reasonably small docs updates:
> >
> >    1. airflow-site
> >       1. Elad pointed out the problem posed by the sheer size of archived
> >       docs
> >       <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943>
> >       (more on this later).
> >       2. The airflow-site repo is confusing and rather poorly documented.
> >          1. Hugo (the static site generator) exists, but appears to only
> >          be used for the landing pages.
> >          2. In order to view any documentation locally other than the
> >          landing pages, you'll need to run the site.sh script and then
> >          copy the output from one dir to another?
> >       3. All of the archived docs are raw HTML, which makes migrating to
> >       a static site generator a significant challenge and makes it hard
> >       to stop the archived docs from growing and growing. Perhaps this is
> >       the wheel Khaleesi was referring to
> >       <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
> >    2. airflow
> >       1. Building Airflow docs is a challenge. It takes several minutes
> >       and doesn't support auto-build, so the slightest issue could require
> >       waiting again and again until the changes are just so. I tried
> >       implementing sphinx-autobuild
> >       <https://github.com/executablebooks/sphinx-autobuild> to no avail
> >       (see the sketch after this list).
> >       2. Sphinx/reStructuredText has a steep learning curve.
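> >
> > For reference, a typical sphinx-autobuild invocation looks roughly like
> > the sketch below (the paths are illustrative, not the actual Airflow docs
> > layout; the Airflow build goes through its own tooling, which may be why
> > this didn't work for me):
> >
> >   pip install sphinx-autobuild
> >   # rebuild on every change and serve the result on http://127.0.0.1:8000
> >   sphinx-autobuild docs/apache-airflow docs/_build/html --port 8000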
> >
> > *The most acute issue: disk space*
> > The size of the archived docs is causing the docs build GitHub Action to
> > almost run out of space. From the "Build site" Action from a couple of
> > weeks ago
> > <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458>
> > (expand the build site step, scroll all the way to the bottom, expand the
> > `df -h` command), we can see the GitHub Actions runner is nearly out of
> > space:
> >
> > df -h
> >   *Filesystem      Size  Used Avail Use% Mounted on*
> >   /dev/root        84G   82G  2.1G  98% /
> >
> >
> > The available space is down to 1.8G on the most recent Action
> > <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>.
> > If we assume that trend is accurate, we have about two months before the
> > Action runner runs out of disk space. Here's a breakdown of the space
> > consumed by the 10 largest package documentation directories:
> >
> > du -h -d 1 docs-archive/ | sort -h -r
> > * 14G* docs-archive/
> > *4.0G* docs-archive//apache-airflow-providers-google
> > *3.2G* docs-archive//apache-airflow
> > *1.7G* docs-archive//apache-airflow-providers-amazon
> > *560M* docs-archive//apache-airflow-providers-microsoft-azure
> > *254M* docs-archive//apache-airflow-providers-cncf-kubernetes
> > *192M* docs-archive//apache-airflow-providers-apache-hive
> > *153M* docs-archive//apache-airflow-providers-snowflake
> > *139M* docs-archive//apache-airflow-providers-databricks
> > *104M* docs-archive//apache-airflow-providers-docker
> > *101M* docs-archive//apache-airflow-providers-mysql
> >
> >
> > *Proposed solution: Archive old docs HTML for large packages to cloud
> > storage*
> > I'm wondering if it would be reasonable to truly archive the docs for
> > some of the older versions of these packages. Perhaps keep only the last
> > 18 months? Maybe we could drop the HTML in a blob storage bucket with
> > instructions for building the docs if absolutely necessary?
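> >
> > Concretely, that could be something like the sketch below (the bucket
> > name is made up, and the version is just an example; picking which
> > versions to move is the real work):
> >
> >   # copy one old version's HTML to blob storage, then drop it from the repo
> >   aws s3 sync docs-archive/apache-airflow-providers-google/1.0.0/ \
> >       s3://airflow-docs-archive/apache-airflow-providers-google/1.0.0/
> >   git rm -r docs-archive/apache-airflow-providers-google/1.0.0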
> >
> > *Improving docs building moving forward*
> > There's an open Issue
> > <https://github.com/apache/airflow-site/issues/719> for migrating the
> > docs to a framework, but it's not at all a straightforward task for the
> > archived docs. I think we should institute a policy of archiving old
> > documentation to cloud storage after X time and use a framework for
> > building docs in a scalable and sustainable way moving forward. Maybe we
> > could chat with the Iceberg folks about how they moved from MkDocs to
> > Hugo? <https://github.com/apache/iceberg/issues/3616>
> >
> >
> > Shoutout to Utkarsh for helping me through all this!
> >
>
