pankajkoti commented on issue #773: URL: https://github.com/apache/airflow-site/issues/773#issuecomment-1525852259
I got a chance today to read more about our setup in the repo and studied the `build.yml` and `site.sh` scripts.

### Understanding so far

The build jobs create a `dist` folder when we run the `./site.sh build-site` command. The `dist` folder is roughly `10.3GB` on the current main branch when I build it locally. The `docs` folder inside `dist` accounts for almost all of that space and also reads `~10.3GB` at the moment, so all other directories are negligible relative to `docs`.

`du` on root folder
<img width="685" alt="Screenshot 2023-04-27 at 7 44 06 PM" src="https://user-images.githubusercontent.com/10206082/234889559-4e0c367f-1926-40d0-bffb-dc0eb7460a8b.png">

`du` on the `dist` folder
<img width="601" alt="Screenshot 2023-04-27 at 7 44 30 PM" src="https://user-images.githubusercontent.com/10206082/234889667-b5b104e5-e883-44df-a78c-113c5a5131d8.png">

`du` on the `dist/docs` folder
<img width="763" alt="Screenshot 2023-04-27 at 8 17 37 PM" src="https://user-images.githubusercontent.com/10206082/234900059-326951fb-baae-49dd-846c-7cc1d19af6fc.png">

GitHub runners guarantee at least 14GB of disk space for the runs: https://github.com/actions/runner-images/issues/2840#issuecomment-791177163.

What I understood is that when we create a PR, [this line in our CI build](https://github.com/apache/airflow-site/blob/7f3efc9d50467c79509952c095efb8afc429289e/.github/workflows/build.yml#L66) removes the docs folder before proceeding to the next steps, and as a result the build job would hardly ever fail for PRs.
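
The per-folder `du` breakdown above could also be reproduced with a short script, which may help when deciding how to split the docs. This is only a minimal sketch: the `dist/docs` path assumes the build has already been run locally, the function names are made up for illustration, and it sums file sizes rather than disk blocks, so the numbers can differ slightly from what `du` reports.

```python
import os
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Sum the sizes of all regular files under path (symlinked files are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def sizes_by_subdir(parent: Path) -> list[tuple[str, int]]:
    """Return (name, bytes) for each immediate subdirectory, largest first."""
    sizes = [
        (child.name, dir_size_bytes(child))
        for child in parent.iterdir()
        if child.is_dir()
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumed location of the generated docs after ./site.sh build-site
    docs = Path("dist/docs")
    if docs.is_dir():
        for name, size in sizes_by_subdir(docs):
            print(f"{size / 1024**3:8.2f} GiB  {name}")
```

Sorting largest first makes it easy to see which folders dominate the total.
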
But when we merge the PR into main, this huge `docs` folder is not removed. When we tried to deploy the website here: https://github.com/apache/airflow-site/blob/7f3efc9d50467c79509952c095efb8afc429289e/.github/workflows/build.yml#L88, the [`Deploy website on asf-site branch` GitHub Actions job](https://github.com/apache/airflow-site/actions/runs/4243773508/jobs/7376961974) failed with a disk-out-of-space error while copying the `dist` folder to the `gh-pages` branch of our repository using the wrapper action `apache/airflow-JamesIves-github-pages-deploy-action` (the base action is https://github.com/JamesIves/github-pages-deploy-action). I believe our website is deployed from the `gh-pages` branch and all the content available in it gets published, as per my chat with ChatGPT :)

### Solution Proposal (Theory)

We can replicate the setup, including the CI and files/folders, from this repo into https://github.com/apache/airflow-site-archive with the following tweaks.

1. Split the [docs-archive](https://github.com/apache/airflow-site/tree/7f3efc9d50467c79509952c095efb8afc429289e/docs-archive) folder, which becomes the `docs` folder during the build (occupying ~10.3GB), and copy part of it to the new repo using either of the approaches below:
   a. Keep certain providers in this repo and the remaining providers in the new repo, based on how much space each provider occupies, as can be seen in the above screenshot of the `dist/docs` directory.
   b. Keep all providers in both repos but split them by version, meaning keep the latest versions here and the older versions in the new repo.
2. Have the site build / CI build only generate the `dist` folder with the docs we plan to keep in each repo.
3. Set the target repository for the build job in the new repo to point to the `gh-pages` branch of this repo.
   Upon reading the [options for the action](https://github.com/JamesIves/github-pages-deploy-action#optional-choices), I believe we can set `repository-name`, along with the needed token, in the new repo so that it points to this repo.
4. Ensure that https://github.com/apache/airflow-site/blob/7f3efc9d50467c79509952c095efb8afc429289e/.github/workflows/build.yml#L95 is set to `False` in this repo, as otherwise the docs that are not in this repo but in the new repo will be cleaned out when CI runs in this repo on merge to main. Alternatively, set `clean-exclude` (again based on the [options available in the GitHub action job](https://github.com/JamesIves/github-pages-deploy-action#optional-choices)) in this repo's CI build to not clean the files that live in the new repo.

With the above steps, I believe we will be able to have all the docs in this same repo's `gh-pages` branch and we would need to worry about additional changes in the JS/CSS files of the repo.

### Next steps

The above proposal is all theory based on my understanding so far, and I would like to hear opinions on it. I would like to hear if someone already knows whether this approach makes sense and is feasible/achievable, or if you can sense some issues/blockers here. I would really appreciate your time in reading this comment, and would also appreciate pointers on who else we could reach out to for feedback or additional expert advice. @potiuk @jedcunningham @phanikumv @mik-laj

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
