pankajkoti commented on issue #773:
URL: https://github.com/apache/airflow-site/issues/773#issuecomment-1525852259

   I got a chance today to read more about our setup in the repo and studied 
the `build.yml` and `site.sh` scripts.
   
   ### Understanding so far
   The build jobs create a `dist` folder when we run the `./site.sh build-site` 
command. On the current main branch, a local build produces a `dist` folder of 
roughly `10.3GB`. Almost all of that is the `docs` folder inside `dist`, which 
itself reads `~10.3GB` at the moment, so all the other directories take up 
minimal space relative to the `docs` folder.
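
   The `du` measurements can be approximated with a short script. This is a 
minimal sketch, not part of the repo: the helper names `dir_size_bytes` and 
`subdir_sizes` are hypothetical, and it sums apparent file sizes rather than 
disk blocks, so the numbers may differ slightly from what `du` reports.

```python
from __future__ import annotations

import os
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Sum the sizes of all regular files under `path`, skipping symlinks."""
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def subdir_sizes(parent: Path) -> list[tuple[str, int]]:
    """Return (name, bytes) for each immediate subdirectory, largest first."""
    sizes = [
        (child.name, dir_size_bytes(child))
        for child in parent.iterdir()
        if child.is_dir() and not child.is_symlink()
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = Path("dist/docs")  # path from the comment; adjust for your checkout
    if docs.is_dir():
        for name, size in subdir_sizes(docs):
            print(f"{size / 1e9:8.2f} GB  {name}")
```

   Run from the repo root after `./site.sh build-site`, this lists the 
largest folders under `dist/docs` first.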
   
   `du` on root folder
   <img width="685" alt="Screenshot 2023-04-27 at 7 44 06 PM" 
src="https://user-images.githubusercontent.com/10206082/234889559-4e0c367f-1926-40d0-bffb-dc0eb7460a8b.png">
   
   `du` on the `dist` folder
   <img width="601" alt="Screenshot 2023-04-27 at 7 44 30 PM" 
src="https://user-images.githubusercontent.com/10206082/234889667-b5b104e5-e883-44df-a78c-113c5a5131d8.png">
   
   `du` on the `dist/docs` folder
   <img width="763" alt="Screenshot 2023-04-27 at 8 17 37 PM" 
src="https://user-images.githubusercontent.com/10206082/234900059-326951fb-baae-49dd-846c-7cc1d19af6fc.png">
   
   
   GitHub runners guarantee at least 14GB of disk space for a run: 
https://github.com/actions/runner-images/issues/2840#issuecomment-791177163.
   
   What I understood is that when we create a PR, the [line in our CI 
build](https://github.com/apache/airflow-site/blob/7f3efc9d50467c79509952c095efb8afc429289e/.github/workflows/build.yml#L66) 
removes the `docs` folder before proceeding to the next steps, and as a 
result, the build job for PRs would hardly ever fail. 
   But when the PR is merged to main, this huge `docs` folder is not removed. 
When we tried to deploy the website here: 
https://github.com/apache/airflow-site/blob/7f3efc9d50467c79509952c095efb8afc429289e/.github/workflows/build.yml#L88, 
the [`Deploy website on asf-site branch` GitHub Actions 
job](https://github.com/apache/airflow-site/actions/runs/4243773508/jobs/7376961974) 
failed with a disk out-of-space error while copying the `dist` folder to the 
`gh-pages` branch of our repository using the wrapper action 
`apache/airflow-JamesIves-github-pages-deploy-action` (the base action is 
https://github.com/JamesIves/github-pages-deploy-action). 
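
   A rough headroom check shows why the deploy step runs out of space. This is 
only a sketch under one assumption that is not confirmed anywhere in the 
workflow: that the runner's 14GB has to hold both the built `dist` folder and 
one more copy of it made during deployment. The helper `fits_with_copy` is 
hypothetical.

```python
import shutil

GB = 10**9


def fits_with_copy(payload_bytes: int, free_bytes: int, extra_copies: int = 1) -> bool:
    """True if `extra_copies` additional copies of the payload fit in the free space."""
    return payload_bytes * extra_copies <= free_bytes


# Figures from the comment: ~10.3 GB `dist`, at least 14 GB on the runner.
dist_bytes = int(10.3 * GB)
runner_bytes = 14 * GB

# Assumption: the 14 GB must hold the built `dist` plus the copy made for deployment.
free_after_build = runner_bytes - dist_bytes
print(fits_with_copy(dist_bytes, free_after_build))  # False: ~3.7 GB left for a ~10.3 GB copy

# The same check against the real free space on the current machine.
free_here = shutil.disk_usage(".").free
print(fits_with_copy(dist_bytes, free_here))
```

   Under that assumption the PR builds pass only because the `docs` folder is 
removed first, which matches the behaviour described above.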
   
   I believe our website is deployed from the `gh-pages` branch and all the 
content that is available in it gets published as per my chat with ChatGPT :) 
   
   ### Solution Proposal (Theory)
   
   We can replicate the setup including the CI and files/folder from this repo 
into https://github.com/apache/airflow-site-archive with the following tweaks.
   1. Split our 
[docs-archive](https://github.com/apache/airflow-site/tree/7f3efc9d50467c79509952c095efb8afc429289e/docs-archive) 
folder, which becomes the `docs` folder during the build (occupying this 
huge ~10.3GB), and copy part of it to the new repo using either of the 
approaches below:
   a. Keep certain providers in this repo and the remaining providers in the 
new repo, based on how much space each provider occupies, as can be seen in 
the screenshot above for the `dist/docs` directory
   b. Keep all providers in both repos but split them by version, meaning keep 
the latest versions here and the older versions in the new repo
   2. Have the site build / CI build only generate the `dist` folder with the 
docs we plan to keep in each repo.
   3. Set the target repository for the build job in the new repo to point to 
the `gh-pages` branch of this repo. Upon reading the [options for the 
action](https://github.com/JamesIves/github-pages-deploy-action#optional-choices),
 I believe, we can set the `repository-name` with the needed token in the new 
repo pointing to this repo.
   4. Ensure that 
https://github.com/apache/airflow-site/blob/7f3efc9d50467c79509952c095efb8afc429289e/.github/workflows/build.yml#L95
 is set to `False` in this repo as otherwise the docs that are not in this repo 
but in the new repo will be cleaned out when CI is run in this repo on merge to 
main. Alternatively, set `clean-exclude` (again based on the [options available 
in the GitHub action 
job](https://github.com/JamesIves/github-pages-deploy-action#optional-choices)) 
in this repo's CI build to not clean such files that are in the new repo.
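
   The two splitting approaches in step 1 could be sketched as follows. This 
is an illustration only: the function names, the size budget, and the sample 
package names and numbers are all hypothetical, and the version parser assumes 
plain dotted numeric folder names (anything else, such as `stable`, is left 
out of both lists).

```python
from __future__ import annotations


def split_by_size(sizes: dict[str, int], budget_bytes: int) -> tuple[list[str], list[str]]:
    """Approach (a): keep the smallest packages here until the budget is used up;
    everything else moves to the new repo."""
    keep: list[str] = []
    move: list[str] = []
    used = 0
    for name, size in sorted(sizes.items(), key=lambda item: item[1]):
        if used + size <= budget_bytes:
            keep.append(name)
            used += size
        else:
            move.append(name)
    return keep, move


def parse_version(name: str) -> tuple[int, ...] | None:
    """Parse a dotted numeric version such as '2.5.3'; return None for anything else."""
    parts = name.split(".")
    if all(part.isdigit() for part in parts):
        return tuple(int(part) for part in parts)
    return None


def split_by_version(versions: list[str], keep_latest: int) -> tuple[list[str], list[str]]:
    """Approach (b): keep the newest `keep_latest` versions here, move the older ones."""
    parsed = [(parse_version(v), v) for v in versions]
    numeric = sorted((p, v) for p, v in parsed if p is not None)
    numeric.reverse()  # newest first
    keep = [v for _, v in numeric[:keep_latest]]
    move = [v for _, v in numeric[keep_latest:]]
    return keep, move


# Hypothetical sizes in bytes, for illustration only (not measured values).
example_sizes = {"pkg-large": 9, "pkg-medium": 5, "pkg-small": 3}
print(split_by_size(example_sizes, budget_bytes=8))

# Hypothetical version folder names for one package.
print(split_by_version(["1.0.0", "2.10.0", "2.9.1", "stable"], keep_latest=2))
```

   Approach (a) keeps whole packages together, while approach (b) keeps every 
package in both repos, so step 4 (`clean` or `clean-exclude`) matters for 
either split.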
   
   With the above steps, I believe we will be able to have all the docs in this 
same repo's `gh-pages` branch and we would need to worry about additional 
changes in the JS/CSS files of the repo.
   
   ### Next steps
   The above proposal is all theory based on my understanding so far, and I 
would like to hear opinions on it. In particular, I would like to hear whether 
this approach makes sense and is feasible, or whether anyone can foresee 
issues or blockers.
   
   I would really appreciate your time in reading this comment, and would also 
appreciate any pointers on who else we could reach out to for feedback or 
additional expert advice.
   
   @potiuk @jedcunningham @phanikumv @mik-laj 

