potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-787506971


   Hey @ashb - we need bigger machines, as I suspected :)
   
   The good news is that it will be much cheaper in the long run as we will 
need them for far less time.
   
   The tests are failing, but mainly because of memory problems and timeouts, 
so I guess we are simply using too much RAM. If we up the machine to 64 GB, I 
think this should go rather smoothly. The good news is that even with not 
enough memory (and with failures/timeouts) the tests took ~26 m (!) for 
`sqlite` rather than > 1 h, so when we have enough memory we can achieve the 
15 minutes I was hoping for. Those 64 GB machines are only a bit more expensive 
than the 32 GB ones, so we will save a lot of credits when it works.
   
   We can even optimize it a bit further and have two self-hosted types:
   
   1) Big 64 GB ones for the tests
   2) Smaller 32 GB ones for everything else
   
   It should be rather easy to configure in the `CI.yml`, but I am not sure if 
the auto-scaling solution we have will handle two types?
   
   The job linked below had partial successes/failures, and it is actually a 
good one to show what the tests will look like. You can see that the output is 
nicely grouped, and you can see the monitoring and progress very clearly (it 
will be much nicer when we have more memory, because each test will progress 
much faster). I also print a summary of the failed tests at the end: only the 
"failure" outputs are fully printed to the logs, in "Red" groups. This will 
make it far easier to analyze problems (the same kind of output improvement is 
in the sequential version of the tests run on GitHub runners).
   
   In this case three test types succeeded (`Heisentests Core Providers`) and 
the remaining 5 had some failures (most of them, from what I see, due to 
timeouts, which is perfectly understandable if we ran out of memory and started 
to swap out to remote SSD in the cloud): 
https://github.com/apache/airflow/pull/14531/checks?check_run_id=1999175947. 
   
   ## Rationale for bigger machines
   
   From what I see we have machines with 32 GB, and since half of it will 
easily be eaten by `tmpfs` when we start writing logs and the like, we only 
have ~16 GB left, which is not enough. During my tests 
(https://twitter.com/higrys/status/1366037359461101569/photo/1) all the tests 
running in parallel took ~35 GB of memory on my 64 GB machine. I had just local 
SSD, not `tmpfs`, for those tests, but I do not think we need 30 GB of `tmpfs` 
for all logs, docker, tmp etc. (and we can fine-tune that if we do).
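
   A quick back-of-the-envelope check of the numbers above, as a sketch only: 
it assumes the `tmpfs` share stays at ~16 GB on the bigger machine and that the 
tests peak at the ~35 GB measured with local SSD. `headroom_gb` is a 
hypothetical helper, not something from the CI scripts.

```python
def headroom_gb(total_ram_gb: float, tmpfs_gb: float, tests_peak_gb: float) -> float:
    """RAM left once tmpfs and the parallel tests have taken their share."""
    return total_ram_gb - tmpfs_gb - tests_peak_gb


# Assumed figures, taken from the estimates in the comment above.
TMPFS_GB = 16       # ~half of a 32 GB machine eaten by tmpfs
TESTS_PEAK_GB = 35  # peak usage of all tests running in parallel

for total in (32, 64):
    left = headroom_gb(total, TMPFS_GB, TESTS_PEAK_GB)
    print(f"{total} GB machine: {left:+.0f} GB headroom")
```

   Under those assumptions the 32 GB machine is short by 19 GB, while the 64 GB 
one has 13 GB to spare.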
   
   Also, it is more important than before to clean up the `tmpfs` volumes 
before each run and make them "pristine" for every run, because we will be 
using nearly all of it. I think that will also help with cases like #14505, 
where some leftovers from previous runs are causing the jobs to fail.
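
   One way such a clean-up could look, as a minimal sketch: `wipe_contents` is 
a hypothetical helper (not part of the runner setup) that empties a directory 
while keeping the mount point itself. Which mounts to wipe, and at which point 
of the job, is an assumption left to the caller.

```python
import shutil
from pathlib import Path


def wipe_contents(mount_point: Path) -> int:
    """Delete everything under mount_point but keep the directory itself.

    Returns the number of top-level entries removed.
    """
    removed = 0
    for entry in mount_point.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real directory: remove recursively
        else:
            entry.unlink()        # file or symlink: remove just the entry
        removed += 1
    return removed
```

   Keeping the directory itself avoids unmounting or re-creating the `tmpfs` 
volume; only its contents go away.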
   
   This is the machine state before the tests are run: 
   
   ```
                   total        used        free      shared  buff/cache   available
     Mem:           30Gi       696Mi        24Gi       3.4Gi       5.5Gi        26Gi
     Swap:            0B          0B          0B
   
     Filesystem      Size  Used Avail Use% Mounted on
     /dev/root       7.7G  2.6G  5.2G  34% /
     devtmpfs         16G     0   16G   0% /dev
     tmpfs            16G     0   16G   0% /dev/shm
     tmpfs           3.1G  804K  3.1G   1% /run
     tmpfs           5.0M     0  5.0M   0% /run/lock
     tmpfs            16G     0   16G   0% /sys/fs/cgroup
     tmpfs           3.1G  168K  3.1G   1% /tmp
     tmpfs            21G  2.9G   18G  15% /var/lib/docker
     tmpfs            16G  534M   15G   4% /home/runner/actions-runner/_work
     /dev/loop0       98M   98M     0 100% /snap/core/10185
     /dev/loop1       56M   56M     0 100% /snap/core18/1885
     /dev/loop2       71M   71M     0 100% /snap/lxd/16922
     /dev/loop3       29M   29M     0 100% /snap/amazon-ssm-agent/2012
   
   ```
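
   The `Used` column of the `tmpfs` rows above is what actually sits in RAM at 
that moment (the `Size` column is only an upper limit). A small sketch for 
summing it from `df -h` output follows; `to_gib` and `tmpfs_used_gib` are 
hypothetical helpers, and the parser assumes the plain six-column layout shown 
above with K/M/G/T suffixes.

```python
UNITS = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def to_gib(size: str) -> float:
    """Convert a `df -h` size such as '804K', '2.9G' or '0' to GiB."""
    if size[-1] in UNITS:
        return float(size[:-1]) * UNITS[size[-1]]
    return float(size) / 1024**3  # no suffix: treat as bytes


def tmpfs_used_gib(df_output: str) -> float:
    """Sum the Used column over all rows whose filesystem is exactly 'tmpfs'."""
    total = 0.0
    for line in df_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0] == "tmpfs":
            total += to_gib(fields[2])
    return total


# The tmpfs rows from the listing above.
DF = """\
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           3.1G  804K  3.1G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup
tmpfs           3.1G  168K  3.1G   1% /tmp
tmpfs            21G  2.9G   18G  15% /var/lib/docker
tmpfs            16G  534M   15G   4% /home/runner/actions-runner/_work
"""

print(f"tmpfs in use: {tmpfs_used_gib(DF):.1f} GiB")
```

   For the listing above this comes to about 3.4 GiB, which lines up with the 
`shared` column reported by `free`.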



