Cool, thanks!

On Wed, 10 Nov 2021, 18:14 Jarek Potiuk, <[email protected]> wrote:

> HA! I FOUND IT!
>
> It's not the backfill_job. It's `test_kubernetes_executor.py` - and not
> even that - it's the code coverage plugin, while
> test_kubernetes_executor.py is running, that takes a lot of memory.
>
> When you run `test_kubernetes_executor.py` it can take a lot of memory
> (2-3GB) and does not free it even after the kubernetes tests are
> completed, so the memory remains taken while the subsequent tests run.
>
> It seems that the coverage plugin keeps a loooooot of data in memory
> about the code coverage resulting from those tests. When I disable code
> coverage, the memory remaining after test_kubernetes_executor is ~ 700 MB (!)
>
> I am going to disable coverage for PR builds. It's very rarely looked at,
> and actually only the coverage from main makes sense, because only there
> do we have a guarantee of running all tests.
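>
> Roughly, the idea is to add the pytest-cov flags only when coverage is
> wanted. A sketch of the idea (not the actual CI code - the
> ENABLE_COVERAGE variable below is hypothetical and would only be set for
> builds on main):
>
>   import os
>   import subprocess
>   import sys
>
>   # PR builds run plain pytest; only main-branch builds collect coverage.
>   pytest_args = ["pytest", "tests/"]
>   if os.environ.get("ENABLE_COVERAGE", "false") == "true":
>       pytest_args += ["--cov=airflow", "--cov-report=xml"]
>
>   sys.exit(subprocess.call(pytest_args))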
>
> J.
>
> On Wed, Nov 10, 2021 at 6:37 PM Jarek Potiuk <[email protected]> wrote:
>
>> It seems to happen much less frequently, but it is still there.
>>
>> And the culprit is - most likely - `test_backfill_job.py`.
>>
>> In this case just before it failed it took 77% of the memory (5.3 GB):
>>
>>   ########### STATISTICS #################
>>   CONTAINER ID   NAME                                          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
>>   520e4674350d   airflow-core-mysql_airflow_run_37e48532d683   171.93%   5.24GiB / 6.789GiB    77.18%    33.5MB / 20.4MB   144MB / 152kB    152
>>   eedf0f4b9f1e   airflow-core-mysql_mysql_1                    0.09%     108.3MiB / 6.789GiB   1.56%     20.4MB / 33.5MB   22.7MB / 530MB   32
>>
>>                 total        used        free      shared  buff/cache   available
>>   Mem:           6951        6754         125           6          72          16
>>   Swap:             0           0           0
>>
>>   Filesystem      Size  Used Avail Use% Mounted on
>>   /dev/root        84G   54G   30G  65% /
>>   /dev/sdb15      105M  5.2M  100M   5% /boot/efi
>>   /dev/sda1        14G  4.1G  9.0G  32% /mnt
>>   ########### STATISTICS #################
>>   ### The last 2 lines for Core process: /tmp/tmp.7v6Wm3jIOT/tests/Core/stdout ###
>>   tests/executors/test_sequential_executor.py .                           [ 49%]
>>   tests/jobs/test_backfill_job.py ..
>>
>> That should likely be enough to investigate it, or maybe to mitigate it
>> somehow and add some extra cleanup between tests if the memory is not
>> freed between them.
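>>
>> One way to do that (just a sketch, assuming psutil is available in the
>> test image - it is not necessarily wired up like this) would be an
>> autouse fixture in conftest.py that forces a GC pass and logs resident
>> memory after each test module, so the leaking module is easy to spot:
>>
>>   import gc
>>   import os
>>
>>   import psutil
>>   import pytest
>>
>>   @pytest.fixture(autouse=True, scope="module")
>>   def report_and_reclaim_memory():
>>       yield
>>       # Drop objects that are only kept alive by reference cycles.
>>       gc.collect()
>>       rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
>>       print(f"RSS after module: {rss_mb:.0f} MiB")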
>>
>> J.
>>
>>
>>
>> On Wed, Nov 10, 2021 at 6:11 PM Khalid Mammadov <[email protected]> wrote:
>>
>>> Just to let you know.
>>>
>>> It looks like the issue is still there:
>>>
>>> https://github.com/apache/airflow/runs/4167464563?check_suite_focus=true
>>>
>>>
>>>
>>> On 10/11/2021 13:40, Jarek Potiuk wrote:
>>>
>>> Merged! Please rebase (Khalid - you can remove your workaround) and let
>>> me know.
>>>
>>> There is one failure that happened in my tests:
>>>
>>> https://github.com/apache/airflow/runs/4165358689?check_suite_focus=true
>>> - but we can observe the results of this one and try to find the reason
>>> separately if it continues to happen.
>>>
>>> J.
>>>
>>> On Wed, Nov 10, 2021 at 12:49 PM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> Fix being tested in: https://github.com/apache/airflow/pull/19512
>>>> (committer PR) and https://github.com/apache/airflow/pull/19514
>>>> (regular user PR).
>>>>
>>>>
>>>> On Wed, Nov 10, 2021 at 11:25 AM Jarek Potiuk <[email protected]> wrote:
>>>>
>>>>> OK. I took a look. It looks like the "Core" tests do indeed go over
>>>>> 50% of the memory available on GitHub runners, briefly (and sometimes
>>>>> for a longer time). I do not think optimizing them now makes much
>>>>> sense - because even if we optimize them now, they will likely soon
>>>>> reach 50-60% of available memory again, which - when there are other
>>>>> parallel tests running - can easily lead to OOM.
>>>>>
>>>>> It looks like those are only the "Core" type of tests, so the
>>>>> solution will be (similarly to the "Integration" tests) to separate
>>>>> them out into a non-parallel run for GitHub runners.
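>>>>>
>>>>> Conceptually the split would look something like the sketch below
>>>>> (the real selection logic lives in the CI scripts, and the set of
>>>>> "heavy" test types here is just an assumption for illustration):
>>>>>
>>>>>   # Run memory-heavy test types one by one, everything else in parallel.
>>>>>   HEAVY_TEST_TYPES = {"Core", "Integration"}
>>>>>
>>>>>   def split_test_types(test_types: list[str]) -> tuple[list[str], list[str]]:
>>>>>       parallel = [t for t in test_types if t not in HEAVY_TEST_TYPES]
>>>>>       sequential = [t for t in test_types if t in HEAVY_TEST_TYPES]
>>>>>       return parallel, sequential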
>>>>>
>>>>> On Tue, Nov 9, 2021 at 9:33 PM Jarek Potiuk <[email protected]> wrote:
>>>>>
>>>>>> Yep. Apparently one of the recent tests is using too much memory. I
>>>>>> had some private errands that made me less available for the last few
>>>>>> days - but I will have time to catch up tonight/tomorrow.
>>>>>>
>>>>>> Thanks for changing the "parallel" level in your PR - that will give
>>>>>> me more data points. I've just re-run both PRs with the
>>>>>> "debug-ci-resources" label. This is our "debug" label that shows
>>>>>> resource use during the build, and I might be able to find and fix
>>>>>> the root cause.
>>>>>>
>>>>>> For the future - in case any other committer wants to investigate it
>>>>>> - setting the "debug-ci-resources" label turns on the debugging mode,
>>>>>> which periodically prints the information below alongside the
>>>>>> progress of the tests. It can be helpful in determining what caused
>>>>>> the OOM (see the sketch after the sample output for roughly what it
>>>>>> does):
>>>>>>
>>>>>> CONTAINER ID   NAME                                            CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
>>>>>> c46832148ff7   airflow-always-mssql_airflow_run_e59b6039c3d8   99.59%    365.1MiB / 6.789GiB   5.25%     1.62MB / 3.33MB   8.97MB / 20.5kB   8
>>>>>> f4d2a192d6fc   airflow-always-mssql_mssqlsetup_1               0.00%     0B / 0B               0.00%     0B / 0B           0B / 0B           0
>>>>>> a668cdedc717   airflow-api-mssql_airflow_run_bcc466077ac0      35.07%    431.4MiB / 6.789GiB   6.21%     2.26MB / 4.47MB   73.2MB / 20.5kB   8
>>>>>> f306f4221ba1   airflow-api-mssql_mssqlsetup_1                  0.00%     0B / 0B               0.00%     0B / 0B           0B / 0B           0
>>>>>> 7f10748e9496   airflow-api-mssql_mssql_1                       30.66%    735.5MiB / 6.789GiB   10.58%    4.47MB / 2.26MB   36.8MB / 124MB    132
>>>>>> 8b5ca767ed0c   airflow-always-mssql_mssql_1                    12.59%    716.5MiB / 6.789GiB   10.31%    3.33MB / 1.63MB   36.7MB / 52.7MB   131
>>>>>>
>>>>>>               total        used        free      shared  buff/cache   available
>>>>>> Mem:           6951        2939         200           6        3811        3702
>>>>>> Swap:             0           0           0
>>>>>>
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> /dev/root        84G   51G   33G  61% /
>>>>>> /dev/sda15      105M  5.2M  100M   5% /boot/efi
>>>>>> /dev/sdb1        14G  4.1G  9.0G  32% /mnt
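>>>>>>
>>>>>> For illustration, this is roughly the kind of loop the debug mode
>>>>>> runs (a sketch only, not the actual CI script):
>>>>>>
>>>>>>   import subprocess
>>>>>>   import time
>>>>>>
>>>>>>   def dump_resources(interval_seconds: int = 60) -> None:
>>>>>>       """Periodically print container and host resource statistics."""
>>>>>>       while True:
>>>>>>           subprocess.run(["docker", "stats", "--no-stream"], check=False)
>>>>>>           subprocess.run(["free", "-m"], check=False)
>>>>>>           subprocess.run(["df", "-h"], check=False)
>>>>>>           time.sleep(interval_seconds)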
>>>>>>
>>>>>> J.
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 9, 2021 at 9:19 PM Oliveira, Niko <[email protected]> wrote:
>>>>>>
>>>>>>> Hey all,
>>>>>>>
>>>>>>>
>>>>>>> Just to throw another data point in the ring, I've had a PR
>>>>>>> <https://github.com/apache/airflow/pull/19410> stuck in the same
>>>>>>> way as well. Several retries are all failing with the same OOM.
>>>>>>>
>>>>>>>
>>>>>>> I've also dug through the Github Actions history and found a few
>>>>>>> others. So it doesn't seem to be just a one-off.
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Niko
>>>>>>> ------------------------------
>>>>>>> *From:* Khalid Mammadov <[email protected]>
>>>>>>> *Sent:* Tuesday, November 9, 2021 6:24 AM
>>>>>>> *To:* [email protected]
>>>>>>> *Subject:* [EXTERNAL] OOM issue in the CI
>>>>>>>
>>>>>>>
>>>>>>> Hi Devs,
>>>>>>>
>>>>>>> I have been working on the below PR and have run into an OOM issue
>>>>>>> during testing on GitHub Actions (you can see it in the commit history).
>>>>>>>
>>>>>>> https://github.com/apache/airflow/pull/19139/files
>>>>>>>
>>>>>>> The tests for the Postgres, MySQL, etc. databases fail due to OOM
>>>>>>> and Docker gets killed.
>>>>>>>
>>>>>>> I have reduced parallelism to 1 "in the code" *temporarily* (the
>>>>>>> only extra change in the PR) and it passes all the checks, which
>>>>>>> confirms the issue.
>>>>>>>
>>>>>>>
>>>>>>> I was hoping you could advise on the best course of action in this
>>>>>>> situation - whether I should force parallelism to 1 to get all
>>>>>>> checks passing, or whether there is some other way to solve the OOM?
>>>>>>>
>>>>>>> Any help would be appreciated.
>>>>>>>
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>>
>>>>>>> Khalid
>>>>>>>
>>>>>>
