I reimaged beam15 and the worker is re-enabled. Let us know if anything
weird happens on any agent.

Thanks.
Yifan

On Mon, Jul 1, 2019 at 10:00 AM Yifan Zou <[email protected]> wrote:

> https://issues.apache.org/jira/browse/BEAM-7650 tracks the docker issue.
>
> On Sun, Jun 30, 2019 at 2:35 PM Mark Liu <[email protected]> wrote:
>
>> Thank you for triaging and working out a solution, Yifan and Ankur.
>>
>> Ankur, from what you discovered, we should fix this race condition;
>> otherwise the same problem will happen again in the future. Is there a
>> JIRA tracking this issue?
>>
>> On Fri, Jun 28, 2019 at 4:56 PM Yifan Zou <[email protected]> wrote:
>>
>>> Sorry for the inconvenience. I disabled the worker. I'll need more time
>>> to restore it.
>>>
>>> On Fri, Jun 28, 2019 at 3:56 PM Daniel Oliveira <[email protected]>
>>> wrote:
>>>
>>>> Any updates on this issue today? It seems like this (or a similar bug)
>>>> is still happening across many Pre- and Postcommit jobs.
>>>>
>>>> On Fri, Jun 28, 2019 at 12:33 AM Yifan Zou <[email protected]> wrote:
>>>>
>>>>> I ran the prune on beam15. The disk space was freed, but all jobs now
>>>>> fail with other weird problems. It looks like the docker prune removed
>>>>> too much, but I don't have evidence yet. Will look further in the
>>>>> morning.
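>>>>>
>>>>> For the record, the cleanup was roughly along these lines (flags from
>>>>> memory, so approximate):
>>>>>
>>>>>   # Remove stopped containers, unused networks, dangling build cache,
>>>>>   # and (-a) every image not referenced by a container; -f skips the
>>>>>   # confirmation prompt.
>>>>>   docker system prune -a -f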
>>>>>
>>>>> On Thu, Jun 27, 2019 at 11:20 PM Udi Meiri <[email protected]> wrote:
>>>>>
>>>>>> See how the hdfs IT already avoids tag collisions.
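>>>>>>
>>>>>> Judging by the image names in the listing below, it puts the Jenkins
>>>>>> BUILD_TAG into the docker-compose project name, so each run builds its
>>>>>> own uniquely named images; roughly (a sketch, not the exact script):
>>>>>>
>>>>>>   # BUILD_TAG is jenkins-<job name>-<build number>, unique per run;
>>>>>>   # compose names the built images <project>_<service>.
>>>>>>   docker-compose -p "hdfs_IT-${BUILD_TAG}" build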
>>>>>>
>>>>>> On Thu, Jun 27, 2019, 20:42 Yichi Zhang <[email protected]> wrote:
>>>>>>
>>>>>>> For the flakiness, I guess a unique tag is needed to keep concurrent
>>>>>>> builds apart.
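>>>>>>>
>>>>>>> e.g. something like this (illustrative, not an exact command):
>>>>>>>
>>>>>>>   # Tag per Jenkins build instead of a shared "latest" so parallel
>>>>>>>   # jobs stop clobbering each other's images.
>>>>>>>   docker build -t "jenkins-docker-apache.bintray.io/beam/python:${BUILD_NUMBER}" .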
>>>>>>>
>>>>>>> On Thu, Jun 27, 2019 at 8:39 PM Yichi Zhang <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Maybe a cron job on the Jenkins nodes that does a docker prune every
>>>>>>>> day?
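>>>>>>>>
>>>>>>>> e.g. a crontab entry roughly like this (schedule and flags are just
>>>>>>>> a sketch):
>>>>>>>>
>>>>>>>>   # Every day at 3am, drop anything unused and older than a day.
>>>>>>>>   0 3 * * * docker system prune -a -f --filter "until=24h"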
>>>>>>>>
>>>>>>>> On Thu, Jun 27, 2019 at 6:58 PM Ankur Goenka <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This highlights the race condition caused by using a single docker
>>>>>>>>> registry on a machine.
>>>>>>>>> If 2 tests create "jenkins-docker-apache.bintray.io/beam/python" one
>>>>>>>>> after another, then the 2nd one will replace the 1st and cause
>>>>>>>>> flakiness.
>>>>>>>>>
>>>>>>>>> Is there a way to dynamically create and destroy a docker repository
>>>>>>>>> on a machine and clean up all the relevant data?
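>>>>>>>>>
>>>>>>>>> An untested sketch of that idea with the stock registry image (names
>>>>>>>>> and port are arbitrary):
>>>>>>>>>
>>>>>>>>>   # Spin up a throwaway registry for this build.
>>>>>>>>>   docker run -d --name "registry-${BUILD_NUMBER}" -p 5000:5000 registry:2
>>>>>>>>>   # ... tag and push the test images against localhost:5000 ...
>>>>>>>>>   # Tear it down and delete its storage when the build finishes.
>>>>>>>>>   docker rm -f -v "registry-${BUILD_NUMBER}"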
>>>>>>>>>
>>>>>>>>> On Thu, Jun 27, 2019 at 3:15 PM Yifan Zou <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The problem was caused by the large quantity of stale docker
>>>>>>>>>> images generated by the Python portable tests and the HDFS IT.
>>>>>>>>>>
>>>>>>>>>> Dumping the docker disk usage gives me:
>>>>>>>>>>
>>>>>>>>>> TYPE                TOTAL               ACTIVE              SIZE                RECLAIMABLE
>>>>>>>>>> *Images             1039                356                 424GB               384.2GB (90%)*
>>>>>>>>>> Containers          987                 2                   2.042GB             2.041GB (99%)
>>>>>>>>>> Local Volumes       126                 0                   392.8MB             392.8MB (100%)
>>>>>>>>>>
>>>>>>>>>> REPOSITORY                                                 TAG      IMAGE ID      CREATED       SIZE     SHARED SIZE  UNIQUE SIZE  CONTAINERS
>>>>>>>>>> jenkins-docker-apache.bintray.io/beam/python3              latest   ff1b949f4442  22 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>>> jenkins-docker-apache.bintray.io/beam/python               latest   1dda7b9d9748  22 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>>> <none>                                                     <none>   05458187a0e3  22 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>>> <none>                                                     <none>   896f35dd685f  23 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>>> <none>                                                     <none>   db4d24ca9f2b  23 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>>> <none>                                                     <none>   547df4d71c31  23 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>>> <none>                                                     <none>   dd7d9582c3e0  23 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>>> <none>                                                     <none>   664aae255239  23 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>>> <none>                                                     <none>   b528fedf9228  23 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>>> <none>                                                     <none>   8e996f22435e  25 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify_pr-818_test  latest   24b73b3fec06  25 hours ago  1.305GB  965.7MB      339.5MB      0
>>>>>>>>>> <none>                                                     <none>   096325fb48de  25 hours ago  732.9MB  625.1MB      107.8MB      2
>>>>>>>>>> jenkins-docker-apache.bintray.io/beam/java                 latest   c36d8ff2945d  25 hours ago  685.6MB  625.1MB      60.52MB      0
>>>>>>>>>> <none>                                                     <none>   11c86ebe025f  26 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>>> <none>                                                     <none>   2ecd69c89ec1  26 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8590_test    latest   3d1d589d44fe  2 days ago    1.305GB  965.7MB      339.5MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify_pr-801_test  latest   d1cc503ebe8e  2 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8577_test    latest   8582c6ca6e15  3 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8576_test    latest   4591e0948170  3 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8575_test    latest   ab181c49d56e  4 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8573_test    latest   2104ba0a6db7  4 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>>> ...
>>>>>>>>>> <1000+ images>
>>>>>>>>>>
>>>>>>>>>> I removed the unused images and beam15 is back now.
>>>>>>>>>>
>>>>>>>>>> Opened https://issues.apache.org/jira/browse/BEAM-7650.
>>>>>>>>>> Ankur, I assigned the issue to you. Feel free to reassign it if
>>>>>>>>>> needed.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>> Yifan
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 27, 2019 at 11:29 AM Yifan Zou <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Something was eating the disk. I disconnected the worker so jobs
>>>>>>>>>>> can be allocated to other nodes. Will look deeper.
>>>>>>>>>>>
>>>>>>>>>>> Filesystem      Size  Used  Avail  Use%  Mounted on
>>>>>>>>>>> /dev/sda1       485G  485G  96K    100%  /
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 27, 2019 at 10:54 AM Yifan Zou <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm on it.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 27, 2019 at 10:17 AM Udi Meiri <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Opened a bug here:
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-7648
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can someone investigate what's going on?
>>>>>>>>>>>>>
>>>>>>>>>>>>
