https://issues.apache.org/jira/browse/BEAM-7650 tracks the docker issue.
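
For reference, here is a minimal sketch of the "tag per build" idea raised in the quoted thread, so that two concurrent jobs on one worker do not overwrite each other's `latest` image. It is only an illustration under stated assumptions, not what the Beam build does today: the helper name `unique_image_ref` is hypothetical, and it assumes the job can read Jenkins' `BUILD_TAG` environment variable and pass the resulting reference to whatever builds and runs the container.

```python
import os
import re
import uuid


def unique_image_ref(repository, build_tag=None):
    """Return 'repository:tag' with a tag that is unique to one build.

    Hypothetical helper for illustration only. `build_tag` is expected to be
    something like Jenkins' BUILD_TAG; a random hex string is used if absent.
    """
    raw = build_tag or uuid.uuid4().hex
    # Docker tags may contain letters, digits, '_', '.' and '-', must not
    # start with '.' or '-', and are limited to 128 characters.
    tag = re.sub(r"[^A-Za-z0-9_.-]", "-", raw).lstrip(".-")[:128]
    if not tag:
        # Nothing usable was left after sanitizing; fall back to a random tag.
        tag = uuid.uuid4().hex
    return f"{repository}:{tag}"


# Usage: each job derives its own reference instead of sharing ':latest'.
image = unique_image_ref(
    "jenkins-docker-apache.bintray.io/beam/python",
    os.environ.get("BUILD_TAG"),
)
```

With distinct tags, the image built by one job is no longer replaced by a later job, and cleanup can remove exactly the tag a job created once it finishes.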

On Sun, Jun 30, 2019 at 2:35 PM Mark Liu <mark...@google.com> wrote:

> Thank you for triaging and working out a solution, Yifan and Ankur.
>
> Ankur, from what you discovered, we should fix this race condition;
> otherwise the same problem will happen in the future. Is there a jira
> tracking this issue?
>
> On Fri, Jun 28, 2019 at 4:56 PM Yifan Zou <yifan...@google.com> wrote:
>
>> Sorry for the inconvenience. I disabled the worker. I'll need more time
>> to restore it.
>>
>> On Fri, Jun 28, 2019 at 3:56 PM Daniel Oliveira <danolive...@google.com> wrote:
>>
>>> Any updates to this issue today? It seems like this (or a similar bug)
>>> is still happening across many Pre and Postcommits.
>>>
>>> On Fri, Jun 28, 2019 at 12:33 AM Yifan Zou <yifan...@google.com> wrote:
>>>
>>>> I did the prune on beam15. The disk was freed, but all jobs fail with
>>>> other weird problems. Looks like docker prune was overkill, but I
>>>> don't have evidence. Will look further in the AM.
>>>>
>>>> On Thu, Jun 27, 2019 at 11:20 PM Udi Meiri <eh...@google.com> wrote:
>>>>
>>>>> See how the hdfs IT already avoids tag collisions.
>>>>>
>>>>> On Thu, Jun 27, 2019, 20:42 Yichi Zhang <zyi...@google.com> wrote:
>>>>>
>>>>>> For flakiness, I guess a tag is needed to keep concurrent builds
>>>>>> apart.
>>>>>>
>>>>>> On Thu, Jun 27, 2019 at 8:39 PM Yichi Zhang <zyi...@google.com> wrote:
>>>>>>
>>>>>>> Maybe a cron job on the jenkins node that does docker prune every day?
>>>>>>>
>>>>>>> On Thu, Jun 27, 2019 at 6:58 PM Ankur Goenka <goe...@google.com> wrote:
>>>>>>>
>>>>>>>> This highlights the race condition caused by using a single docker
>>>>>>>> registry on a machine.
>>>>>>>> If 2 tests create "jenkins-docker-apache.bintray.io/beam/python" one
>>>>>>>> after another, then the 2nd one will replace the 1st one and cause
>>>>>>>> flakiness.
>>>>>>>>
>>>>>>>> Is there a way to dynamically create and destroy a docker repository
>>>>>>>> on a machine and clean all the relevant data?
>>>>>>>>
>>>>>>>> On Thu, Jun 27, 2019 at 3:15 PM Yifan Zou <yifan...@google.com> wrote:
>>>>>>>>
>>>>>>>>> The problem was caused by the large quantity of stale docker
>>>>>>>>> images generated by the Python portable tests and HDFS IT.
>>>>>>>>>
>>>>>>>>> Dumping the docker disk usage gives me:
>>>>>>>>>
>>>>>>>>> TYPE            TOTAL   ACTIVE   SIZE      RECLAIMABLE
>>>>>>>>> *Images         1039    356      424GB     384.2GB (90%)*
>>>>>>>>> Containers      987     2        2.042GB   2.041GB (99%)
>>>>>>>>> Local Volumes   126     0        392.8MB   392.8MB (100%)
>>>>>>>>>
>>>>>>>>> REPOSITORY                                                 TAG     IMAGE ID      CREATED       SIZE     SHARED SIZE  UNIQUE SIZE  CONTAINERS
>>>>>>>>> jenkins-docker-apache.bintray.io/beam/python3              latest  ff1b949f4442  22 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>> jenkins-docker-apache.bintray.io/beam/python               latest  1dda7b9d9748  22 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>> <none>                                                     <none>  05458187a0e3  22 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>> <none>                                                     <none>  896f35dd685f  23 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>> <none>                                                     <none>  db4d24ca9f2b  23 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>> <none>                                                     <none>  547df4d71c31  23 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>> <none>                                                     <none>  dd7d9582c3e0  23 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>> <none>                                                     <none>  664aae255239  23 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>> <none>                                                     <none>  b528fedf9228  23 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>> <none>                                                     <none>  8e996f22435e  25 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify_pr-818_test  latest  24b73b3fec06  25 hours ago  1.305GB  965.7MB      339.5MB      0
>>>>>>>>> <none>                                                     <none>  096325fb48de  25 hours ago  732.9MB  625.1MB      107.8MB      2
>>>>>>>>> jenkins-docker-apache.bintray.io/beam/java                 latest  c36d8ff2945d  25 hours ago  685.6MB  625.1MB      60.52MB      0
>>>>>>>>> <none>                                                     <none>  11c86ebe025f  26 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>> <none>                                                     <none>  2ecd69c89ec1  26 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8590_test    latest  3d1d589d44fe  2 days ago    1.305GB  965.7MB      339.5MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify_pr-801_test  latest  d1cc503ebe8e  2 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8577_test    latest  8582c6ca6e15  3 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8576_test    latest  4591e0948170  3 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8575_test    latest  ab181c49d56e  4 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8573_test    latest  2104ba0a6db7  4 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>> ...
>>>>>>>>> <1000+ images>
>>>>>>>>>
>>>>>>>>> I removed the unused images and beam15 is back now.
>>>>>>>>>
>>>>>>>>> Opened https://issues.apache.org/jira/browse/BEAM-7650.
>>>>>>>>> Ankur, I assigned the issue to you. Feel free to reassign it if
>>>>>>>>> needed.
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>> Yifan
>>>>>>>>>
>>>>>>>>> On Thu, Jun 27, 2019 at 11:29 AM Yifan Zou <yifan...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Something was eating the disk. Disconnected the worker so jobs
>>>>>>>>>> could be allocated to other nodes. Will look deeper.
>>>>>>>>>>
>>>>>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>>>>>> /dev/sda1       485G  485G   96K 100% /
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 27, 2019 at 10:54 AM Yifan Zou <yifan...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm on it.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 27, 2019 at 10:17 AM Udi Meiri <eh...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Opened a bug here:
>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-7648
>>>>>>>>>>>>
>>>>>>>>>>>> Can someone investigate what's going on?