I reimaged beam15. The worker is re-enabled. Let us know if anything weird happens on any agent.
Thanks.
Yifan

On Mon, Jul 1, 2019 at 10:00 AM Yifan Zou <[email protected]> wrote:

> https://issues.apache.org/jira/browse/BEAM-7650 tracks the docker issue.
>
> On Sun, Jun 30, 2019 at 2:35 PM Mark Liu <[email protected]> wrote:
>
>> Thank you for triaging and working out a solution, Yifan and Ankur.
>>
>> Ankur, from what you discovered, we should fix this race condition;
>> otherwise the same problem will happen in the future. Is there a jira
>> tracking this issue?
>>
>> On Fri, Jun 28, 2019 at 4:56 PM Yifan Zou <[email protected]> wrote:
>>
>>> Sorry for the inconvenience. I disabled the worker. I'll need more
>>> time to restore it.
>>>
>>> On Fri, Jun 28, 2019 at 3:56 PM Daniel Oliveira <[email protected]> wrote:
>>>
>>>> Any updates on this issue today? It seems like this (or a similar
>>>> bug) is still happening across many Pre- and Postcommits.
>>>>
>>>> On Fri, Jun 28, 2019 at 12:33 AM Yifan Zou <[email protected]> wrote:
>>>>
>>>>> I did the prune on beam15. The disk was freed, but all jobs fail
>>>>> with other weird problems. It looks like the docker prune removed
>>>>> too much, but I don't have evidence. Will look further in the AM.
>>>>>
>>>>> On Thu, Jun 27, 2019 at 11:20 PM Udi Meiri <[email protected]> wrote:
>>>>>
>>>>>> See how the hdfs IT already avoids tag collisions.
>>>>>>
>>>>>> On Thu, Jun 27, 2019, 20:42 Yichi Zhang <[email protected]> wrote:
>>>>>>
>>>>>>> For the flakiness, I guess a tag is needed to tell concurrent
>>>>>>> builds apart.
>>>>>>>
>>>>>>> On Thu, Jun 27, 2019 at 8:39 PM Yichi Zhang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Maybe a cron job on the jenkins nodes that does a docker prune
>>>>>>>> every day?
>>>>>>>>
>>>>>>>> On Thu, Jun 27, 2019 at 6:58 PM Ankur Goenka <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> This highlights the race condition caused by using a single
>>>>>>>>> docker registry on a machine. If 2 tests create
>>>>>>>>> "jenkins-docker-apache.bintray.io/beam/python" one after
>>>>>>>>> another, the 2nd one will replace the 1st one and cause
>>>>>>>>> flakiness.
>>>>>>>>>
>>>>>>>>> Is there a way to dynamically create and destroy a docker
>>>>>>>>> repository on a machine and clean up all the relevant data?
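One way to get what Ankur asks for is to make the image reference itself unique per build rather than sharing a fixed name, which is the same idea the hdfs IT uses to avoid tag collisions. A minimal sketch, assuming a Jenkins shell step where BUILD_TAG is set and contains only characters valid in a docker tag; the image name, test command, and Dockerfile are illustrative, not Beam's actual scripts:

    #!/bin/bash
    set -euo pipefail

    # Jenkins sets BUILD_TAG to "jenkins-${JOB_NAME}-${BUILD_NUMBER}", which is
    # unique per build, so concurrent builds never share an image reference.
    IMAGE="jenkins-docker-apache.bintray.io/beam/python:${BUILD_TAG}"

    # Remove the per-build image even if the test fails, so tags don't pile up.
    trap 'docker rmi --force "${IMAGE}" || true' EXIT

    docker build -t "${IMAGE}" .

    # Run the test against this build's image only.
    docker run --rm "${IMAGE}" ./run_tests.sh

Because every build owns its own tag, concurrent builds can no longer clobber each other, and the cleanup on exit keeps per-build images from accumulating, which is the disk problem diagnosed below.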
>>>>>>>>> On Thu, Jun 27, 2019 at 3:15 PM Yifan Zou <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> The problem was because of the large quantity of stale docker
>>>>>>>>>> images generated by the Python portable tests and the HDFS IT.
>>>>>>>>>>
>>>>>>>>>> Dumping the docker disk usage gives me:
>>>>>>>>>>
>>>>>>>>>> TYPE           TOTAL  ACTIVE  SIZE     RECLAIMABLE
>>>>>>>>>> *Images        1039   356     424GB    384.2GB (90%)*
>>>>>>>>>> Containers     987    2       2.042GB  2.041GB (99%)
>>>>>>>>>> Local Volumes  126    0       392.8MB  392.8MB (100%)
>>>>>>>>>>
>>>>>>>>>> REPOSITORY                                                 TAG     IMAGE ID      CREATED       SIZE     SHARED SIZE  UNIQUE SIZE  CONTAINERS
>>>>>>>>>> jenkins-docker-apache.bintray.io/beam/python3              latest  ff1b949f4442  22 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>>> jenkins-docker-apache.bintray.io/beam/python               latest  1dda7b9d9748  22 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>>> <none>                                                     <none>  05458187a0e3  22 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>>> <none>                                                     <none>  896f35dd685f  23 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>>> <none>                                                     <none>  db4d24ca9f2b  23 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>>> <none>                                                     <none>  547df4d71c31  23 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>>> <none>                                                     <none>  dd7d9582c3e0  23 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>>> <none>                                                     <none>  664aae255239  23 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>>> <none>                                                     <none>  b528fedf9228  23 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>>> <none>                                                     <none>  8e996f22435e  25 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify_pr-818_test  latest  24b73b3fec06  25 hours ago  1.305GB  965.7MB      339.5MB      0
>>>>>>>>>> <none>                                                     <none>  096325fb48de  25 hours ago  732.9MB  625.1MB      107.8MB      2
>>>>>>>>>> jenkins-docker-apache.bintray.io/beam/java                 latest  c36d8ff2945d  25 hours ago  685.6MB  625.1MB      60.52MB      0
>>>>>>>>>> <none>                                                     <none>  11c86ebe025f  26 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>>> <none>                                                     <none>  2ecd69c89ec1  26 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8590_test    latest  3d1d589d44fe  2 days ago    1.305GB  965.7MB      339.5MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify_pr-801_test  latest  d1cc503ebe8e  2 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8577_test    latest  8582c6ca6e15  3 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8576_test    latest  4591e0948170  3 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8575_test    latest  ab181c49d56e  4 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8573_test    latest  2104ba0a6db7  4 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>>> ...
>>>>>>>>>> <1000+ images>
>>>>>>>>>>
>>>>>>>>>> I removed the unused images and beam15 is back now.
>>>>>>>>>>
>>>>>>>>>> Opened https://issues.apache.org/jira/browse/BEAM-7650.
>>>>>>>>>> Ankur, I assigned the issue to you. Feel free to reassign it
>>>>>>>>>> if needed.
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>> Yifan
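Yichi's cron suggestion above is the simplest guard against this kind of build-up. A minimal sketch, assuming each agent gets a system crontab entry; the schedule, retention window, and log path are illustrative:

    # /etc/cron.d/docker-prune -- hypothetical file name
    # Daily at 03:00: delete stopped containers, unused networks, and all
    # images not referenced by a container and older than 24h, no prompt.
    0 3 * * * root docker system prune --all --force --filter "until=24h" >> /var/log/docker-prune.log 2>&1

`--all` is what matters here: without it, prune only removes dangling (untagged) images, and the tagged hdfs_it-* and beam/* images listed above would survive.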
>>>>>>>>>> On Thu, Jun 27, 2019 at 11:29 AM Yifan Zou <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Something was eating the disk. I disconnected the worker so
>>>>>>>>>>> jobs could be allocated to other nodes. Will look deeper.
>>>>>>>>>>>
>>>>>>>>>>> Filesystem  Size  Used  Avail  Use%  Mounted on
>>>>>>>>>>> /dev/sda1   485G  485G  96K    100%  /
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 27, 2019 at 10:54 AM Yifan Zou <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm on it.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 27, 2019 at 10:17 AM Udi Meiri <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Opened a bug here:
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-7648
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can someone investigate what's going on?
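For the next time an agent fills up, the checks used across this thread can be repeated in one pass. A minimal sketch; the du paths are illustrative:

    # Is the root filesystem full? (the df output Yifan quotes above)
    df -h /

    # Per-image/container/volume breakdown; this is the kind of report
    # quoted earlier in the thread.
    docker system df -v

    # If docker is not to blame, find the largest directories instead.
    sudo du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -n 20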
