Hi Shyam,

We are already doing it. We wait for the rebalance status to be complete: we loop, checking whether the status is complete, for about 20 minutes.
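A minimal sketch of such a wait loop, for illustration only. It is not the actual Glusto code: `wait_for_rebalance_complete`, the `get_status` callable, its `(ret, status)` return shape, and the `"completed"` status string are all assumptions made for this example. The point it shows is that a failed status call (for example a commit RJT while the rebalance daemon is still starting) is treated as transient and retried until the timeout, instead of failing the test on the first attempt.

```python
import time


def wait_for_rebalance_complete(get_status, timeout=1200, interval=10,
                                sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it reports completion or the timeout expires.

    get_status is a hypothetical callable returning (ret, status):
      ret    -- 0 if the status command itself succeeded, non-zero otherwise
      status -- status string; "completed" is assumed to mean done
    A non-zero ret is retried rather than treated as a test failure.
    """
    deadline = clock() + timeout
    while True:
        ret, status = get_status()
        if ret == 0 and status == "completed":
            return True
        if clock() >= deadline:
            # Timed out without ever seeing a successful "completed" status.
            return False
        sleep(interval)
```

With the defaults above the loop gives up after roughly 20 minutes (1200 seconds), checking every 10 seconds; both values are placeholders.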
-Shwetha

On Tue, Aug 29, 2017 at 7:04 PM, Shyam Ranganathan <[email protected]> wrote:

> On 08/29/2017 09:31 AM, Atin Mukherjee wrote:
>
>> On Tue, Aug 29, 2017 at 4:13 AM, Shyam Ranganathan <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> Nigel, Shwetha,
>>
>> The latest Glusto run [a] that was started by Nigel, post fixing the
>> prior timeout issue, failed (much later though) again.
>>
>> I took a look at the logs and my analysis is here [b]
>>
>> @atin, @kaushal, @ppai can you take a look and see if the analysis
>> is correct?
>>
>>
>> I took a look at the logs and here is my theory:
>>
>> glusterd starts the rebalance daemon through runner framework with nowait
>> mode which essentially means that even though glusterd reports back a
>> success back to CLI for rebalance start, one of the node might take some
>> more additional time to start the rebalance process and establish rpc
>> connection. In this case we hit a race where while one of the node was
>> still trying to start the rebalance process a rebalance status command was
>> triggered which eventually failed on the node as rpc connection wasn't
>> successful and originator glusterd's commit op failed with "Received
>> commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d" failure.
>> Technically to avoid all these spurious time out issues we try to check the
>> status in a loop till a certain timeout. Isn't that the case in glusto? If
>> my analysis is correct, you shouldn't be seeing this failure on the 2nd
>> attempt as its a race.
>>
>
> Thanks Atin.
>
> In this case there is no second check or a timed check (sleep or otherwise
> (EXPECT_WITHIN like constructs)).
>
> @Shwetha, can we fix up this test and give it another go?
>
>>
>> In short glusterd has got an error when checking for rebalance stats
>> from one of the nodes as:
>> "Received commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d"
>>
>> and the rebalance daemon on the node with that UUID is not really
>> ready to serve requests when this was called, hence I am assuming
>> this is causing the error. But need a once over by one of you folks.
>>
>> @Shwetha, can we add a further timeout between rebalance start and
>> checking the status, just so that we avoid this timing issue on
>> these nodes.
>>
>> Thanks,
>> Shyam
>>
>> [a] glusto run:
>> https://ci.centos.org/view/Gluster/job/gluster_glusto/377/
>>
>> [b] analysis of the failure:
>> https://paste.fedoraproject.org/paste/mk6ynJ0B9AH6H9ncbyru5w
>>
>> On 08/25/2017 04:29 PM, Shyam Ranganathan wrote:
>>
>> Nigel was kind enough to kick off a glusto run on 3.12 head a
>> couple of days back. The status can be seen here [1].
>>
>> The run failed, but managed to get past what Glusto does on
>> master (see [2]). Not that this is a consolation, but just
>> stating the fact.
>>
>> The run [1] failed at,
>> 17:05:57
>> functional/bvt/test_cvt.py::TestGlusterHealSanity_dispersed_glusterfs::test_self_heal_when_io_in_progress
>> FAILED
>>
>> The test case failed due to,
>> 17:10:28 E AssertionError: ('Volume %s : All process are not online', 'testvol_dispersed')
>>
>> The test case can be seen here [3], and the reason for failure
>> is that Glusto did not wait long enough for the down brick to
>> come up (it waited for 10 seconds, but the brick came up after
>> 12 seconds or within the same second as the test for it being
>> up). The log snippets pointing to this problem are here [4]. In
>> short there was no real bug or issue that caused the failure as
>> yet.
>>
>> Glusto as a gating factor for this release was desirable, but
>> having got this far on 3.12 does help.
>>
>> @nigel, we could try post increasing the timeout between
>> bringing the brick up to checking if it is up, and try another
>> run, let me know if that works, and what is needed from me to
>> get this going.
>>
>> Shyam
>>
>> [1] Glusto 3.12 run:
>> https://ci.centos.org/view/Gluster/job/gluster_glusto/365/
>>
>> [2] Glusto on master:
>> https://ci.centos.org/view/Gluster/job/gluster_glusto/360/testReport/functional.bvt.test_cvt/
>>
>> [3] Failed test case:
>> https://ci.centos.org/view/Gluster/job/gluster_glusto/365/testReport/functional.bvt.test_cvt/TestGlusterHealSanity_dispersed_glusterfs/test_self_heal_when_io_in_progress/
>>
>> [4] Log analysis pointing to the failed check:
>> https://paste.fedoraproject.org/paste/znTPiFLrc2~vsWuoYRToZA
>>
>> "Releases are made better together"
>> _______________________________________________
>> Gluster-devel mailing list
>> [email protected]
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
