Hi Shyam, we are already doing this. After starting the rebalance, we wait
for the rebalance status to be complete: we check the status in a loop for
up to 20 minutes or so.
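
For reference, a minimal sketch of that kind of timed status loop, not the
actual Glusto code. The helper name wait_for, its parameters, and the
10-second polling interval are assumptions made for illustration; only the
20-minute budget comes from this thread. A check that raises (for example
because the status command was rejected while the rebalance daemon was still
starting) is treated as "not ready yet" and retried until the deadline.

```python
import time


def wait_for(check, timeout=20 * 60, interval=10.0):
    """Poll check() until it returns True or `timeout` seconds elapse.

    Hypothetical helper: an exception raised by check() is treated as a
    transient failure and the check is retried on the next iteration.
    Returns True on success, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # Transient failure (e.g. daemon not yet ready); retry.
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

A test would call something like wait_for(lambda: rebalance_is_complete(volname)),
where rebalance_is_complete is a placeholder for whatever status check the
test library provides.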

-Shwetha

On Tue, Aug 29, 2017 at 7:04 PM, Shyam Ranganathan <[email protected]>
wrote:

> On 08/29/2017 09:31 AM, Atin Mukherjee wrote:
>
>>
>>
>> On Tue, Aug 29, 2017 at 4:13 AM, Shyam Ranganathan <[email protected]>
>> wrote:
>>
>>     Nigel, Shwetha,
>>
>>     The latest Glusto run [a] that was started by Nigel, post fixing the
>>     prior timeout issue, failed (much later though) again.
>>
>>     I took a look at the logs and my analysis is here [b]
>>
>>     @atin, @kaushal, @ppai can you take a look and see if the analysis
>>     is correct?
>>
>>
>> I took a look at the logs and here is my theory:
>>
>> glusterd starts the rebalance daemon through the runner framework in nowait
>> mode, which essentially means that even though glusterd reports success
>> back to the CLI for rebalance start, one of the nodes might take some
>> additional time to start the rebalance process and establish the rpc
>> connection. In this case we hit a race: while one of the nodes was still
>> trying to start the rebalance process, a rebalance status command was
>> triggered, which eventually failed on that node as the rpc connection
>> wasn't successful, and the originator glusterd's commit op failed with a
>> "Received commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d"
>> failure. Technically, to avoid all these spurious timeout issues, we try
>> to check the status in a loop until a certain timeout. Isn't that the case
>> in glusto? If my analysis is correct, you shouldn't be seeing this failure
>> on the 2nd attempt, as it's a race.
>>
>
> Thanks Atin.
>
> In this case there is no second check or timed check (a sleep, or
> otherwise an EXPECT_WITHIN-like construct).
>
> @Shwetha, can we fix up this test and give it another go?
>
>
>>
>>     In short glusterd has got an error when checking for rebalance stats
>>     from one of the nodes as:
>>     "Received commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d"
>>
>>     and the rebalance daemon on the node with that UUID is not really
>>     ready to serve requests when this was called, hence I am assuming
>>     this is causing the error. But this needs a once-over by one of you
>>     folks.
>>
>>     @Shwetha, can we add a further timeout between rebalance start and
>>     checking the status, just so that we avoid this timing issue on
>>     these nodes?
>>
>>     Thanks,
>>     Shyam
>>
>>     [a] glusto run:
>>     https://ci.centos.org/view/Gluster/job/gluster_glusto/377/
>>
>>     [b] analysis of the failure:
>>     https://paste.fedoraproject.org/paste/mk6ynJ0B9AH6H9ncbyru5w
>>
>>     On 08/25/2017 04:29 PM, Shyam Ranganathan wrote:
>>
>>         Nigel was kind enough to kick off a glusto run on 3.12 head a
>>         couple of days back. The status can be seen here [1].
>>
>>         The run failed, but managed to get past what Glusto does on
>>         master (see [2]). Not that this is a consolation, but just
>>         stating the fact.
>>
>>         The run [1] failed at,
>>         17:05:57
>>         functional/bvt/test_cvt.py::TestGlusterHealSanity_dispersed_glusterfs::test_self_heal_when_io_in_progress
>>         FAILED
>>
>>         The test case failed due to,
>>         17:10:28 E       AssertionError: ('Volume %s : All process are
>>         not online', 'testvol_dispersed')
>>
>>         The test case can be seen here [3], and the reason for failure
>>         is that Glusto did not wait long enough for the down brick to
>>         come up (it waited for 10 seconds, but the brick came up after
>>         12 seconds, or within the same second as the test for it being
>>         up). The log snippets pointing to this problem are here [4]. In
>>         short, there was no real bug or issue that caused the failure as
>>         yet.
>>
>>         Glusto as a gating factor for this release was desirable, but
>>         having got this far on 3.12 does help.
>>
>>         @nigel, we could try another run after increasing the timeout
>>         between bringing the brick up and checking if it is up. Let me
>>         know if that works, and what is needed from me to get this
>>         going.
>>
>>         Shyam
>>
>>         [1] Glusto 3.12 run:
>>         https://ci.centos.org/view/Gluster/job/gluster_glusto/365/
>>
>>         [2] Glusto on master:
>>         https://ci.centos.org/view/Gluster/job/gluster_glusto/360/testReport/functional.bvt.test_cvt/
>>
>>
>>         [3] Failed test case:
>>         https://ci.centos.org/view/Gluster/job/gluster_glusto/365/testReport/functional.bvt.test_cvt/TestGlusterHealSanity_dispersed_glusterfs/test_self_heal_when_io_in_progress/
>>
>>
>>         [4] Log analysis pointing to the failed check:
>>         https://paste.fedoraproject.org/paste/znTPiFLrc2~vsWuoYRToZA
>>
>>         "Releases are made better together"
>>         _______________________________________________
>>         Gluster-devel mailing list
>>         [email protected] <mailto:[email protected]>
>>         http://lists.gluster.org/mailman/listinfo/gluster-devel
>>         <http://lists.gluster.org/mailman/listinfo/gluster-devel>
>>