On Thursday 05 November 2015 at 18:01 +0530, Avra Sengupta wrote:
> Hey Michael,
>
> Thanks, but I don't think that would be necessary anymore.
>
> Guys,
>
> I wrote a patch changing the brick status logs to INFO
> (http://review.gluster.org/#/c/12515/). Ironically, this patch too did
> not fail regression on the first go, but did fail on the next
> iteration. From what I see in the logs (given below), as I had
> suspected, the brick connectivity happens a tad bit after the clone
> command is executed. Now I don't know why this time delay happens on
> the regression setup (and even there not all the time), and never
> locally.
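As an aside, one quick way to quantify the delay described above is to compare the two relevant timestamps in glusterd's log. This is a minimal sketch; the log path and the message patterns are assumptions based on the excerpt quoted below:

    #!/bin/sh
    # Print the timestamps of the failed clone prevalidate and of the
    # brick-status update; the gap between them is the suspected race
    # window. The log path is the usual glusterd default and may differ
    # per setup.
    LOG=/var/log/glusterfs/etc-glusterfs-glusterd.vol.log
    grep -E 'Failed to pre validate|Setting brick .* status to started' "$LOG" |
        awk '{print $1, $2}'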
We are running in a cloud environment; there is no SLA for I/O performance. If the host is loaded, then we might see an I/O degradation. So if that's indeed some kind of race condition, it is quite tricky to find, and a slave wouldn't have helped much.

> I can think of various reasons for the same (slower regression
> machines being my prime suspect to begin with), but I can't say for
> sure. I will raise a bug for this, and try to modify the testcase
> accordingly.
>
> Logs:
> [2015-11-05 11:25:15.103233] E [MSGID: 106122] [glusterd-snapshot.c:2376:glusterd_snapshot_clone_prevalidate] 0-management: Failed to pre validate
> [2015-11-05 11:25:15.103265] E [MSGID: 106443] [glusterd-snapshot.c:2398:glusterd_snapshot_clone_prevalidate] 0-management: One or more bricks are not running. Please run snapshot status command to see brick status.
> Please start the stopped brick and then issue snapshot clone command
> [2015-11-05 11:25:15.103280] W [MSGID: 106443] [glusterd-snapshot.c:8398:glusterd_snapshot_prevalidate] 0-management: Snapshot clone pre-validation failed
> [2015-11-05 11:25:15.103294] W [MSGID: 106122] [glusterd-mgmt.c:166:gd_mgmt_v3_pre_validate_fn] 0-management: Snapshot Prevalidate Failed
> [2015-11-05 11:25:15.103305] E [MSGID: 106122] [glusterd-mgmt.c:820:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Snapshot on local node
> [2015-11-05 11:25:15.103315] E [MSGID: 106122] [glusterd-mgmt.c:2166:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Pre Validation Failed
> [2015-11-05 11:25:15.103332] E [MSGID: 106027] [glusterd-snapshot.c:7946:glusterd_snapshot_clone_postvalidate] 0-management: unable to find clone clone1 volinfo
> [2015-11-05 11:25:15.103342] W [MSGID: 106444] [glusterd-snapshot.c:8837:glusterd_snapshot_postvalidate] 0-management: Snapshot create post-validation failed
> [2015-11-05 11:25:15.103352] W [MSGID: 106121] [glusterd-mgmt.c:323:gd_mgmt_v3_post_validate_fn] 0-management: postvalidate operation failed
> [2015-11-05 11:25:15.103362] E [MSGID: 106121] [glusterd-mgmt.c:1585:glusterd_mgmt_v3_post_validate] 0-management: Post Validation failed for operation Snapshot on local node
> [2015-11-05 11:25:15.103372] E [MSGID: 106122] [glusterd-mgmt.c:2286:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Post Validation Failed
> [2015-11-05 11:25:15.109994]:++++++++++ G_LOG:./tests/bugs/snapshot/bug-1275616.t: TEST: 42 42 149 snap_info_volume CLI Snaps Available patchy ++++++++++
> [2015-11-05 11:25:15.239358]:++++++++++ G_LOG:./tests/bugs/snapshot/bug-1275616.t: TEST: 43 43 150 snap_config_volume CLI snap-max-hard-limit patchy ++++++++++
> [2015-11-05 11:25:15.378255]:++++++++++ G_LOG:./tests/bugs/snapshot/bug-1275616.t: TEST: 45 45 200 snap_info_volume CLI Snaps Available clone1 ++++++++++
> [2015-11-05 11:25:15.501970] E [MSGID: 106027] [glusterd-snapshot.c:3574:glusterd_snapshot_get_info_by_volume] 0-management: Volume (clone1) does not exist [Invalid argument]
> [2015-11-05 11:25:15.502024] E [MSGID: 106027] [glusterd-snapshot.c:3766:glusterd_handle_snapshot_info] 0-management: Failed to get volume info of volume clone1 [Invalid argument]
> [2015-11-05 11:25:15.502061] W [MSGID: 106063] [glusterd-snapshot.c:9082:glusterd_handle_snapshot_fn] 0-management: Snapshot info failed
> [2015-11-05 11:25:15.510016]:++++++++++ G_LOG:./tests/bugs/snapshot/bug-1275616.t: TEST: 46 46 200 snap_config_volume CLI snap-max-hard-limit clone1 ++++++++++
> [2015-11-05 11:25:15.639515] E [MSGID: 106060] [glusterd-snapshot.c:438:snap_max_limits_display_commit] 0-management: Volume (clone1) does not exist
> [2015-11-05 11:25:15.639543] E [MSGID: 106090] [glusterd-snapshot.c:1446:glusterd_handle_snapshot_config] 0-management: snap-max-limit display commit failed.
> [2015-11-05 11:25:15.639558] W [MSGID: 106045] [glusterd-snapshot.c:9101:glusterd_handle_snapshot_fn] 0-management: snapshot config failed
> [2015-11-05 11:25:15.684746] I [glusterd-utils.c:4883:glusterd_set_brick_status] 0-glusterd: Setting brick slave28.cloud.gluster.org:/var/run/gluster/snaps/7db8306c170541eb98c02633407bf625/brick1 status to started
>
> Regards,
> Avra
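For what it's worth, the excerpt shows roughly 0.58 s between the failed prevalidate (11:25:15.103) and glusterd setting the brick status to started (11:25:15.684), which fits the race theory. One possible shape for the testcase change mentioned above, as a sketch only: EXPECT_WITHIN, PROCESS_UP_TIMEOUT, TEST, $CLI and $V0 are the usual test-framework constructs from include.rc, while the snap_brick_up helper and its 'Brick Running' field match are hypothetical:

    # Hypothetical helper: report 1 once snapshot status shows the
    # snap's brick as running (the field name is assumed from the
    # CLI's snapshot status output).
    function snap_brick_up {
            local snap=$1
            $CLI snapshot status $snap | grep 'Brick Running' | grep -c 'Yes'
    }

    TEST $CLI snapshot create snap3 $V0
    # Wait for glusterd to register the snap brick as started before
    # cloning, instead of assuming it happens immediately.
    EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" snap_brick_up snap3
    TEST $CLI snapshot clone clone1 snap3

Polling this way drops the implicit assumption that the brick is already registered as started by the time the clone command runs.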
> On 11/05/2015 05:07 PM, Michael Scherer wrote:
> > On Thursday 05 November 2015 at 15:59 +0530, Avra Sengupta wrote:
> >> On 11/05/2015 03:57 PM, Avra Sengupta wrote:
> >>> On 11/05/2015 03:56 PM, Vijay Bellur wrote:
> >>>> On Thursday 05 November 2015 12:19 PM, Avra Sengupta wrote:
> >>>>> Hi,
> >>>>>
> >>>>> We investigated the logs in the regression failures that
> >>>>> encountered this, and the findings are as follows:
> >>>>> 1. The snapshot clone failure is indeed the reason for the failure.
> >>>>> 2. snapshot clone failed in pre-validation with the error that
> >>>>> the brick of snap3 is not up and running.
> >>>>> 3. snap3 was created, and subsequently started (because
> >>>>> activate-on-create is enabled), long before we tried to create a
> >>>>> clone out of it.
> >>>>> 4. snap3's brick shows no failure logs, and thereby gives us no
> >>>>> reason to believe that it did not start properly in the course of
> >>>>> the testcase.
> >>>>> 5. Which leaves us with the assumption (it is an assumption
> >>>>> because we do not have any logs backing it) that there was some
> >>>>> delay either in the start of the brick process for snap3, or in
> >>>>> glusterd registering that it had started, and before either of
> >>>>> these events could happen the clone command got executed and
> >>>>> failed. This would make it a race.
> >>>>>
> >>>>> Some other things to consider about the particular testcase:
> >>>>> 1. It did pass (and still passes consistently) on our local
> >>>>> systems, making the failure not reproducible locally.
> >>>>> 2. The patch was merged after both Linux and NetBSD regressions
> >>>>> passed (at one go).
> >>>>> 3. The release-3.7 backport of the same patch has also passed
> >>>>> both the Linux and NetBSD regressions as of now.
> >>>>>
> >>>>> The rationale behind mentioning the above three points is that
> >>>>> this testcase has passed locally as well as on the regression
> >>>>> setups (not just at the time of merge, but even now), which
> >>>>> brings me back to the assumption mentioned in point #5. To get
> >>>>> more clarity on that assumption, we need access to one of the
> >>>>> regression setups, so that we can try reproducing the failure in
> >>>>> that environment and get some proof of what is really happening.
> >>>>>
> >>>>> Vijay,
> >>>>>
> >>>>> Could you please provide us with a Jenkins Linux slave to perform
> >>>>> the above-mentioned validation?
> >>>>>
> >>>> Please send out a request on gluster-infra if not done so already,
> >>>> and Michael Scherer should be able to help.
> >>>>
> >>>> Thanks!
> >>>> Vijay
> >>>>
> >>> + Adding gluster-infra and Michael
> >>>
> >>> Could you please provide us with a Jenkins Linux slave to perform
> >>> the above-mentioned validation?
> >
> > So you just want a single CentOS 6 Gluster slave? Who needs access
> > to it, and for how long?
> >
> > Can you provide an ssh key, so I can create a snapshot and give it
> > to you?

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
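If it helps, generating a dedicated key for the slave is a one-liner; a minimal example where the comment string and file name are placeholders:

    # Generate a dedicated key pair for slave access and share only the
    # public half; the private key never leaves the requester's machine.
    ssh-keygen -t rsa -b 4096 -C "avra@regression-slave" -f ~/.ssh/id_rsa_gluster_slave
    cat ~/.ssh/id_rsa_gluster_slave.pub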
