sunhelly commented on issue #354: HBASE-20368 Fix RIT stuck when a rsgroup has no online servers but AM…
URL: https://github.com/apache/hbase/pull/354#issuecomment-508384225

@jatsakthi Thanks for reviewing.

> what's the difference between already existing testKillRS in the same class vs the newly added testKillAllRSInGroupAndThenAddNew?

My patch was completed in April 2018, while testKillRS was added in October 2018. I'm sorry that I did not take testKillRS into account when adding testKillAllRSInGroupAndThenAddNew. Having now looked through the two UTs, the difference is that testKillRS disables and then enables the table to make sure that the table regions are reassigned. In testKillAllRSInGroupAndThenAddNew, I expect the table regions to be reassigned and opened correctly, with no manual intervention, once an rsgroup that had no online servers has online servers again, which can happen in a real production environment.

> Without the changes, the newly added test throws: ConstraintException: Target RSGroup my_group is same as source Name:my_group, Servers:[192.168.0.69:60479, 192.168.0.69:60482], Tables:[Group_testKillAllRSInGroupAndThenAddNew] RSGroup. Is that what we were basing our test to fail upon?

Also, because the code has changed since the patch was written, there are new problems. First, I will explain why `ConstraintException: Target RSGroup ... is same as ...` is thrown, following the steps of the UT:

1. Add ut_group, which has only one server.
2. Move ut_table to ut_group (all regions of the table will then be on the rsgroup server).
3. The ut_group server crashes.
4. ServerCrashProcedure starts and reassigns the group's table regions, starting TransitRegionStateProcedure.
5. assignCandidate is null, so the table regions are assigned to 'lastHost', which is in the 'default' group.
6. AM processes the assignment plans through the LoadBalancer, which is now RSGroupBasedLoadBalancer. There, retainAssignment() corrects misplaced regions (see #205). The ut_table regions are all misplaced (the reason is in step 5: 'lastHost' belongs to the 'default' group), but the candidate list is empty, so they are assigned to the BOGUS server (localhost,1,1).
7. AM accepts the plans (acceptPlans) and tries to open the ut_table regions on the BOGUS server (localhost,1,1).
8. The ut_table regions inevitably fail to open, and the region location is set to `null` so that the regions are reassigned. Until ut_group has online servers, the ut_table regions stay in the failed_open state and their location may be null.
9. Move a new server to ut_group. When checking the regions on the new server, getRegions() is called (see #115 in RSGroupAdminServer), and at #128, getting the location address throws an NPE.
10. The HBase client receives the exception and retries moveServer(), and then gets `ConstraintException: Target RSGroup ... is same as source Name...`.

The detailed error logs are:

> 2019-07-04 13:57:53,911 ERROR [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=41041] ipc.RpcServer(432): Unexpected throwable object
> java.lang.NullPointerException
> at org.apache.hadoop.hbase.rsgroup.RSGroupAdminServer.getRegions(RSGroupAdminServer.java:129)
> at org.apache.hadoop.hbase.rsgroup.RSGroupAdminServer.moveServerRegionsFromGroup(RSGroupAdminServer.java:215)
> at org.apache.hadoop.hbase.rsgroup.RSGroupAdminServer.moveServers(RSGroupAdminServer.java:326)
> at org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl.moveServers(RSGroupAdminEndpoint.java:218)
> at org.apache.hadoop.hbase.protobuf.generated.RSGroupAdminProtos$RSGroupAdminService.callMethod(RSGroupAdminProtos.java:13870)
> at org.apache.hadoop.hbase.master.MasterRpcServices.execMasterService(MasterRpcServices.java:889)
> at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:374)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 2019-07-04 13:57:53,912 DEBUG [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=41041] ipc.CallRunner(144): callId: 33 service: MasterService methodName: ExecMasterService size: 104 connection: 127.0.0.1:38874 deadline: 1562219933867, exception=java.io.IOException
> 2019-07-04 13:57:54,025 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=41041] rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(211): Client=haxiaolin//127.0.0.1 move servers [localhost:41689] to rsgroup my_group
> 2019-07-04 13:57:54,025 DEBUG [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=41041] ipc.CallRunner(144): callId: 34 service: MasterService methodName: ExecMasterService size: 104 connection: 127.0.0.1:38874 deadline: 1562219934024, exception=org.apache.hadoop.hbase.constraint.ConstraintException: Target RSGroup my_group is same as source Name:my_group, Servers:[localhost:36041, localhost:41689], Tables:[Group_testKillAllRSInGroupAndThenAddNew] RSGroup.
>
> org.apache.hadoop.hbase.constraint.ConstraintException: org.apache.hadoop.hbase.constraint.ConstraintException: Target RSGroup my_group is same as source Name:my_group, Servers:[localhost:36041, localhost:41689], Tables:[Group_testKillAllRSInGroupAndThenAddNew] RSGroup.
> at org.apache.hadoop.hbase.rsgroup.RSGroupAdminServer.moveServers(RSGroupAdminServer.java:299)
> at org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl.moveServers(RSGroupAdminEndpoint.java:218)
> at org.apache.hadoop.hbase.protobuf.generated.RSGroupAdminProtos$RSGroupAdminService.callMethod(RSGroupAdminProtos.java:13870)

I think my patch works well because I added a BOGUS check in AM, which retries the reassignment of these regions instead of accepting the plans and opening the regions on the BOGUS server. However, the patch's changes to RSGroupBasedLoadBalancer are useless in this case. Maybe they should be removed, because queueAssign() in TransitRegionStateProcedure was changed by Duo after my patch, and a region will now always be misplaced when its last location is outside the group and it is reassigned. I think the handling of misplaced regions in RSGroupBasedLoadBalancer is OK.
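
To make the BOGUS check idea concrete, here is a minimal, self-contained sketch. It is not the patch's code and does not use HBase classes: the class `BogusPlanCheckSketch`, the method `splitPlans`, and the use of plain strings for region and server names are all assumptions made for illustration. Only the placeholder address `localhost,1,1` comes from the description above. The sketch splits balancer plans into those that can be accepted and those that should be retried later because the destination is the BOGUS placeholder.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the HBase AssignmentManager code. */
final class BogusPlanCheckSketch {
  // Placeholder destination, quoted above as "localhost,1,1".
  static final String BOGUS_SERVER = "localhost,1,1";

  /** Plans that may be opened now, plus regions that must be retried later. */
  static final class Result {
    final Map<String, String> accepted = new LinkedHashMap<>();
    final List<String> retry = new ArrayList<>();
  }

  /**
   * Splits region -> destination plans. A plan whose destination is missing
   * or is the BOGUS placeholder is not accepted; its region is queued for retry.
   */
  static Result splitPlans(Map<String, String> plans) {
    Result result = new Result();
    for (Map.Entry<String, String> plan : plans.entrySet()) {
      String destination = plan.getValue();
      if (destination == null || BOGUS_SERVER.equals(destination)) {
        result.retry.add(plan.getKey());
      } else {
        result.accepted.put(plan.getKey(), destination);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> plans = new LinkedHashMap<>();
    plans.put("region-a", BOGUS_SERVER);                 // group has no online server
    plans.put("region-b", "rs1.example.org,16020,1");    // hypothetical live server
    Result result = splitPlans(plans);
    System.out.println("accepted=" + result.accepted + " retry=" + result.retry);
  }
}
```

With this shape, a region whose group has no online server never reaches the "open on BOGUS server" step (step 7), so it does not end up in failed_open with a null location; it simply waits for a later retry.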
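
The NPE in step 9 comes from reading the location of a region whose location is null. As a rough illustration of that failure mode, and of one possible guard (an assumption, not something the patch is stated to do), the sketch below looks up the regions hosted on a server while skipping regions with no location. The class `NullSafeRegionLookupSketch`, the method `regionsOn`, and the string-based region and server names are hypothetical and do not mirror `RSGroupAdminServer.getRegions()`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the RSGroupAdminServer code. */
final class NullSafeRegionLookupSketch {

  /**
   * Returns the regions whose current location equals the given server.
   * Regions with a null location (for example, failed_open regions waiting
   * to be reassigned) are skipped instead of being dereferenced.
   */
  static List<String> regionsOn(String server, Map<String, String> regionLocations) {
    List<String> regions = new ArrayList<>();
    for (Map.Entry<String, String> entry : regionLocations.entrySet()) {
      String location = entry.getValue();
      if (location == null) {
        continue; // no location yet: calling a method on it would throw an NPE
      }
      if (location.equals(server)) {
        regions.add(entry.getKey());
      }
    }
    return regions;
  }

  public static void main(String[] args) {
    Map<String, String> locations = new LinkedHashMap<>();
    locations.put("region-a", null);               // failed_open, location cleared
    locations.put("region-b", "localhost:41689");
    System.out.println(regionsOn("localhost:41689", locations));
  }
}
```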
