Please answer these questions before submitting your issue.

- Why do you submit this issue?
- [ ] Question or discussion
- [ ] Bug
- [ ] Requirement
- [x] Feature or performance improvement
___
### Question
- What do you want to know?

___
### Bug
- Which version of SkyWalking, OS and JRE?
- Which company or project?
- What happened? If possible, provide a way to reproduce the error, e.g. a demo application and component versions.

___
### Requirement or improvement
- Please describe your requirements or improvement suggestions.

## Background
When a user upgrades SkyWalking, the data models of the old and new versions can be inconsistent, which prevents the server from starting normally. The user may then resort to clearing the storage, which loses the registration data. The UI can then no longer display the metrics reported by clients whose registration information is missing. Under the existing mechanism, a client must be restarted to re-register, but restarting a business system because of a monitoring-system problem is not acceptable. We therefore need a mechanism to re-register without restarting the business system.

## Ideas
When registration data is lost on the server side, there are two possible compensation measures:
* Push the registration data cached by the client to the server again. However, the IDs in the cached registration data may already have been taken by other newly registered clients, and resolving such conflicts is costly.
* Reset the registration data of the affected client. The key elements of this solution are how to identify the affected client and how to deliver the reset command to it.

### Key issues
#### Unique identification
At present, the client automatically generates a globally unique agentUUID as the identifier of the client instance. However, operations staff cannot map this ID to a specific client. Therefore, a client instance name attribute is added to the startup file and startup parameters, to be set manually by the user at deployment.
Because the recovery function is optional rather than essential, automatic generation of the globally unique agentUUID is retained. The generated agentUUID is overridden only when the user specifies a client instance name in the startup file or startup parameters. To avoid modifying the 5.x protocol, which would also affect the probes for other languages, the instance name attribute is added to the heartbeat interface of the 6.x protocol.

#### Problem finding
A client whose registration data is missing is not aware of it. Only the server can detect the problem, by parsing the data the client reports. Checking in the trace interface is impractical: the volume of trace details is too large and the performance cost too high. The instance heartbeat interface is therefore used to discover affected clients. However, the heartbeat currently reports only the instance ID. To produce a friendly prompt, this interface must be modified to add the instance name attribute. The server checks the ID and the instance name together and writes the affected instance's information to the error log.

### Issuing the instruction
Because this function is a non-essential recovery measure, the instruction does not need to be sent to the client through the server. Instead, the operator logs in to the client host and issues the reset instruction there. For security reasons, the client cannot open a network interface to receive instructions, so the instruction is issued through a file that the client scans and listens to. To give the operator feedback, the client updates the status information in the file after the command is issued to report that the reset has been executed.
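The server-side heartbeat check described above can be sketched roughly as follows. This is a minimal illustration, not SkyWalking code: the `Heartbeat` type, the `check_heartbeat` function, and the in-memory `registry` dictionary (standing in for the server's cache and persistent storage) are all hypothetical names introduced here.

```python
import logging
from dataclasses import dataclass
from typing import Dict, Optional

logger = logging.getLogger("heartbeat-check")


@dataclass(frozen=True)
class Heartbeat:
    """Hypothetical heartbeat payload: the existing ID plus the proposed name."""
    instance_id: int
    instance_name: str


def check_heartbeat(hb: Heartbeat, registry: Dict[int, str]) -> bool:
    """Return True if the heartbeat matches a known registration.

    `registry` is an assumed stand-in for the server's cache and storage,
    mapping instance ID -> registered instance name.
    """
    registered_name: Optional[str] = registry.get(hb.instance_id)
    if registered_name is None:
        # The ID is unknown, e.g. because the storage was cleared.
        logger.error(
            "No registration for instance '%s' (id=%d); reset this agent.",
            hb.instance_name, hb.instance_id,
        )
        return False
    if registered_name != hb.instance_name:
        # The ID exists but now belongs to a different instance.
        logger.error(
            "Instance id=%d is registered as '%s' but reported by '%s'; "
            "reset the reporting agent.",
            hb.instance_id, registered_name, hb.instance_name,
        )
        return False
    return True
```

Both failure cases log the instance name, which is what lets an operator find the host that needs a reset.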
## Program
### About unique identifier
Configure the instance_code field in agent.conf with a value that is globally unique and meaningful, for example (serviceCode_ip_1), so that operations staff can quickly identify the server where the agent is located.

### About problem finding
The instance heartbeat protocol adds instance_code. The server looks up the ID in the cache and in persistent storage. If a record is found, it checks whether the instance_code is consistent. If the instance_code is inconsistent or the ID cannot be found, the server prints a log entry (including the instance_code) to notify the user to reset the agent.

### About the command
- The listener thread checks the status value in the register.status file every 10 seconds. If the value is register, it clears the caches of service, instance, network, and endpoint.
- Because segments produced on the agent side carry network_id and service_id, segments produced after the cache is emptied are discarded directly until the network and endpoint registrations have returned. Until the service and instance registrations have returned, segments are not converted.
- Reset status feedback. After the cache is cleared, status is set to the empty string "". The changes in service_id and instance_id are written to the file during the reset process.

### About register.status file read and write
Contains 3 attributes:
- Status -> when set to register, triggers registration.
- Service_id -> the current service_id value, telling the user that service_id registration succeeded.
- Instance_id -> the current instance_id value, telling the user that instance_id registration succeeded.

[ Full content available at: https://github.com/apache/incubator-skywalking/issues/1631 ]
This message was relayed via gitbox.apache.org for [email protected]
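The register.status handling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the issue does not specify the file format, so a simple `key=value` layout with lowercase keys is assumed, and `read_status_file`, `write_status_file`, `poll_once`, and the `cache` dictionary are hypothetical names rather than agent APIs.

```python
from pathlib import Path
from typing import Dict


def read_status_file(path: Path) -> Dict[str, str]:
    """Parse key=value lines; the format itself is an assumption."""
    props: Dict[str, str] = {}
    if not path.exists():
        return props
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def write_status_file(path: Path, props: Dict[str, str]) -> None:
    """Write the attributes back so the operator can read the outcome."""
    body = "".join(f"{key}={value}\n" for key, value in props.items())
    path.write_text(body, encoding="utf-8")


def poll_once(path: Path, cache: Dict[str, dict]) -> bool:
    """One listener iteration. Returns True if a reset was performed."""
    props = read_status_file(path)
    if props.get("status") != "register":
        return False
    # Clear the four caches named in the proposal.
    for name in ("service", "instance", "network", "endpoint"):
        cache.setdefault(name, {}).clear()
    # Feedback: an empty status tells the operator the reset ran.
    props["status"] = ""
    write_status_file(path, props)
    return True


def record_ids(path: Path, service_id: int, instance_id: int) -> None:
    """Write the newly registered IDs into the file after re-registration."""
    props = read_status_file(path)
    props["service_id"] = str(service_id)
    props["instance_id"] = str(instance_id)
    write_status_file(path, props)
```

A listener thread would call `poll_once` in a loop with a 10-second sleep between iterations, and the registration client would call `record_ids` once the new IDs have been returned.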
