Please answer these questions before submitting your issue.

- Why do you submit this issue?
- [ ] Question or discussion
- [ ] Bug
- [ ] Requirement
- [x] Feature or performance improvement

___
### Question
- What do you want to know?

___
### Bug
- Which version of SkyWalking, OS and JRE?

- Which company or project?

- What happened?
If possible, provide a way to reproduce the error, e.g. a demo application or 
the component versions.

___
### Requirement or improvement
- Please describe your requirements or improvement suggestions.

## Background
When a user upgrades SkyWalking, the data models of the old and new versions 
may be inconsistent, which prevents the server from starting normally. In that 
case the user may resort to clearing the storage, and the registration data is 
lost. The UI can then no longer display the metrics reported by clients whose 
registration information is missing. Under the existing mechanism, the user 
must restart the client to re-register it, but restarting a business system 
because of a problem in the monitoring system is not acceptable.
So we need a mechanism to re-register without restarting the business system.

## Ideas
When registration data is lost on the server side, there are two possible 
remedies:
* Push the registration data cached by the client to the server again. However, 
the IDs in the cached registration data may already have been taken by other 
newly registered clients, and resolving such conflicts is costly.
* Reset the registration data of the affected client. The key elements of this 
solution are how to identify the affected client and how to deliver the reset 
instruction to it.

### Key issues
#### Unique identification
At present, the client automatically generates a globally unique agentUUID as 
the unique identifier of the client instance. However, operations staff cannot 
readily map this ID to a specific client. Therefore, a client instance name 
attribute is added to the startup file and startup parameters, to be set 
manually by the user at deployment time.
Because the recovery function is optional rather than essential, the existing 
ability to generate a globally unique agentUUID automatically is retained. The 
generated agentUUID is overridden only when the user specifies the client 
instance name in the startup file or startup parameters.
To avoid modifying the 5.x protocol, which would also involve the probes for 
other languages, the instance name attribute is added to the heartbeat 
interface of the 6.x protocol.

#### Problem finding
A client whose registration data is missing is not aware of it; only the 
server can detect this by parsing the data the client reports. Doing the check 
in the trace interface is impractical, because the volume of trace data is too 
large and the performance cost too high. The instance heartbeat interface is 
therefore used to discover affected clients. However, the heartbeat currently 
reports only the instance ID. To give a user-friendly prompt, this interface 
must be modified to include the instance name attribute. The server then checks 
the ID and the instance name together and writes an error log entry that 
identifies the problematic instance.

#### Issuing the instruction
Given the scenario this solution targets, the reset instruction does not need 
to be sent to the client through the server; the operator logs in to the 
client host directly and issues it there.
For security reasons, the client must not open a network interface to receive 
instructions, so the instruction is issued through a file that the client 
scans and watches.
To give the operator feedback after the instruction is issued, the client 
updates the status information in the file to report on the execution of the 
reset.

## Solution

### About the unique identifier
Configure the instance_code field in agent.conf with a globally unique and 
meaningful value, for example serviceCode_ip_1, so that operations staff can 
quickly identify the server on which the agent runs.
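
The fallback rule described under "Key issues" (use the configured name if present, otherwise keep the generated identifier) could look roughly like the sketch below. This is only an illustration in Python, not agent code: the function name `resolve_instance_identifier`, the dict-based `config` and `cli_args`, and the precedence of startup parameters over the file are all assumptions.

```python
import uuid


def resolve_instance_identifier(config, cli_args=None):
    """Return the identifier the agent would report for this instance.

    `config` stands in for values read from the agent configuration file and
    `cli_args` for startup parameters; both are hypothetical plain dicts.
    """
    cli_args = cli_args or {}
    # Assumption: a startup parameter overrides the configuration file.
    instance_code = cli_args.get("instance_code") or config.get("instance_code")
    if instance_code and instance_code.strip():
        return instance_code.strip()
    # No name configured: keep the automatically generated unique identifier.
    return uuid.uuid4().hex
```

With this shape, an agent deployed without instance_code behaves as before, and only agents that opt in get a human-readable identifier.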

### About problem finding
The instance heartbeat protocol adds instance_code. The server looks up the id 
in the cache and in persistent storage. If a record is found, it checks whether 
the instance_code is consistent. If the instance_code is inconsistent or the id 
cannot be found, the server writes a log entry (including the instance_code) 
telling the user to reset the agent.
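
The check can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the collector's implementation: `cache` and `storage` are plain dicts mapping instance id to instance_code, and `check_heartbeat` is a hypothetical name.

```python
import logging

logger = logging.getLogger("heartbeat")


def check_heartbeat(instance_id, instance_code, cache, storage):
    """Return True if the heartbeat matches a registered instance.

    `cache` and `storage` are hypothetical dicts standing in for the server's
    in-memory cache and persistent storage.
    """
    registered_code = cache.get(instance_id)
    if registered_code is None:
        # Cache miss: fall back to persistent storage and warm the cache.
        registered_code = storage.get(instance_id)
        if registered_code is not None:
            cache[instance_id] = registered_code

    if registered_code is None:
        logger.error(
            "Instance id %s (instance_code=%s) is not registered; reset the agent.",
            instance_id, instance_code)
        return False
    if registered_code != instance_code:
        logger.error(
            "Instance id %s is registered as %s but reported instance_code=%s; "
            "reset the agent.", instance_id, registered_code, instance_code)
        return False
    return True
```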

### About the instruction
- The listener thread checks the status value in the register.status file every 
10 seconds. If the value is register, it clears the caches of service, 
instance, network, and endpoint.

- Because segments generated on the agent side carry network_id and 
service_id, segments produced after the caches are cleared are discarded 
directly until the network and endpoint registrations have returned. Until the 
service and instance registrations have returned, segments are not converted.

- Reset status feedback. After the caches are cleared, status is set to the 
empty string "". The changes in service_id and instance_id are written to the 
file during the reset process.
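
One pass of the listener described above might look like the following sketch. It is written in Python for brevity and works on in-memory dicts only; `poll_once`, `run_listener`, and the shape of `status_record` and `caches` are assumptions made for illustration. Segment handling during re-registration is left out.

```python
import threading

CACHE_NAMES = ("service", "instance", "network", "endpoint")


def poll_once(status_record, caches):
    """Run one check of the status value; return True if a reset was done.

    `status_record` is a hypothetical dict with a "status" key and `caches`
    maps each cache name to a dict.
    """
    if status_record.get("status") != "register":
        return False
    for name in CACHE_NAMES:
        caches[name].clear()
    # Feedback for the operator: an empty status means the reset was picked up.
    status_record["status"] = ""
    return True


def run_listener(status_record, caches, stop_event, interval_seconds=10):
    """Repeat the check every `interval_seconds` until `stop_event` is set."""
    while not stop_event.is_set():
        poll_once(status_record, caches)
        stop_event.wait(interval_seconds)
```

`run_listener` would typically run in a daemon `threading.Thread`, with a `threading.Event` as `stop_event` so that the loop can be stopped without waiting for a full interval.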

### About reading and writing the register.status file
The file contains three attributes:
- status: setting it to register triggers re-registration.
- service_id: the current service_id value, telling the user that the 
service_id was registered successfully.
- instance_id: the current instance_id value, telling the user that the 
instance_id was registered successfully.
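
The proposal does not fix the on-disk format of register.status. Assuming, purely for illustration, one `key=value` pair per line, reading and writing the file content could be sketched in Python as below; `parse_status` and `format_status` are hypothetical helper names.

```python
STATUS_KEYS = ("status", "service_id", "instance_id")


def parse_status(text):
    """Parse register.status content (assumed key=value lines) into a dict."""
    record = {key: "" for key in STATUS_KEYS}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and malformed lines
        key, value = line.split("=", 1)
        key = key.strip().lower()
        if key in record:
            record[key] = value.strip()
    return record


def format_status(record):
    """Serialize the three attributes back to key=value lines."""
    return "".join(f"{key}={record.get(key, '')}\n" for key in STATUS_KEYS)
```

Under this assumed format, an operator would trigger a reset by setting `status=register`, and after the reset the agent would write back an empty status together with the new service_id and instance_id.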

[ Full content available at: 
https://github.com/apache/incubator-skywalking/issues/1631 ]
This message was relayed via gitbox.apache.org for [email protected]
