Thanks for the quick support, Wu Sheng!

Thanks & Best Regards

Xiaochao Zhang(James)
DI SW CAS MP EMK DO-CHN
No.7, Xixin Avenue, Chengdu High-Tech Zone
Chengdu, China  611731
Email: [email protected]

From: Sheng Wu <[email protected]>
Sent: Thursday, January 16, 2020 10:47 AM
To: dev <[email protected]>
Subject: Re: Question about 
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: 
DEADLINE_EXCEEDED

Inline


Zhang, James <[email protected]> wrote on Thursday, January 16, 2020 at 10:34 AM:
Dear SkyWalking Dev team,
I have deployed the SkyWalking Java agent and the UI/OAP/ES services into our backend 
microservices K8S cluster. During our JMeter performance testing we found many 
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: 
DEADLINE_EXCEEDED logs on both the agent side and the OAP server side.
Agent side:
ERROR 2020-01-14 03:50:52:070 
SkywalkingAgent-5-ServiceAndEndpointRegisterClient-0 
ServiceAndEndpointRegisterClient : ServiceAndEndpointRegisterClient execute 
fail.
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: 
DEADLINE_EXCEEDED
        at 
org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
ERROR 2020-01-14 03:46:22:069 SkywalkingAgent-4-JVMService-consume-0 JVMService 
: send JVM metrics to Collector fail.
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: 
DEADLINE_EXCEEDED
        at 
org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)

OAP server side:
2020-01-14 03:53:18,935 - 
org.apache.skywalking.oap.server.core.remote.client.GRPCRemoteClient -147226067 
[grpc-default-executor-863] ERROR [] - DEADLINE_EXCEEDED: deadline exceeded 
after 19999979082ns
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 
19999979082ns
               at io.grpc.Status.asRuntimeException(Status.java:526) 
~[grpc-core-1.15.1.jar:1.15.1]

The respective Instance Throughput curves also differ: the instance with the exception 
logs shows a non-flat curve, while the instance without exception logs shows a flat curve.
[Two inline screenshots compared the non-flat and flat Instance Throughput curves.]

I checked TraceSegmentServiceClient and the related source code, and found that this 
exception on the agent side goes through the error-consume path, but the failed data 
is not counted into the abandoned data size statistics.
[An inline screenshot showed the relevant source code.]

I'm wondering whether, when this gRPC exception occurs, the trace data sent to the 
OAP server is lost or not?

Most likely, lost.


If the trace data is lost, why is the lost data not counted into the abandoned data 
statistics? And is the metric calculation during the time range of the data loss 
distorted due to incomplete trace data collection?

Because the upload uses gRPC streaming, we don't know how many segments were lost.
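
To illustrate (a minimal sketch, not SkyWalking source; TraceReportStub, Segment, Ack 
and batch are illustrative names, only the gRPC StreamObserver/deadline API is real): 
with a client-streaming call the agent writes many segments into one stream and 
receives a single onCompleted() or onError() for the whole stream, so a 
DEADLINE_EXCEEDED says the upload failed without saying how many of the 
already-written segments reached the server.

    import io.grpc.stub.StreamObserver;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    class StreamingUploadSketch {
        void upload(TraceReportStub stub, List<Segment> batch) {
            // Open one client-streaming call with a deadline, like the one expiring in the logs above.
            StreamObserver<Segment> upstream = stub
                    .withDeadlineAfter(20, TimeUnit.SECONDS)
                    .collect(new StreamObserver<Ack>() {
                        @Override public void onNext(Ack ack) { /* one ack for the whole stream */ }

                        @Override public void onError(Throwable t) {
                            // DEADLINE_EXCEEDED surfaces here, once, for the entire stream.
                            // gRPC gives no per-message acknowledgement, so the client cannot
                            // tell how many of the segments written below actually arrived,
                            // hence there is no exact number to add to an "abandoned" counter.
                        }

                        @Override public void onCompleted() { /* whole batch accepted */ }
                    });
            // Fire-and-forget writes within the stream; failures are only reported via onError above.
            for (Segment segment : batch) {
                upstream.onNext(segment);
            }
            upstream.onCompleted();
        }
    }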



Is there any configuration needed on the agent and/or OAP server side to resolve this 
gRPC exception issue and avoid trace data loss?

I think you should increase the backend resources or resolve the network instability 
issue.



P.S.
I also ran into the "trace segment has been abandoned, cause by buffer is full" issue 
before, because the default 5*300 buffer was not enough. In that case the trace data 
is dropped directly on the agent side, before it is ever sent to the OAP collector.

5 * 3000 should be enough for most users unless your system is under very high load 
or the network is unstable, as I said above. When you say 10 * 3000 is better, I am 
guessing your network, or its performance, is not stable, so you need more buffer at 
the agent side to hold the data.


However, after I increased the agent-side trace data buffer to 10*3000, this 
abandoned-segment issue never occurred again. The log message was:
http-nio-0.0.0.0-9090-exec-23 TraceSegmentServiceClient : One trace segment has 
been abandoned, cause by buffer is full.
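
For reference, the agent-side buffer mentioned above is set in agent.config. A minimal 
sketch of the enlarged setting (the key names are the ones I recall from the 6.x agent 
config, so please double-check against your agent version; the stock defaults are 
smaller):

    # agent.config: 10 channels * 3000 slots per channel, as discussed above
    buffer.channel_size=10
    buffer.buffer_size=3000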

Thanks & Best Regards

Xiaochao Zhang(James)
DI SW CAS MP EMK DO-CHN
No.7, Xixin Avenue, Chengdu High-Tech Zone
Chengdu, China  611731
Email: [email protected]
