Inline

Zhang, James <[email protected]> 于2020年1月16日周四 上午10:34写道:

> Dear Skywalking Dev team,
>
> I have deployed the SkyWalking Java agent & the UI/OAP/ES services into our backend
> microservices K8S cluster. During our JMeter performance testing we found
> many *org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException:
> DEADLINE_EXCEEDED* logs on both the agent side and the OAP server side.
>
> Agent side:
>
> ERROR 2020-01-14 03:50:52:070
> SkywalkingAgent-5-ServiceAndEndpointRegisterClient-0
> ServiceAndEndpointRegisterClient : ServiceAndEndpointRegisterClient execute
> fail.
>
> org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException:
> DEADLINE_EXCEEDED
>
>         at
> org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
>
> ERROR 2020-01-14 03:46:22:069 SkywalkingAgent-4-JVMService-consume-0
> JVMService : send JVM metrics to Collector fail.
>
> org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException:
> DEADLINE_EXCEEDED
>
>         at
> org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
>
>
>
> OAP server side:
>
> 2020-01-14 03:53:18,935 -
> org.apache.skywalking.oap.server.core.remote.client.GRPCRemoteClient
> -147226067 [grpc-default-executor-863] ERROR [] - DEADLINE_EXCEEDED:
> deadline exceeded after 19999979082ns
>
> io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after
> 19999979082ns
>
>                at io.grpc.Status.asRuntimeException(Status.java:526)
> ~[grpc-core-1.15.1.jar:1.15.1]
>
>
>
> The respective Instance Throughput curves also differ (screenshots attached):
> a non-flat curve (with Exception logs) vs. a flat curve (no Exception logs).
>
> I checked *TraceSegmentServiceClient* and the related source code, and found
> that this exception on the agent side is handled as an error-consume behavior,
> but the failed data is not counted into the abandoned data size statistics.
>
>
>
> *I’m wondering: when this gRPC exception occurs, is the trace data that was sent
> to the OAP server lost or not?*
>

Most likely, lost.



> *In case the trace data is lost, why is the lost data not counted into the
> abandoned data statistics? And is the metric calculation for the time range of
> the data loss distorted due to incomplete trace data collection?*
>

Because we use gRPC streaming, we don't know how many segments were lost.
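
To illustrate that point, here is a minimal, hypothetical sketch (not SkyWalking's
actual TraceSegmentServiceClient code) of how gRPC client streaming behaves: the
sender pushes segments with onNext(), and a DEADLINE_EXCEEDED failure arrives later
as a single onError() on the response observer, so there is no per-segment
acknowledgment to count against.

    import io.grpc.Status;
    import io.grpc.stub.StreamObserver;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    public class StreamingLossSketch {

        // Push all segments onto the request stream; there is no per-message ack.
        static <T> void push(List<T> segments, StreamObserver<T> requestStream,
                             AtomicInteger sent) {
            for (T segment : segments) {
                requestStream.onNext(segment);
                sent.incrementAndGet();
            }
            requestStream.onCompleted();
        }

        // The response observer only receives a stream-level terminal signal.
        static <R> StreamObserver<R> responseObserver(AtomicInteger sent) {
            return new StreamObserver<R>() {
                @Override public void onNext(R summary) {
                    // a single summary message, not one ack per segment
                }

                @Override public void onError(Throwable t) {
                    // e.g. DEADLINE_EXCEEDED: we only know the stream failed after
                    // `sent` onNext() calls, not how many of them were consumed.
                    System.err.println("stream failed with "
                            + Status.fromThrowable(t).getCode()
                            + " after " + sent.get() + " segments were sent");
                }

                @Override public void onCompleted() { }
            };
        }
    }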



>
>
> *Is there any configuration needed on the agent and/or OAP server side to resolve
> this gRPC exception issue and avoid the trace data loss?*
>

I think you should increase the backend resources or resolve the network
instability issue.
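
For background, DEADLINE_EXCEEDED is raised by gRPC when a call's deadline expires
before the server responds, which is why it shows up under backend overload or an
unstable network. A hedged, generic illustration using the standard gRPC
health-check service (not SkyWalking's own client code; host and port are
placeholders, 11800 being the usual OAP gRPC port):

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;
    import io.grpc.Status;
    import io.grpc.StatusRuntimeException;
    import io.grpc.health.v1.HealthCheckRequest;
    import io.grpc.health.v1.HealthGrpc;
    import java.util.concurrent.TimeUnit;

    public class DeadlineDemo {
        public static void main(String[] args) {
            // Placeholder endpoint; point it at a real gRPC server to try it out.
            ManagedChannel channel = ManagedChannelBuilder
                    .forAddress("oap.example.com", 11800)
                    .usePlaintext()
                    .build();
            try {
                // If the server does not answer within 20 seconds, the call fails with
                // StatusRuntimeException: DEADLINE_EXCEEDED, like the logs above.
                HealthGrpc.newBlockingStub(channel)
                        .withDeadlineAfter(20, TimeUnit.SECONDS)
                        .check(HealthCheckRequest.newBuilder().build());
            } catch (StatusRuntimeException e) {
                if (e.getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED) {
                    System.err.println("server too slow or unreachable within the deadline");
                }
            } finally {
                channel.shutdownNow();
            }
        }
    }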



>
>
> *P.S.*
>
> I also ran into the “*trace segment has been abandoned, cause by buffer is full*”
> issue before, because the default 5*300 buffer was not enough. In that case the
> trace data is lost directly at the agent side, before it is ever sent to the OAP
> collector.
>

5 * 3000 should be enough for most users, unless your system is under very high
load or the network is unstable, as I said above. Since you said 10 * 3000 works
better, I am guessing your network is not stable or its performance fluctuates, so
you need a larger buffer at the agent side to hold the data.
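
For reference, if I remember correctly the agent-side buffer is sized by the
buffer.channel_size and buffer.buffer_size keys in agent.config (please double-check
the key names against your agent version); the 10*3000 override you describe would
look roughly like this:

    # agent.config -- enlarge the in-memory trace segment buffer (10 channels * 3000 slots each)
    buffer.channel_size=10
    buffer.buffer_size=3000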



> However, after I increased the agent-side trace data buffer to 10*3000, this
> abandoned-segment issue never occurred again.
>
> http-nio-0.0.0.0-9090-exec-23 TraceSegmentServiceClient : One trace
> segment has been abandoned, cause by buffer is full.
>
>
>
> Thanks & Best Regards
>
>
>
> Xiaochao Zhang(James)
>
> DI SW CAS MP EMK DO-CHN
>
> No.7, Xixin Avenue, Chengdu High-Tech Zone
>
> Chengdu, China  611731
>
> Email: [email protected]
>
>
>
