Yes, our version is 8.1.0, and we have done some customized development on top of it, so upgrading may cost us more effort. However, we will investigate the new version as soon as possible and try to adopt it as soon as we can.
Thank you very much!

On 2021-09-13 19:23:31, "Sheng Wu" <wu.sheng.841...@gmail.com> wrote:
>I think you are using an old release. After 8.7.0, many things were changed to
>improve performance, and far fewer resources are required.
>
>dafang <13240156...@163.com> wrote on Monday, September 13, 2021 at 7:18 PM:
>>
>> OK, I think I have found the reason, and I will share it with you.
>> I found that if I set the ES bulk size to 5, the ES "request too large" error
>> never appears again, but at the same time the gRPC server reports errors such
>> as "cancelled before receiving half close", which stops the sw-agent from
>> sending any data (traces or JVM metrics) to the server. It seems we need to
>> find a balancing point between the gRPC receive speed and the ES write speed.
>>
>> On 2021-09-13 17:45:40, "Sheng Wu" <wu.sheng.841...@gmail.com> wrote:
>> >Unknown means unknown, I am afraid.
>> >I can't explain it. Firewall, proxy, security policy, etc. It could be any
>> >of them, or something else.
>> >
>> >Sheng Wu 吴晟
>> >Twitter, wusheng1108
>> >
>> >dafang <13240156...@163.com> wrote on Monday, September 13, 2021 at 4:37 PM:
>> >>
>> >> Hello Wu. After checking, I found some error messages in my
>> >> skywalking-agent logs, such as "Send UpstreamSegment to collector fail with
>> >> a grpc internal exception.
>> >> org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException:
>> >> UNAVAILABLE: Network closed for unknown reason".
>> >> How should I interpret it?
>> >>
>> >> At 2021-09-13 15:05:24, "Sheng Wu" <wu.sheng.841...@gmail.com> wrote:
>> >> >(1) All data in that bulk (an ElasticSearch concept, see their docs) will
>> >> >be lost, yes.
>> >> >(2) This only means your agent got disconnected from the server
>> >> >unexpectedly. It doesn't tell you the reason why.
>> >> >
>> >> >About what you described in Chinese: first of all, it is better to keep
>> >> >the Chinese and English consistent. Don't put more information on one
>> >> >side; it is confusing.
>> >> >Why the agent stays disconnected forever can't be determined from what
>> >> >you have provided.
>> >> >Auto reconnecting is working normally AFAIK.
>> >> >
>> >> >Sheng Wu 吴晟
>> >> >Twitter, wusheng1108
>> >> >
>> >> >dafang <13240156...@163.com> wrote on Monday, September 13, 2021 at 2:58 PM:
>> >> >>
>> >> >> And now I have two questions:
>> >> >> 1. If this error exists, will all traces and JVM metrics be lost?
>> >> >> 2. If the server logs contain messages like
>> >> >> "org.apache.skywalking.oap.server.receiver.trace.provider.handler.v8.grpc.TraceSegmentReportServiceHandler
>> >> >> - 86 [grpcServerPool-1-thread-7] ERROR [] - CANCELLED: cancelled before
>> >> >> receiving half close
>> >> >> io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving
>> >> >> half close",
>> >> >> will this cause traces or JVM metrics to be lost?
>> >> >>
>> >> >> To explain in Chinese (translated): I now have more than 100 machines in
>> >> >> production. It often happens that an instance itself is healthy, yet its
>> >> >> trace metrics or JVM metrics are lost and never show up again unless the
>> >> >> service is restarted. Could the two situations I listed above cause what
>> >> >> I am seeing?
>> >> >>
>> >> >> On 2021-09-13 14:50:14, "Sheng Wu" <wu.sheng.841...@gmail.com> wrote:
>> >> >> >That error does matter. An HTTP "request too large" response will make
>> >> >> >ElasticSearch reject your bulk insert, which causes data loss.
>> >> >> >
>> >> >> >Sheng Wu 吴晟
>> >> >> >Twitter, wusheng1108
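
[Adding a note inline here: the bulk settings being discussed live in the OAP's
application.yml, under the elasticsearch7 storage section. This is only a rough
sketch from memory, so the exact keys and defaults may differ between releases;
please check the application.yml shipped with your version.

    storage:
      selector: ${SW_STORAGE:elasticsearch7}
      elasticsearch7:
        clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
        # Flush a bulk once this many index requests have accumulated.
        # Lowering it keeps each bulk HTTP request smaller, at the cost of
        # many more, smaller requests.
        bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000}
        # Also flush on a timer, even if bulkActions has not been reached.
        flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10}
        # How many bulk requests may be in flight at the same time.
        concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2}

Setting the bulk size very low (for example 5) does keep each request under the
ES limit, but then the writer seems to fall behind the gRPC receivers, which
looks like the balancing problem I described above.]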
>> >> >> >
>> >> >> >dafang <13240156...@163.com> wrote on Monday, September 13, 2021 at 2:23 PM:
>> >> >> >>
>> >> >> >> Hi SkyWalking dev team:
>> >> >> >> In our prod env, I found that traces and JVM metrics are lost after
>> >> >> >> some services start, while the agent logs show no error info at all.
>> >> >> >> Only the server log shows "Es 413 request too large". Will this
>> >> >> >> problem cause complete data loss?
>> >> >> >>
>> >> >> >> Let me describe it again (originally in Chinese): our online service
>> >> >> >> cluster has 15 machines. After integrating SkyWalking, on some of
>> >> >> >> them (about 5-6), the trace metrics, the JVM metrics, or both
>> >> >> >> disappear after a while. The service itself keeps serving traffic;
>> >> >> >> only the monitoring data is gone. After investigation we found no
>> >> >> >> error at all in the agent log, and only found some "413 request too
>> >> >> >> large" ES errors in the server-side logs. My question is: once this
>> >> >> >> problem makes the trace or JVM metrics fail to be written, will they
>> >> >> >> never be collected and stored again?
>> >> >> >>
>> >> >> >> Waiting for your help.
>> >> >> >> Yours,
>> >> >> >> 大方
>> >> >> >> 2021.09.13
>--
>Sheng Wu 吴晟
>
>Apache SkyWalking
>Apache Incubator
>Apache ShardingSphere, ECharts, DolphinScheduler podlings
>Zipkin
>Twitter, wusheng1108
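
P.S. On the "413 request too large" error itself: Elasticsearch rejects any HTTP
request body larger than its http.max_content_length setting (100mb by default),
and a bulk request over that limit comes back as 413. Besides shrinking the OAP
bulk, the limit can also be raised on the Elasticsearch side. A minimal sketch of
elasticsearch.yml, assuming a plain ES 7.x deployment (adjust the value to your
environment):

    # elasticsearch.yml
    # Maximum allowed HTTP request body size; bulk requests above this are
    # rejected with HTTP 413.
    http.max_content_length: 200mb

Either way, as Sheng Wu pointed out, a rejected bulk is simply dropped, so the
real goal is to keep every bulk request under whatever limit is configured.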