wu-sheng commented on issue #13764:
URL: https://github.com/apache/skywalking/issues/13764#issuecomment-4656381707

   Fixed on `master` of skywalking-nodejs in apache/skywalking-nodejs#132 (will 
ship in the next release).
   
   **Root cause.** While the collector is unreachable, the trace report loop 
runs every second and each gRPC call fails *failfast* with `14 UNAVAILABLE … 
connect ECONNREFUSED`. On every tick the agent logged the full error — 
including its multi-KB stack — via `logger.error('Failed to report trace data', 
error)`. Those winston log records accumulated in the logger's internal stream 
buffer (the `WritableState.bufferedRequest` linked list visible in your heap 
dump) faster than the transport could drain them, until the heap was exhausted.
   
   **Fix (apache/skywalking-nodejs#132).**
   - Skip the report attempt while the gRPC channel is not `READY`. gRPC-js 
already reconnects on its own exponential backoff, so we keep the (bounded) 
buffer instead of opening a doomed stream and logging an error every second.
   - Throttle and slim failure logging in the Trace and Heartbeat clients: at 
most one line per 30s carrying a `suppressed` count, reduced to the error 
`code`/`message` so no stack is retained.
   - Fix `SW_AGENT_MAX_BUFFER_SIZE` / `SW_AGENT_TRACE_TIMEOUT` env parsing: 
the raw string was checked with `Number.isSafeInteger`, which is always 
`false` for a string, so the settings never took effect. The buffer cap is 
now actually tunable.
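
   For illustration only, the logging and env-parsing points above can be 
sketched roughly as follows. This is not the code merged in 
apache/skywalking-nodejs#132; `parseIntEnv`, `ThrottledErrorLog`, 
`SlimError`, the injected `sink`, and the injected clock are hypothetical 
names chosen for this sketch, and the 30s default simply mirrors the interval 
mentioned above.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not the agent's actual code.

/** Parse an integer setting from an env string; fall back if unset or invalid. */
export function parseIntEnv(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  // Convert first: Number.isSafeInteger returns false for any string input.
  const n = Number(raw);
  return Number.isSafeInteger(n) ? n : fallback;
}

/** Slim log record: only code/message plus a suppressed count, no stack. */
export interface SlimError {
  code: number | string | undefined;
  message: string;
  suppressed: number;
}

/** Emits at most one slim record per interval and counts what it drops. */
export class ThrottledErrorLog {
  private readonly sink: (record: SlimError) => void;
  private readonly intervalMs: number;
  private readonly now: () => number;
  private lastEmit = Number.NEGATIVE_INFINITY;
  private suppressed = 0;

  constructor(
    sink: (record: SlimError) => void,
    intervalMs: number = 30000,
    now: () => number = Date.now,
  ) {
    this.sink = sink;
    this.intervalMs = intervalMs;
    this.now = now;
  }

  report(err: unknown): void {
    const t = this.now();
    if (t - this.lastEmit < this.intervalMs) {
      // Inside the quiet window: count the failure, retain nothing else.
      this.suppressed += 1;
      return;
    }
    const e = err as { code?: number | string; message?: unknown } | null | undefined;
    this.sink({
      code: e?.code,
      message: typeof e?.message === "string" ? e.message : String(err),
      suppressed: this.suppressed,
    });
    this.lastEmit = t;
    this.suppressed = 0;
  }
}
```

   In a sketch like this, a report loop would call `report(error)` on every 
failed tick and pass something like `(r) => logger.error('Failed to report 
trace data', r)` as the sink, so only the small record reaches the logger.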
   
   Thanks @maxming2333 for the detailed heap dump and for #130 — your 
root-cause analysis (the gRPC `pickQueue` retaining segment objects until the 
deadline, plus the `segments-sent` limiter behavior) was spot-on and matches 
the connectivity gate we merged.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
