chewbranca opened a new issue, #5122:
URL: https://github.com/apache/couchdb/issues/5122
While investigating the following stacktrace, I noticed the fabric_rpc
timeout was occurring during rexi stream initialization, yet the corresponding
metric from `couch_stats:increment_counter([rexi, streams, timeout,
init_stream])` was not triggering, and it turns out it'll never properly
trigger.
> mfa: fabric_rpc:all_docs/3 exit:timeout
[{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,300}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,241}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,492}]},{couch_mrview,map_fold,3,[{file,"src/couch_mrview.erl"},{line,504}]},{couch_bt_engine,include_reductions,4,[{file,"src/couch_bt_engine.erl"},{line,1155}]},{couch_bt_engine,skip_deleted,4,[{file,"src/couch_bt_engine.erl"},{line,1151}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,1167}]},{couch_btree,stream_kp_node,8,[{file,"src/couch_btree.erl"},{line,1138}]}]
The fundamental problem is that there are two nearly identical rexi stream
initialization functions except one of them does the proper metrics
incrementing and the other function is actually utilized.
## The two functions are:
### 1) `rexi:stream_init/1`
https://github.com/apache/couchdb/blob/main/src/rexi/src/rexi.erl#L165-L179
```erlang
stream_init(Timeout) ->
case sync_reply(rexi_STREAM_INIT, Timeout) of
rexi_STREAM_START ->
ok;
rexi_STREAM_CANCEL ->
exit(normal);
timeout ->
couch_stats:increment_counter(
[rexi, streams, timeout, init_stream]
),
exit(timeout);
Else ->
exit({invalid_stream_message, Else})
end.
```
and
### 2) `rexi:init_stream/1`
https://github.com/apache/couchdb/blob/main/src/rexi/src/rexi.erl#L292-L303
```erlang
init_stream(Timeout) ->
case sync_reply(rexi_STREAM_INIT, Timeout) of
rexi_STREAM_START ->
put(rexi_STREAM_INITED, true),
ok;
rexi_STREAM_CANCEL ->
exit(normal);
timeout ->
exit(timeout);
Else ->
exit({invalid_stream_message, Else})
end.
```
As you can see these two functions are nearly identical except
`rexi:stream_init` increments the counter on `timeout` and `rexi:init_stream`
does a `put(rexi_STREAM_INITED, true),` to track stream init status. However,
`rexi:stream_init` is never actually called:
```
chewbranca@Russells-MacBook-Pro-2 db % grep -r rexi:stream_init src
chewbranca@Russells-MacBook-Pro-2 db %
```
and instead the internal function `rexi:init_stream` is utilized by way of
`rexi:maybe_init_stream` in `rexi:stream2` and `rexi:stream_last`.
The result is that we never properly increment the metric indicating
stream_init timeouts.
I think the idea was that `rexi:stream2` has a try-catch block around
`throw:timeout` in
https://github.com/apache/couchdb/blob/main/src/rexi/src/rexi.erl#L248-L252 and
then that'll take care of incrementing the counter, but `maybe_init_stream` is
called _outside_ of that try-catch block
https://github.com/apache/couchdb/blob/main/src/rexi/src/rexi.erl#L241 so it
would never be caught, however, `maybe_init_stream` doesn't _throw_ a timeout,
it does a process exit with `Reason=timeout`
https://github.com/apache/couchdb/blob/main/src/rexi/src/rexi.erl#L300 so
that'll never be caught by a try-catch.
It seems possible we can fix the underlying lack of metrics by way
consolidating `stream_init` and `init_stream` down to a singular function that
both sets the pdict value and increments the counter, but that might result in
a double counting if the old `rexi:stream` is used instead of `rexi:stream2`,
and it doesn't fix the underlying issue that there's various flow control
elements expecting to catch on `throw(timeout)` which is not possible with
`exit(timeout)`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]