nickva commented on issue #5886:
URL: https://github.com/apache/couchdb/issues/5886#issuecomment-3923505944

   This is pretty strange...
   
   The stats definitions for each application (in this case `mem3`) are 
reloaded every 10 minutes. I wonder if that is related to this.
   
   The path they are loaded from is determined by this call (you can try it in 
remsh as well):
   
   ```
   > code:priv_dir(mem3).
   ```
   
   In your case, based on the output of command1.txt, it would be:
   
   ```
   /usr/local/apache-couchdb-fips/couchdb/bin/../lib/mem3-3.5.0/priv
   ```
   
   Would it be possible that something on your system is periodically altering 
those files while the couchdb service is running, maybe changing their 
permissions or updating them somehow? Or maybe something is causing file 
system read failures (all file descriptors used up, the volume unmounted)?
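   
   One way to check for that from remsh is to stat and parse the definitions 
file directly. This is only a sketch: the file name `stats_descriptions.cfg` is 
an assumption about what couch_stats reads from the priv directory, and 
`CheckStatsFile` is a hypothetical helper name.
   
   ```erlang
   %% Hypothetical helper: stat and parse an application's stats file.
   %% Assumption: the definitions are in priv/stats_descriptions.cfg.
   CheckStatsFile = fun(App) ->
       case code:priv_dir(App) of
           {error, bad_name} ->
               {error, {no_priv_dir, App}};
           PrivDir ->
               Path = filename:join(PrivDir, "stats_descriptions.cfg"),
               case file:read_file_info(Path) of
                   {error, StatErr} ->
                       %% e.g. enoent if the file or the volume is gone
                       {error, {stat_failed, Path, StatErr}};
                   {ok, _Info} ->
                       case file:consult(Path) of
                           {ok, Terms} ->
                               {ok, Path, length(Terms)};
                           {error, ReadErr} ->
                               %% e.g. eacces, emfile, or a parse error tuple
                               {error, {consult_failed, Path, ReadErr}}
                       end
               end
       end
   end.
   CheckStatsFile(mem3).
   ```
   
   Running it a few times across a 10 minute window should show whether the 
result ever changes from `{ok, ...}` to one of the error tuples.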
   
   If the periodic reloading is involved, the error may be absent at first, 
then show up for an interval of 10 minutes, and then perhaps disappear 
again.
   
   When the couch_stats application reloads metrics, it uses this function 
call (you can try it in remsh):
   
   ```
   > couch_stats_util:load_metrics_for_applications().
   ```
   
   That returns the map of all metrics. It should look like:
   
   ```
   #{[mango,query_time] =>
         {histogram,<<"length of time processing a mango query">>},
     [couchdb,document_inserts] =>
         {counter,<<"number of documents inserted">>},
     [fsync,time] => {histogram,<<"microseconds to call fsync">>},
     [couch_log,level,report_error] =>
         {counter,<<"number of failed report messages">>},
   ...
   ```
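   
   To narrow that map down to a single prefix such as `mem3`, something like 
the following should work. `MetricsWithPrefix` is a hypothetical helper name, 
and it assumes every key is a list of atoms as in the sample output above.
   
   ```erlang
   %% Hypothetical helper: keep only metrics whose name starts with Prefix.
   MetricsWithPrefix = fun(Prefix, Metrics) ->
       maps:filter(
           fun([First | _], _TypeAndDesc) -> First =:= Prefix;
              (_Other, _TypeAndDesc) -> false
           end,
           Metrics
       )
   end.
   %% On a CouchDB node:
   %% maps:keys(MetricsWithPrefix(mem3, couch_stats_util:load_metrics_for_applications())).
   ```
   
   If `[mem3,shard_cache,hit]` is not among the returned keys while the error 
is occurring, that would suggest the mem3 definitions were not picked up by 
the last reload.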
   
   If you can catch the system at a time when the error is thrown, try running 
some of the remsh commands above; we might learn the reason.
   
   Also, if the system is running properly you should be able to sample a 
metric in remsh:
   
   ```
   > couch_stats:sample([mem3, shard_cache, hit]).
   56
   ```
   
   Or, if the metric is missing, for example after I edited the file on a 
running instance like this (renaming `hit` to `hitx`) and waited 10 minutes:
   
   ```
   {[mem3, shard_cache, hitx], [
       {type, counter},
       {desc, <<"number of shard cache hits">>}
   ]}.
   ```
   
   It would throw the error in remsh as well, as the metric is now undefined:
   
   ```
   > couch_stats:sample([mem3, shard_cache, hit]).
   * exception throw: unknown_metric
       in function  couch_stats_util:sample/4 (src/couch_stats_util.erl, line 189)
   ```
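   
   Since the exception is a `throw`, it can be caught in remsh and recorded 
with a timestamp, which may help line failures up with the 10 minute reload 
interval. A small sketch: `Probe` is a hypothetical helper, not part of 
couch_stats, and it takes the sampling function as an argument so it can be 
tried out without a running node.
   
   ```erlang
   %% Hypothetical helper: sample a metric and tag the result with local time.
   Probe = fun(SampleFun, Metric) ->
       Result =
           try
               {ok, SampleFun(Metric)}
           catch
               throw:unknown_metric -> {error, unknown_metric}
           end,
       {calendar:local_time(), Metric, Result}
   end.
   %% On a CouchDB node:
   %% Probe(fun couch_stats:sample/1, [mem3, shard_cache, hit]).
   ```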
   
   You can also force a reload by hand in remsh:
   
   ```
   > couch_stats:sample([mem3, shard_cache, hit]).
   0
   ```
   
   Notice that after a successful reload the metrics are also reset back to 0. 
If you notice your metrics periodically getting reset to 0 without you 
restarting your CouchDB instance, that could be an indication of what is 
happening.

