Hi Gustaf

thank you so much for such a comprehensive reply. I feel like I just increased
my IQ by a few points! A couple of follow-up questions, if I may:

1. "sockAccept accepted 2 connections" - this was very useful to know, I will 
start keeping an eye out for values > 10. It would be very useful to be able to 
configure the threshold at which it appears in the log, as I mentioned we see a 
huge amount of these (over 2000 in the last few hours on our test server, 
almost all of them for 2 connections).
Also, regarding configuring a second driver thread - just to be clear, are you
referring to the mail you sent back in 2016,
https://sourceforge.net/p/naviserver/mailman/message/35502664/ , i.e. enable
reuseport and set driverthreads = 2? It's unclear to me whether this has any
benefit when running Naviserver in a Docker container - definitely beyond my
level of comprehension. See
https://github.com/moby/moby/issues/7536#issuecomment-1039443599 for a
discussion of SO_REUSEPORT and Docker.
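
For the benefit of the archive, my reading of that 2016 mail is a driver
configuration roughly like the sketch below (parameter names as I understand
them, untested on our side, and the port is only an example) - please correct
me if I have it wrong:

    ns_section ns/server/${server}/module/nsssl
    ns_param port          443
    ns_param reuseport     true    ;# ask the OS for SO_REUSEPORT on the listening socket
    ns_param driverthreads 2       ;# run two driver threads instead of one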

2. "ns_cache create entry collision cache util_memoize key 
'__util_memoize_installed_p" - again very helpful. We actually have been 
partitioning our caches, and try not to add anything new to util-memoize cache. 
In fact, this particular message about  '__util_memoize_installed_p' is the 
only cache collision notice we are seeing. Your comment helped me to track down 
to issue to the "util_memoize_initialized_p" and "apm_package_installed_p" 
procs, and I can see the new improved versions in later OpenACS versions, so I 
will try those out.

3. "exiting: exceeded max connections per thread" - actually we used to have 
"connsperthread" set to 10000, but I reduced to 1000 when we moved to Linux, as 
I noticed that was the value in latest OpenACS config, even though the default 
is still 10000 
https://bitbucket.org/naviserver/naviserver/src/main/openacs-config.tcl  Is 
1000 the recommended value for OpenACS?

thanks again
Brian
________________________________
From: Gustaf Neumann <neum...@wu.ac.at>
Sent: Wednesday 8 November 2023 9:00 am
To: naviserver-devel@lists.sourceforge.net 
<naviserver-devel@lists.sourceforge.net>
Subject: Re: [naviserver-devel] A selection of messages appearing in our error 
logs


Hi Brian,


In general, we try to follow the guideline [1] for the logging severities.
"Notice" is informational, "warning" is something unexpected/unwanted (hitting
some limits, data that looks like an attack, ...), and "error" is something
that should be looked at by an engineer.


    [1] https://openacs.org/xowiki/logging-conventions
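
For instance, in terms of ns_log calls the intent maps roughly to the following
(the messages are just illustrative):

    ns_log notice  "sockAccept accepted 2 connections"   ;# informational
    ns_log warning "request exceeded configured limit"   ;# unexpected/unwanted
    ns_log error   "could not open database connection"  ;# needs an engineer's attention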


Below, I comment inline.



On 07.11.23 18:28, Brian Fenton wrote:
Hi

with the goal of improving signal-to-noise in our error logs, and also to
increase my understanding of Naviserver internals, I have a few questions about
various messages we are seeing regularly in our logs.


  1.  [06/Nov/2023:14:58:08][77320.7f7b9a7fd640][-driver:nsssl:0-] Notice: ... 
sockAccept accepted 2 connections
this is by far the most common entry in our logs - we see a LOT of these. Are 
they merely informative or indicative of some tuning action we could take? I 
don't see any way to disable them. I would love to remove these.


This indicates a stress situation on the server, where multiple requests come
in at the same time. The value of 2 is not a concern; if you see values like 10
or 20, you should consider a change in your configuration (e.g., configuring a
second driver thread). It would be possible to define a configuration value for
the network driver to set the threshold at which the notice is written to the
log file.


  2.  [06/Nov/2023:14:57:28][77320.7f7b63fff640][-conn:openacs:default:6:6292-]
Notice: ns_cache create entry collision cache util_memoize key 
'__util_memoize_installed_p', no timeout
We also see quite a few of these, always for '__util_memoize_installed_p', and
always with 'no timeout'. From looking at cache.c I see that this means that
Naviserver failed to create the cache entry. I checked the 'util_memoize' cache
utilization in case the cache was full, but it's quite low. What other cause
could there be for this?

This means that concurrent operations are executed to obtain the same cache
value, and that the second request (here conn:openacs:default:6:6292) has to
wait until the locking process has finished. The problem can become serious
when the locking process "hangs", i.e. takes a very long time to finish. This
means that more and more caching requests for the same entry will pile up,
potentially until a restart, since no timeout was specified. The warning was
introduced to give the developer a hint as to why the server is suddenly
hanging (before the message was introduced, the cause was very hard to
determine).
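
To make this concrete: in NaviServer terms such a lookup is an ns_cache_eval
call, and when a -timeout is supplied, a request waiting on a concurrent
computation gives up after that period instead of blocking indefinitely. A
minimal sketch (cache, key and proc names are just for illustration):

    # later requests for the same key wait at most 2 seconds for the first
    # request to finish computing the value; -expires limits the lifetime
    ns_cache_eval -timeout 2.0 -expires 600 -- util_memoize my_expensive_key {
        compute_expensive_value    ;# placeholder for the real computation
    }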


This message also tells something else:
* It looks as if the installation is based on a very old version of
openacs-core (in particular acs-tcl). For at least 9 years, the code has
performed this test only at startup. Upgrading to a recent version will improve
performance and increase security.


* The message also tells me that the application puts stress on the
util-memoize cache. Old versions of OpenACS used this "util-memoize" cache as a
kitchen-sink cache for everything. I've seen instances with 300K+ entries in
this cache. The size itself is not the problem, but there were/are many
operations in OpenACS that obtain all keys from the cache. This means that a
lock is acquired, a list with 300K+ entries is built and handed to the
interpreter before the cache is released. In the meantime, *all* caching
requests for this cache have to be halted, causing degraded performance and
limited concurrency in high-load situations. I would not be surprised if you
see high lock times on this cache.


There are caches aside from the util-memoize cache where cache collisions might
happen; see e.g. [2] for more background. In this situation, cache partitioning
is the instrument for scaling.


   [2] https://openacs.org/xowiki/cache-size-tuning
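
As a sketch of what partitioning means in practice: instead of putting yet
another kind of value into util_memoize, a package can create its own cache at
startup and use it for its keys, so its collisions and lock times stay
independent of util-memoize (names and sizes below are just examples):

    # at startup / package load time: create a dedicated 2MB cache
    ns_cache_create my_package_cache [expr {2 * 1024 * 1024}]

    # at request time: look up / compute within the dedicated cache
    set value [ns_cache_eval -expires 300 -- my_package_cache user:$user_id {
        load_user_profile $user_id    ;# placeholder for the real computation
    }]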


  3.  These 5 usually appear as a group:
[07/Nov/2023:13:46:07][38.7f352cff9640][-conn:openacs:default:3:4088-] Notice: 
exiting: exceeded max connections per thread
[07/Nov/2023:13:46:07][38.7f3525b84640][-driver:nsssl:0-] Notice: 
NsEnsureRunningConnectionThreads wantCreate 1 waiting 1 idle 0 current 2
[07/Nov/2023:13:46:07][38.7f352d7fa640][-conn:openacs:default:5:0-] Notice: 
start update interpreter openacs to epoch 1, concurrent 1
[07/Nov/2023:13:46:08][38.7f352d7fa640][-conn:openacs:default:5:0-] Notice: 
update interpreter openacs to epoch 1 done, trace none, time 0.373006 secs 
concurrent 1
[07/Nov/2023:13:46:08][38.7f352d7fa640][-conn:openacs:default:5:0-] Notice: 
thread initialized (0.388878 secs)
We see these pretty regularly, over 50 per day on a local development server.
"maxconnections" is set to the default value, which is 100, I believe.
"maxthreads" and "minthreads" are also set to default values. What could be
causing these so regularly on a system with 1 logged-in user?


The message tells that a connection thread has reached its end of life, as
defined by the "connsperthread" parameter. When this happens, the thread is
stopped and, depending on the configuration parameters, a new thread is created
either immediately or on demand. The reason for stopping connection threads
from time to time is to clean up interpreter-specific data that is accumulated
by some applications. With modern versions of OpenACS and with well-behaved
packages, this is not an issue, and the value can be increased substantially.


The relevant counter does not depend on the number of different users, but on 
the number of requests handled by a particular connection thread.
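
For reference, the parameter lives in the server section of the configuration
file, e.g. (the value here is just an example, not a recommendation):

    ns_section ns/server/${server}
    ns_param connsperthread 10000   ;# a connection thread exits after this many requests

With a recent OpenACS and well-behaved packages it can, as said, be increased
substantially.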


  4.  These 3 are related to each other, I presume:
[07/Nov/2023:15:36:24][38.7f3536b86640][-conn:openacs:default:28:28082-] 
Notice: nsdb: closing old handle in pool 'pool1'
[07/Nov/2023:15:45:39][38.7f3536b86640][-conn:openacs:default:31:31287-] 
Notice: nsdb: closing idle handle in pool 'pool1'
[07/Nov/2023:15:45:39][38.7f3536b86640][-conn:openacs:default:31:31287-] 
Notice: dbdrv: opening database 'nsoracle:192.168.1.1/ORACLE'
These are quite frequent too. Am I correct in thinking that we can tune these 
with pool "maxidle" and "maxopen" params mentioned here 
https://naviserver.sourceforge.io/n/nsdb/files/ns_db.html ? Is there any reason 
not to turn off the automatic closing feature?

There might be some reasons for keeping the number of db connections low (e.g.
many servers using the same database, which allows only a limited number of
client connections, etc.), but for at least 20 years there has been no
technical reason with the common db drivers not to turn off the automatic
closing feature. Reconnecting to the database can be costly when the db is
running on a different host and/or the connection is via HTTPS.
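
For reference, the relevant knobs are per pool in the configuration file; to my
knowledge, setting maxidle and maxopen to 0 turns the automatic closing off
(the values below are placeholders based on your log line):

    ns_section ns/db/pool/pool1
    ns_param driver      nsoracle
    ns_param connections 15
    ns_param datasource  192.168.1.1/ORACLE
    ns_param maxidle     0   ;# 0 = never close a handle just because it is idle
    ns_param maxopen     0   ;# 0 = never close a handle just because it is old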


hope this helps a little


all the best

-g