Hi Gustaf

thank you so much for such a comprehensive reply. I feel like I just increased my IQ by a few percentage points! A couple of follow-up questions, if I may:
1. "sockAccept accepted 2 connections" - this was very useful to know, I will start keeping an eye out for values > 10. It would be very useful to be able to configure the threshold at which it appears in the log, as I mentioned we see a huge amount of these (over 2000 in the last few hours on our test server, almost all of them for 2 connections). Also regarding configuring a second driver thread - just to be clear are you referring to this mail you sent back in 2016 https://sourceforge.net/p/naviserver/mailman/message/35502664/ i.e. enable reuseport and set driverthreads = 2? It's unclear to me if this has benefit when running Naviserver in a Docker container - definitely beyond my level of comprehension. See https://github.com/moby/moby/issues/7536#issuecomment-1039443599 for discussion of SO_REUSEPORT and Docker. 2. "ns_cache create entry collision cache util_memoize key '__util_memoize_installed_p" - again very helpful. We actually have been partitioning our caches, and try not to add anything new to util-memoize cache. In fact, this particular message about '__util_memoize_installed_p' is the only cache collision notice we are seeing. Your comment helped me to track down to issue to the "util_memoize_initialized_p" and "apm_package_installed_p" procs, and I can see the new improved versions in later OpenACS versions, so I will try those out. 3. "exiting: exceeded max connections per thread" - actually we used to have "connsperthread" set to 10000, but I reduced to 1000 when we moved to Linux, as I noticed that was the value in latest OpenACS config, even though the default is still 10000 https://bitbucket.org/naviserver/naviserver/src/main/openacs-config.tcl Is 1000 the recommended value for OpenACS? thanks again Brian ________________________________ From: Gustaf Neumann <neum...@wu.ac.at> Sent: Wednesday 8 November 2023 9:00 am To: naviserver-devel@lists.sourceforge.net <naviserver-devel@lists.sourceforge.net> Subject: Re: [naviserver-devel] A selection of messages appearing in our error logs Hi Brian, In general, we try to follow with the logging severities the guideline [1]. "Notice" is informational, "warning" is something unexpected/unwanted (hitting some limits, got some data that looks like an attack, ...), "error" is something which should be looked at by an engineer. [1] https://openacs.org/xowiki/logging-conventions below, I comment inline. On 07.11.23 18:28, Brian Fenton wrote: Hi with the goal of improving signal-to-noise in our error logs, and also to increase my understanding of Naviserver internals, I have a few questions about various different messages we are seeing regularly in our logs. 1. [06/Nov/2023:14:58:08][77320.7f7b9a7fd640][-driver:nsssl:0-] Notice: ... sockAccept accepted 2 connections this is by far the most common entry in our logs - we see a LOT of these. Are they merely informative or indicative of some tuning action we could take? I don't see any way to disable them. I would love to remove these. This indicates some stress situation of the server, where multiple requests come in at the same time. The value of 2 is not of a concern, if you see values like 10 or 20, you should consider a change in your configuration (e.g., configuring a second driver thread). It would be possible to define a configuration value for the network driver to set the threshold when the notice is written to the log file. 1. 2. 
2. [06/Nov/2023:14:57:28][77320.7f7b63fff640][-conn:openacs:default:6:6292-] Notice: ns_cache create entry collision cache util_memoize key '__util_memoize_installed_p', no timeout

We also see quite a few of these, always for '__util_memoize_installed_p', and always with 'no timeout'. From looking at cache.c, I see that this means that NaviServer failed to create the cache entry. I checked the 'util_memoize' cache utilization in case the cache was full, but it's quite low. What other cause could there be for this?

This means that concurrent operations are executed to obtain the same cache value, and that the second request (here conn:openacs:default:6:6292) has to wait until the locking process has finished. The problem can become serious when the locking process "hangs", i.e. takes a very long time to finish. In that case, more and more caching requests for the same entry pile up, potentially until a restart, since no timeout was specified. The warning was introduced to give the developer a hint why the server is suddenly hanging (before the message was introduced, the cause was very hard to determine).

This message tells us something else as well:

* It looks as if the installation is based on a very old version of openacs-core (including, in particular, acs-tcl). For at least 9 years, the code has performed this test only at startup. Upgrading to a recent version will improve performance and increase security.

* The message also tells me that the application puts stress on the util_memoize cache. Old versions of OpenACS used this "util_memoize" cache as a kitchen-sink cache for everything. I have seen instances with 300K+ entries in this cache. The size itself is not the problem, but there were/are many operations in OpenACS that obtain all keys from the cache. This means that a lock is acquired, a list with 300K+ entries is built and handed to the interpreter, and only then is the cache lock released. In the meantime, *all* caching requests for this cache have to be halted, causing degraded performance and limited concurrency in high-load situations. I would not be surprised if you see high lock times on this cache.

There are caches aside from the util_memoize cache where cache collisions might happen. See e.g. [2] for more background. Cache partitioning is in this situation the instrument for scaling.

[2] https://openacs.org/xowiki/cache-size-tuning
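To make this concrete, here is a minimal sketch of the two instruments mentioned above - a dedicated (partitioned) cache and a per-call timeout. The cache name, size, key, and computation are placeholders, not OpenACS code:

    # At startup: create a dedicated 10 MB cache instead of putting
    # everything into the kitchen-sink util_memoize cache.
    ns_cache_create my_package_cache [expr {10 * 1024 * 1024}]

    # At request time: compute-or-fetch with a 2 second wait limit.
    # If another thread holds the lock on this entry for longer than
    # that, the call errors out instead of queuing indefinitely.
    set value [ns_cache_eval -timeout 2 -expires 600 -- my_package_cache my_key {
        expensive_computation    ;# placeholder for the cached work
    }]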
3. These 5 appear usually as a group:

[07/Nov/2023:13:46:07][38.7f352cff9640][-conn:openacs:default:3:4088-] Notice: exiting: exceeded max connections per thread
[07/Nov/2023:13:46:07][38.7f3525b84640][-driver:nsssl:0-] Notice: NsEnsureRunningConnectionThreads wantCreate 1 waiting 1 idle 0 current 2
[07/Nov/2023:13:46:07][38.7f352d7fa640][-conn:openacs:default:5:0-] Notice: start update interpreter openacs to epoch 1, concurrent 1
[07/Nov/2023:13:46:08][38.7f352d7fa640][-conn:openacs:default:5:0-] Notice: update interpreter openacs to epoch 1 done, trace none, time 0.373006 secs concurrent 1
[07/Nov/2023:13:46:08][38.7f352d7fa640][-conn:openacs:default:5:0-] Notice: thread initialized (0.388878 secs)

We see these pretty regularly, over 50 per day on a local development server. "maxconnections" is set to the default value, which is 100, I believe. "maxthreads" and "minthreads" are also set to default values. What could be causing these so regularly on a system with 1 logged-in user?

The message tells you that a connection thread has reached its end of life, as defined by the "connsperthread" parameter. When this happens, the thread is stopped and, depending on the configuration parameters, a new thread is created either immediately or on demand. The reason for stopping connection threads from time to time is to clean up interpreter-specific data accumulated by some applications. With modern versions of OpenACS and with well-behaved packages this is not an issue, and the value can be increased substantially. The relevant counter does not depend on the number of different users, but on the number of requests handled by a particular connection thread.

4. These 3 are related to each other, I presume:

[07/Nov/2023:15:36:24][38.7f3536b86640][-conn:openacs:default:28:28082-] Notice: nsdb: closing old handle in pool 'pool1'
[07/Nov/2023:15:45:39][38.7f3536b86640][-conn:openacs:default:31:31287-] Notice: nsdb: closing idle handle in pool 'pool1'
[07/Nov/2023:15:45:39][38.7f3536b86640][-conn:openacs:default:31:31287-] Notice: dbdrv: opening database 'nsoracle:192.168.1.1/ORACLE'

These are quite frequent too. Am I correct in thinking that we can tune these with the pool "maxidle" and "maxopen" params mentioned here: https://naviserver.sourceforge.io/n/nsdb/files/ns_db.html ? Is there any reason not to turn off the automatic closing feature?

While there might be some reasons for keeping the number of db connections low (e.g. many servers using the same database with a limited number of client connections), for at least 20 years there has been no technical reason with the common db drivers not to turn off the automatic closing feature. Reconnecting to the database can be costly when the db is running on a different host and/or the connection is encrypted.
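To summarize the tuning knobs from the last two answers, a sketch of the relevant sections in the nsd configuration file - the pool name and values are illustrative, not recommendations:

    # Connection threads: recycle less aggressively (the server default
    # is 10000; the current openacs-config.tcl ships 1000).
    ns_section ns/server/${server}
    ns_param connsperthread 10000

    # Database pool: a value of 0 disables the time-based closing of
    # handles, i.e., no more "closing old/idle handle" notices.
    ns_section ns/db/pool/pool1
    ns_param maxidle 0    ;# seconds a handle may sit idle (0 = never close)
    ns_param maxopen 0    ;# seconds a handle may stay open (0 = unlimited)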
hope this helps a little

all the best
-g