Re: [dbcp] Optimal defaults for DSpace

2021-01-01 Thread Phil Steitz




On 12/20/20 10:26 PM, Hrafn Malmquist wrote:

Hi Gary

Thanks for taking the time to respond.

I hope you can bear with me as I am still learning about database
connection pooling.

Perhaps I did not ask the question correctly. I am not asking about a site
specific setup but rather what defaults should be shipped with the
software. I am part of the minor version release team.

Currently, the default setup is a DBCP2 v. 2.1.1 connection pool with only
maxWaitMillis, maxIdle and maxTotal configurable in the DSpace configuration
settings, and the default values for these settings set to 5000, 10 and 30
respectively. It's unclear why these defaults were chosen to begin with; git
blame shows they were chosen back in 2015. I don't think a lot of thought
went into
choosing 1) which parameters should be configurable nor 2) what their
defaults should be (or why they should differ from DBCP2 defaults).

DSpace repositories are run by higher education institutions and all sorts
of institutions and organisations involved in research, for instance the
Smithsonian (https://repository.si.edu/). Therefore, although the vast
majority of instances are run by small institutions that get little
traffic, others are likely to receive relatively heavy traffic, from users
and crawlers.

So the idea is to ask the experts what parameters should be configurable
for the average repository admin, keeping in mind that the aim is for
installation and setup to be simple (in effect, what are the "main"
parameters likely to need tweaking) and what should the out-of-the-box
defaults be (if at all different from the DBCP2 defaults).

I am particularly surprised at the low maxWaitMillis chosen. Is that not
likely to cause problems for high traffic sites?


I would say no.  Having threads blocked waiting for connections for 
longer than 5 seconds will likely cause problems in heavily loaded 
applications.  You will end up running out of app server processing 
threads if they are hanging for that long.   If getConnection is taking 
that long, there is likely a problem somewhere in the overall system - 
processing threads holding connections too long, not enough connections, 
database latency, etc.  It all comes down to queuing theory.  If your 
app does not hold connections long and queries are optimized, even a 
relatively small pool can handle decent load.  The key is not to
leave connections open or hold on to them too long.
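
As a rough back-of-the-envelope illustration of the queuing point above, Little's law says the average number of connections in use is the request arrival rate multiplied by the average time a thread holds a connection. The sketch below applies that to a pool capped at 30; the request rate and hold times are invented for illustration and the class name PoolSizing is hypothetical.

```java
// Sketch only: the request rate and hold times below are invented for
// illustration; they are not DSpace or DBCP measurements.
final class PoolSizing {

    /** Little's law: mean connections in use = arrival rate (req/s) x mean hold time (s). */
    static double meanConnectionsInUse(double requestsPerSecond, double holdSeconds) {
        return requestsPerSecond * holdSeconds;
    }

    public static void main(String[] args) {
        int maxTotal = 30;                 // pool cap discussed in this thread
        double requestsPerSecond = 100.0;  // hypothetical load
        double[] holdSeconds = {0.02, 0.1, 0.5};
        for (double hold : holdSeconds) {
            double inUse = meanConnectionsInUse(requestsPerSecond, hold);
            String verdict = inUse < maxTotal ? "fits" : "exceeds the pool";
            System.out.printf("hold=%.2fs -> about %.0f connections busy (%s)%n",
                    hold, inUse, verdict);
        }
    }
}
```

The same load either fits comfortably or swamps the pool depending only on how long each thread holds its connection, which is why hold time matters more than the wait timeout.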


The defaults above look OK to me, though if database connections are not 
in short supply, I would bump maxIdle to 20.  The reason for this is 
that setting it at 10 means that if the number used regularly goes up to 
20+, you will end up with a lot of connection churn.  On the other hand, 
if the usage pattern is spikes now and then followed by long periods of 
lighter load, setting it at 20 will "waste" some connections.  How 
important that "waste" is depends on what else is going on in the DB, 
how many pools are sharing it, etc.
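
The churn argument can be made concrete with a deliberately simplified model: connections are created whenever demand exceeds what is open, and idle connections beyond maxIdle are closed when they are returned. This is a toy simulation, not DBCP code; it ignores minIdle and the idle evictor, assumes demand never exceeds maxTotal, and the demand pattern is made up.

```java
// Toy model of idle-connection churn; NOT the DBCP2 implementation.
// Assumes demand never exceeds maxTotal and ignores minIdle and eviction.
final class IdleChurn {

    /** Counts connections created while concurrent demand follows the given sequence. */
    static int connectionsCreated(int[] demand, int maxIdle) {
        int open = 0;     // connections currently open (in use + idle)
        int created = 0;
        for (int inUse : demand) {
            if (inUse > open) {          // too few open connections: create the shortfall
                created += inUse - open;
                open = inUse;
            }
            int idle = open - inUse;     // everything not in use sits idle
            if (idle > maxIdle) {        // idle connections above maxIdle get closed
                open = inUse + maxIdle;
            }
        }
        return created;
    }

    public static void main(String[] args) {
        // Hypothetical load that regularly swings between 5 and 22 connections in use.
        int[] demand = {22, 5, 22, 5, 22, 5};
        System.out.println("maxIdle=10 created: " + connectionsCreated(demand, 10));
        System.out.println("maxIdle=20 created: " + connectionsCreated(demand, 20));
    }
}
```

With maxIdle at 10 every swing back up has to recreate the connections that were closed on the way down; with maxIdle at 20 they are simply reused, at the cost of holding more idle connections during quiet periods.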


I would recommend upgrading to the latest version compatible with the
version of Tomcat you are running, or simply using the version that ships
with Tomcat (which is generally the latest compatible). Another reason
to upgrade dbcp if you are using it directly is to pick up the fixes in 
the later version of commons pool that it brings in.


For some general info on how dbcp and pool configs work, see [1]. It is 
old, but the basic concepts are still correct.  If you are familiar with 
queuing theory, you can view a pool with n connections as an M/M/n
queue.  What drives everything is request arrival rate and service time, 
which in the case of dbcp is how long an application thread holds a 
connection.   You can observe actual utilization using the JMX interfaces.
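
For readers who want to play with the M/M/n view, here is a small sketch of the standard Erlang C formula, which gives the probability that a request has to wait for a connection and the mean wait. It assumes Poisson arrivals and exponentially distributed hold times, which real traffic only approximates; the class name ErlangC and the load figures in main are illustrative assumptions.

```java
// Erlang C for an M/M/n queue. Sketch only: real workloads are rarely
// exactly Poisson/exponential, and the numbers in main are invented.
final class ErlangC {

    /** Probability that an arriving request must wait for one of n connections. */
    static double probabilityOfWait(int n, double arrivalRate, double holdSeconds) {
        double a = arrivalRate * holdSeconds;   // offered load, in connections
        if (a >= n) {
            return 1.0;                         // overloaded: the queue grows without bound
        }
        double term = 1.0;                      // a^k / k!, starting at k = 0
        double sum = 0.0;
        for (int k = 0; k < n; k++) {
            sum += term;
            term = term * a / (k + 1);
        }
        // term now holds a^n / n!
        double tail = term / (1.0 - a / n);
        return tail / (sum + tail);
    }

    /** Mean time a request spends waiting for a connection, in seconds. */
    static double meanWaitSeconds(int n, double arrivalRate, double holdSeconds) {
        double a = arrivalRate * holdSeconds;
        if (a >= n) {
            return Double.POSITIVE_INFINITY;
        }
        return probabilityOfWait(n, arrivalRate, holdSeconds) * holdSeconds / (n - a);
    }

    public static void main(String[] args) {
        double arrivalRate = 100.0;  // hypothetical requests per second
        double holdSeconds = 0.05;   // hypothetical mean connection hold time
        for (int n : new int[] {8, 30}) {
            System.out.printf("n=%d P(wait)=%.4f meanWait=%.6fs%n", n,
                    probabilityOfWait(n, arrivalRate, holdSeconds),
                    meanWaitSeconds(n, arrivalRate, holdSeconds));
        }
    }
}
```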


Phil

[1] https://www.slideshare.net/psteitz/apachecon2014-pooldbcp


Best regards, Hrafn


[1] :
https://github.com/DSpace/DSpace/blob/250c87dc1604c34e2a963b6804163c73278e9ff7/dspace/config/spring/api/core-hibernate.xml#L41-L48

[2] :
https://github.com/DSpace/DSpace/blob/250c87dc1604c34e2a963b6804163c73278e9ff7/dspace/config/dspace.cfg#L77-L86

On Sun, Dec 20, 2020 at 6:40 PM Gary Gregory  wrote:


Hi,

Each new DBCP release brings fixes, additions,  and other updates, as you
can read in the release notes.

How to best configure DBCP for any given combination of JDBC driver, its
database, and application will be quite variable, which is somewhat out of
scope here IMO.

Gary

On Fri, Dec 18, 2020, 11:15 Hrafn Malmquist 
wrote:


Good day

I'm wondering what are optimal defaults for DSpace, open source digital
repository software aimed especially at  academic, non-profit, and
commercial organizations (see https://duraspace.org/dspace/).

DSpace supports both Postgres and Oracle and recommends Tomcat, Jetty or
Caucho Resin. I suspect 9/10 installations use Tomcat.

DSpace comes packaged with Apache Commons DBCP 2.1.1. DSpace overrides only
three DBCP2 settings with non-default values.

(see [1] and [2])


Re: [dbcp] Optimal defaults for DSpace

2021-01-01 Thread Hrafn Malmquist
>> Hi Gary
>>
>> I have and they don't know. Therefore, we are kind of looking at this
>> afresh.
>>
>> For a web server like this, where there are usually lots of reads and not
>> many writes.
>>
>
>DBCP is agnostic to reading vs. writing, that all happens in SQL as I am
>sure you know ;-)

When I think about it, it's obvious that it doesn't matter what happens
during the connection session.
The fact that I offer that piece of useless information only shows how much
I am struggling to understand what should guide a decision for optimal
defaults.

> Does having defaults:
>> maxWaitMillis = 5000,
>> maxIdle = 10,
>> maxTotal = 30
>>
>> Make more sense than the DBCP2 defaults?
>>
>
>Only if you think so, I'm sorry I can't offer any guidelines for your
>application.

I appreciate that you are hesitant to offer generic advice. Nonetheless you
are clearly an authority in this field being the main committer to the
DBCP2 codebase.

For Tomcat 8 it is explicitly recommended that maxWaitMillis not be set
lower than 10 seconds, preferably 10-15 seconds [1].

Consider Deep Blue, the DSpace institutional repository for the University
of Michigan [2]. Taken at face value, it is likely that this web site gets
high traffic as it is a relatively popular institution with a lot of
content (more than 130k items).

It is likely of course that the db administrator running it knows enough
about connection pooling to calibrate the settings to something more
sensible but as I am sure you understand it would be better if the defaults
that come with DSpace are as close to optimal settings as possible.

Correct me if I'm wrong, but my understanding is that since maxWaitMillis
causes exceptions to be raised on expiry, a codebase that uses a relatively
short setting would need to be defensively coded to handle exceptions very
well. Considering the fragmentary and decentralized way that DSpace has
been developed (the classic open source way) I think it is fair to say that
the codebase isn't very resilient. Therefore, not least in light of the
abovementioned recommendations for Tomcat settings, the optimal generic
setting for maxWaitMillis is at least 1.
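
To make the "defensive coding" concern concrete, here is a minimal sketch of what handling a timed-out borrow might look like. ToyPool is a hypothetical stand-in built on a semaphore, not the DBCP2 API; it assumes the timeout reaches the caller as a SQLException, and the 50 ms wait and the "503 busy" response are illustrative choices only.

```java
import java.sql.SQLException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy stand-in for a connection pool with a bounded wait; NOT the DBCP2 API. */
final class ToyPool {
    private final Semaphore permits;
    private final long maxWaitMillis;

    ToyPool(int maxTotal, long maxWaitMillis) {
        this.permits = new Semaphore(maxTotal, true);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Waits up to maxWaitMillis for a free slot, then fails with SQLException. */
    void borrow() throws SQLException {
        try {
            if (!permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS)) {
                throw new SQLException("Timed out waiting for a connection");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new SQLException("Interrupted while waiting for a connection", e);
        }
    }

    void giveBack() {
        permits.release();
    }

    /** Caller-side handling: pool exhaustion becomes a controlled failure. */
    static String handleRequest(ToyPool pool) {
        try {
            pool.borrow();
        } catch (SQLException e) {
            return "503 busy";     // fail fast instead of hanging the request thread
        }
        try {
            return "200 ok";       // real code would run its queries here
        } finally {
            pool.giveBack();       // always return the connection
        }
    }

    public static void main(String[] args) throws SQLException {
        ToyPool pool = new ToyPool(1, 50);
        pool.borrow();                            // simulate a long-held connection
        System.out.println(handleRequest(pool));  // pool is exhausted
        pool.giveBack();
        System.out.println(handleRequest(pool));  // pool has capacity again
    }
}
```

Whatever the timeout value, the caller needs some version of the catch and the finally blocks; a longer maxWaitMillis only changes how long the request thread is parked before it reaches them.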

1 -
https://tomcat.apache.org/tomcat-8.0-doc/jndi-datasource-examples-howto.html#Intermittent_Database_Connection_Failures
2 - https://deepblue.lib.umich.edu/

On Thu, Dec 31, 2020 at 5:41 PM Gary Gregory  wrote:

> On Thu, Dec 31, 2020 at 11:55 AM Hrafn Malmquist <
> hrafn.malmqu...@gmail.com>
> wrote:
>
> > Hi Gary
> >
> > I have and they don't know. Therefore, we are kind of looking at this
> > afresh.
> >
> > For a web server like this, where there are usually lots of reads and not
> > many writes.
> >
>
> DBCP is agnostic to reading vs. writing, that all happens in SQL as I am
> sure you know ;-)
>
>
> > Does having defaults:
> > maxWaitMillis = 5000,
> > maxIdle = 10,
> > maxTotal = 30
> >
> > Make more sense than the DBCP2 defaults?
> >
>
> Only if you think so, I'm sorry I can't offer any guidelines for your
> application.
>
>
> >
> > maxWaitMillis = indefinitely,
> > maxIdle = 8,
> > maxTotal = 8
> >
> > Perhaps having higher maxIdle and maxTotal can't hurt as these are
> > maximum bounds but the unusually (right?) low maxWaitMillis seems like
> > it could easily cause problems, right?
> >
>
> Maybe someone else here has generic advice for you but I do not, as the
> customers I've seen at work all have highly variable needs, configurations,
> and operating environments, everything from Linux, Windows, to IBM i/Series
> and z/Series.
>
>
> > Also, these are the only properties wrapped into the configurable DSpace
> > configuration. What other properties are those most commonly tweaked from
> > DBCP2 defaults?
> >
>
> Again, this is highly dependent on your use case. You'll have to experiment
> within your operating environment.
>
> Gary
>
>
> > Happy new year
> > Hrafn
> >
> > On Tue, Dec 29, 2020 at 2:31 PM Gary Gregory 
> > wrote:
> >
> > > Hi,
> > >
> > > I think you will have to ask the DSpace committers why they chose those
> > > specific values.
> > >
> > > Gary
> > >
> > > On Mon, Dec 21, 2020, 00:27 Hrafn Malmquist
> > > wrote:
> > >
> > > > Hi Gary
> > > >
> > > > Thanks for taking the time to respond.
> > > >
> > > > I hope you can bear with me as I am still learning about database
> > > > connection pooling.
> > > >
> > > > Perhaps I did not ask the question correctly. I am not asking about a
> > > > site specific setup but rather what defaults should be shipped with the
> > > > software. I am part of the minor version release team.
> > > >
> > > > Currently, the default setup is a DBCP2 v. 2.1.1 connection pool with
> > > > only maxWaitMillis,
> > > > maxIdle and maxTotal configurable in the DSpace configuration settings
> > > > and the default values for these settings set to 5000, 10 and 30
> > > > respectively.
> > > > It's unclear why these defaults were chosen to begin with, git blame
> > > > shows they were chosen back in 2015. I don't think a lot