Thanks, Mike - I (and hopefully future generations of Evergreen admins) appreciate the clarifications!
Chris

----- Original Message -----
> From: "Mike Rylander" <mrylan...@gmail.com>
> To: "Evergreen Discussion Group" <open-ils-general@list.georgialibraries.org>
> Sent: Wednesday, February 10, 2016 3:56:01 PM
> Subject: Re: [OPEN-ILS-GENERAL] Evergreen 2.9 and PostgreSQL 9.4 - A Cautionary Tale
>
> Chris,
>
> I was happy to be able to perform that emergency diagnosis and immediately implement several of the changes you listed, and glad you were able to move quickly to implement my recommendations requiring restarts. I know it was a stressful time for your team, and it's frustrating when the things you've tried in the past don't help in new situations.
>
> I want to provide some context regarding the specific changes I made, and had you make, just so folks can make the best use of the information out there. Though I'm sure you know this, the details of tuning any given Postgres instance are extremely dependent on the particular data set, even when you're using the same application (Evergreen in this case).
>
> With regard to the join collapse limit, there are many other data sets that do not show the same sensitivity to the setting as PINES. Most are well served by the bug fix on https://bugs.launchpad.net/evergreen/+bug/1527731 . There is a not-insignificant cost to increasing this value, so there is a trade-off to be weighed. In PINES' case, the increase in large-query planning time was far outweighed by the cost of the very slow queries being seen, of course. You'll notice that the much smaller increase described in the bug was all that was needed in Dan Wells' case, though even that was not necessary with the patch. An overly high setting here can be detrimental to performance in general -- IOW, setting the join collapse limit as high as I've had you set it for PINES is definitely not a blanket optimization opportunity.
>
> For the ROWS adjustments: these can have a significant effect on plan creation (positive or negative) and are directly dependent on the data set. The change was also critical in combination with the join collapse limit change, as otherwise row estimates in the PINES data set were still pushing plans in a sub-optimal direction. The specific setting of 300 is only applicable to PINES. It feeds into the calculus that Postgres uses when constructing query plans, and must be tuned to the data set if it needs to be tuned at all. In many cases it does not need to be adjusted, and a high setting here can cause bad plans. While it does look like the value is essentially the number of org units in the instance, identifying the best value is not simply a matter of setting ROWS to approximately the number of org units you have -- there are situations (and configurations) where that isn't the correct thing to do, and it requires instance-specific evaluation to determine what's best.
>
> Indexes are also extremely data-set dependent, of course, and their cost comes from regular index maintenance. Obviously there's no need for multiple, identical indexes, so removing those is always a good recommendation if they are, indeed, duplicates. Adding an index is (relative to other tuning measures) pretty cheap if you have the drive space, but action.circulation is a pretty hot table (it changes a lot, and quickly), so unless there is a problem with timeouts to a site's collections vendor, I wouldn't recommend adding that specific index.
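[Chris here, stepping outside the quote for a moment: for anyone who wants to experiment with the planner settings Mike describes above, this is roughly what the moving parts look like on PostgreSQL 9.4. The values (12 and 300) are the PINES-specific numbers from this thread and are shown purely for illustration; ALTER SYSTEM requires superuser and is 9.4+, and none of this should go near production untested.]

    -- Raise the planner's join collapse limit cluster-wide, then reload;
    -- sessions that haven't SET it locally pick up the new value.
    ALTER SYSTEM SET join_collapse_limit = 12;
    SELECT pg_reload_conf();
    SHOW join_collapse_limit;

    -- Or trial it in a single session first, then EXPLAIN ANALYZE a
    -- known-slow query and compare its plan against the default of 8:
    SET join_collapse_limit = 12;

    -- Before changing a function's ROWS estimate, sanity-check it against
    -- reality (here assuming org unit 1 is your consortium root):
    SELECT count(*) FROM actor.org_unit_descendants(1);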
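[Likewise, here is one rough way to hunt for the kind of duplicate indexes Mike mentions. This query is not from the thread -- just a sketch that groups pg_index entries sharing a table, column list, operator classes, expressions, and predicate. Eyeball any candidates with \d before dropping, since near-duplicates (e.g. unique vs. non-unique) need human judgment.]

    -- Report groups of indexes whose definitions are effectively identical.
    SELECT array_agg(idx) AS identical_indexes
      FROM (SELECT indexrelid::regclass AS idx,
                   indrelid::text || ' ' || indkey::text || ' ' ||
                   indclass::text || ' ' ||
                   coalesce(indexprs::text, '') || ' ' ||
                   coalesce(indpred::text, '') AS def
              FROM pg_index) s
     GROUP BY def
    HAVING count(*) > 1;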
>
> For shared buffers, I can't think of a situation where I would recommend more than 32G, and even for PINES 16G is probably enough. However, since 32G is working well for you, I wouldn't change it at this point. A large shared buffers setting causes a lot of waste in terms of RAM (pages are loaded twice, once in shared buffers and once in the OS page cache; wasted bookkeeping memory for the page map) and CPU time (locking stalls, CHECKPOINT timeout I/O storms, buffer eviction clock-sweep work, load on the background writer, just to name a few). This is definitely one of those "every instance is different" things.
>
> Having said all that, disabling transparent huge pages is pretty much critical -- it's as important as disabling zone reclaim on any NUMA system. This was a non-issue on 9.2, became important in some situations on 9.3, and is a huge issue for 9.4. The reasons are complicated, but the cause is Postgres' change from System V to POSIX shared memory. Basically, just do it. :)
>
> Thanks for sharing, Chris, and again, I was glad to help when needed.
>
> --
> Mike Rylander
> | President
> | Equinox Software, Inc. / The Open Source Experts
> | phone: 1-877-OPEN-ILS (673-6457)
> | email: mi...@esilibrary.com
> | web: http://www.esilibrary.com
>
> On Wed, Feb 10, 2016 at 2:41 PM, Chris Sharp <csh...@georgialibraries.org> wrote:
> > Hello all,
> >
> > As watchers of the Evergreen IRC channel know, the PINES upgrade from Evergreen 2.7.2/PostgreSQL 9.3 to Evergreen 2.9.1/PostgreSQL 9.4 was a bit bumpy. Our system slowed to a crawl on go-live day, and we quickly determined that the culprit was the database. After consulting with Mike Rylander of Equinox Software, we implemented several tweaks to our system that I'll share here in hopes that they may help others in the future.
> >
> > The slowness appears to have been due to two problem areas:
> >
> > Bad Query Plans:
> >
> > - We upped join_collapse_limit from the default value of 8 to 12. Dan Wells discusses this possibility in his bug report about slow copy load in the OPAC (https://bugs.launchpad.net/evergreen/+bug/1527731).
> > - We altered the actor.org_unit_descendants functions, adding a realistic estimate of the number of rows to be expected:
> >
> >   ALTER FUNCTION actor.org_unit_descendants(int) ROWS 300;
> >   ALTER FUNCTION actor.org_unit_descendants(int,int) ROWS 300;
> >
> > - We dropped a duplicate index (ahr_current_copy_open_idx) on the action.hold_request table that was apparently added to address an issue in the past.
> > - We added an index to action.circulation to address an issue with slow collections queries:
> >
> >   CREATE INDEX CONCURRENTLY circ_circ_lib_open_idx ON action.circulation (usr,circ_lib) WHERE xact_finish IS NULL;
> >
> > PostgreSQL Memory Usage:
> >
> > - We reduced our shared_buffers size to 32GB from 128GB. We had calculated the higher value from the PostgreSQL documentation's recommendation of 1/4 of installed RAM (512GB in our case; see http://www.postgresql.org/docs/9.4/static/runtime-config-resource.html#GUC-SHARED-BUFFERS). We immediately saw a reduction of resource consumption on the server.
> > - On the underlying OS, we turned off transparent huge pages (THP); see http://unix.stackexchange.com/questions/99154/disable-transparent-hugepages for details on how one might do this. Further research revealed that turning off THP is generally recommended by both Oracle and PostgreSQL.
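[Chris again: since the shared_buffers change comes up a lot, note that on 9.4 it can be made without hand-editing postgresql.conf. A sketch only, using our 32GB value as the illustration rather than as a recommendation; as Mike says above, 16G is probably plenty for most instances.]

    -- shared_buffers only changes at server start: ALTER SYSTEM records the
    -- value in postgresql.auto.conf, and it takes effect after a restart.
    ALTER SYSTEM SET shared_buffers = '32GB';
    -- ...restart PostgreSQL, then verify:
    SHOW shared_buffers;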
> >
> > We also followed Mike's recommendation to run the hold targeter cron job nightly rather than every 15 minutes, but that probably didn't have much to do with the root cause (aside from adding queries to the pile-up when slowness struck).
> >
> > Since we made the above changes, our database servers have sped up considerably and are performing very happily. I just wanted to share our experience with the Evergreen Community in hopes that others can benefit.
> >
> > --
> > Chris Sharp
> > PINES System Administrator
> > Georgia Public Library Service
> > 1800 Century Place, Suite 150
> > Atlanta, Georgia 30345
> > (404) 235-7147
> > csh...@georgialibraries.org
> > http://pines.georgialibraries.org/

--
Chris Sharp
PINES System Administrator
Georgia Public Library Service
1800 Century Place, Suite 150
Atlanta, Georgia 30345
(404) 235-7147
csh...@georgialibraries.org
http://pines.georgialibraries.org/