> On 24 Oct 2018, at 17:56, thierry bordaz <[email protected]> wrote: > > > > On 10/21/2018 03:55 AM, William Brown wrote: >> Thanks for this write up Thierry, >> >>> On 19 Oct 2018, at 17:15, thierry bordaz <[email protected]> wrote: >>> >>> Hi, >>> >>> C10K is a scalability problem that a server can face when dealing with >>> events of thousands of connections (i.e. clients) at the same time. Events >>> can be new connections, new operations on the established connections, >>> closure of connection (from client or server) >>> For 389-ds, C10K problem was resolved with a new framework Nunc-Stans [1]. >>> Nunc-stans was first enabled in RHDS 7.4 and improved/fixed in 7.5. >>> Robustness issues [2] and [3] were reported in 7.5 and it was decided to >>> disable Nunc-stans. It is not known if those issues exist or not in 7.4. >>> >>> William posted a PR to fix those two issues [4]. Nunc-stans is a complex >>> framework, with its own dynamic. Review of this PR is not easy and even a >>> careful review may not guaranty it will fix [2] and [3] and may not >>> introduce others unexpected side effects. >>> >>> From there we discussed two options (but there may be others): >>> >>> • Review and merge the PR [4], then later run some intensive tests >>> aiming to verify [2],[3] and checking the robustness in order to reenable NS >> I think this is the best solution. If NS is “not working” today, then >> merging the PR won’t make it “not work” any less ;) >> >> Given we have a reliable way to disabled it at build time, I think that >> merging, testing, and then eventually discussion and decision of enabling by >> default is the best plan. > > Hi William, > > The existing set of tests reveal a single failure[1]. That shows the PR is > really promising but at the same time that tests are crucial with NS. Like > the two robustness bugs (conn leak and deadlock) only happening is some > specific conditions. > > I think review on large NS PR is bit formal step and the most important are > tests/measure. We may agree to push the NS patches base on existing tests > results. New tests/measure (for benefits and [2] and [3]) on top of the > current PR could be the option. Let's wait for additional feedback
Sure. I will work on Viktor's comments and found issues with paged search, and I’ll commit some more time to running the tests if I can. The reason I’m pushing to “just merge and test/fix later” is that at the moment I have very little time, and I don’t want to see my work bit-rot over time, and get further and further away. It’s going to make future fixes and improvements to the space easier once it’s merged. Saying that I think it would be good to have a deadline or a list of people we want comments from. Mark, Simon, Ludwig, Viktor, yourself and myself come to mind … we need to know what our approach is, and this ties back to the “feature gating” discussion that I hope you saw. We need a way to have features we are working on, so we can test and reason about them, but without letting them out in production - and having ways to roll the back. Thanks, > > [1] https://pagure.io/389-ds-base/pull-request/49636#comment-66941 > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1608746 deadlock > [3] https://bugzilla.redhat.com/show_bug.cgi?id=1605554 connection leaks > > best regards > thierry > >>> • Build some tests for >>> • measure the benefit of NS as [2] and [3] do not prevent some >>> performance tests >>> • identify possible reproducers for [2] and [3] >>> • create robustness and long duration NS specific tests >>> • review and merge the PR [4] >>> As PR [4] is not intended for perf improvement, the step 2.1 will impact >>> the priority according to the performance benefits. >> I think that the test plan is good. 2 and 3 will be hard to reproduce I >> think, they seem like they require complex state and interactions. The goal >> of the PR I have open is to reduce the complexity of NS integration into DS, >> so that tests for situations like 2 and 3 can be more easily created (and >> hopefully the simpler integration is going to resolve some issues). >> >> The “benefit” of NS is hard to say - because the other parts of the server >> core are still needing threading improvement, NS may help some aspects >> (connection acceptance performance for example), we won’t see all the of the >> benefits until we improve other parts. There are many bottlenecks to fix! An >> example is lib globs high use of atomics (should be COW), and the plugin >> architecture. >> >> At a theoretical level, NS will be faster as we don’t need to lock and poll >> over the connection table, but today the CT may not be the bottleneck, so NS >> may not make this “faster” just by enabling. >> >>> Comments are welcomed >>> >>> Regarding 2.1 plan we made the following notes for the test plan: >>> The benefit of Nunc-Stans can only be measure with a large number of >>> connections (i.e. client) above 1000. That means a set of clients (sometime >>> all) should keep their connection opened. Clients should run on several >>> hosts so that clients are not the bootleneck. >>> >>> For the two types of events (new connection and new operations), the >>> measurement could be >>> >>> • Event: New connections >>> • Start all clients in parallel to establish connections >>> (keeping them opened) take the duration to get 1000, 2000, ... 10000 >>> connections and check there are drop or not >>> • Establish 1000 connections and monitor during to open 100 >>> more, the same starting with 2000, 10000 >>> • Client should not run any operations during the monitoring >>> • Event: New operations >>> • Start all clients and when 1000 connections are established, >>> launch simple operations (e.g. search -s base -b "" objectclass) and >>> monitor how many of them can be handled. The same with 2000, ... 10000. >>> • response time and workqueue length could be monitored to be >>> sure the bottleneck are not the worker. >> At one point I started (but did not finish) ldclt integration with lib389. >> It would be great to have a series of stress and acceptance tests like these >> available in lib389 for our release validations, so I believe that their >> creation is valuable regardless of NS. >> >>> [1] http://www.port389.org/docs/389ds/design/nunc-stans.html >>> >>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1608746 deadlock >>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1605554 connection leaks >>> >>> [4] https://pagure.io/389-ds-base/pull-request/49636 >>> >>> _______________________________________________ >>> 389-devel mailing list -- [email protected] >>> To unsubscribe send an email to [email protected] >>> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html >>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines >>> List Archives: >>> https://lists.fedoraproject.org/archives/list/[email protected] >> — >> Sincerely, >> >> William >> >> >> _______________________________________________ >> 389-devel mailing list -- [email protected] >> To unsubscribe send an email to [email protected] >> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html >> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines >> List Archives: >> https://lists.fedoraproject.org/archives/list/[email protected] > _______________________________________________ > 389-devel mailing list -- [email protected] > To unsubscribe send an email to [email protected] > Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html > List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines > List Archives: > https://lists.fedoraproject.org/archives/list/[email protected] — Sincerely, William _______________________________________________ 389-devel mailing list -- [email protected] To unsubscribe send an email to [email protected] Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/[email protected]
