21.07.2014 13:37, Andrew Beekhof wrote:
>
> On 21 Jul 2014, at 3:09 pm, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
>
>> 21.07.2014 06:21, Andrew Beekhof wrote:
>>>
>>> On 18 Jul 2014, at 5:16 pm, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
>>>
>>>> Hi Andrew, all,
>>>>
>>>> I have a task which seems to be easily solvable with the use of a
>>>> globally-unique clone: start a huge number of specific virtual machines
>>>> to provide load to a connection multiplexer.
>>>>
>>>> I decided to look at how pacemaker behaves in such a setup with the
>>>> Dummy resource agent, and found that handling of every instance in an
>>>> "initial" transition (probe+start) slows down as clone-max increases.
>>>
>>> "yep"
>>>
>>> For non-unique clones the number of probes needed is N, where N is the
>>> number of nodes.
>>> For unique clones, we must test every instance and node combination, or
>>> N*M, where M is clone-max.
>>>
>>> And that's just the running of the probes... just figuring out which
>>> nodes need to be probed is incredibly resource intensive (run
>>> crm_simulate and it will be painfully obvious).
>>>
>>>>
>>>> E.g. for 256 instances the transition took 225 seconds, ~0.88s per
>>>> instance. After I added 768 more instances (set clone-max to 1024)
>
> How many nodes though?
Two nodes run in VMs.

> Assuming 3, that's still only ~1s per operation (including the time taken
> to send the operation across the network twice and update the cib).
>
>>>> together with increasing batch-limit to 512, the transition took almost
>>>> an hour (3507 seconds), or ~4.57s per added instance. Even if I take
>>>> into account that monitoring of already started instances consumes some
>>>> resources, the last number seems to be rather big,
>>
>> I believe this ^ is the main point.
>> If with N instances probe/start of _each_ instance takes X time slots,
>> then with 4*N instances probe/start of _each_ instance takes ~5*X time
>> slots. In an ideal world, I would expect it to remain constant.
>
> Unless you have 512 cores in the cluster, increasing the batch-limit in
> this way is certainly not going to give you the results you're looking
> for. Firing more tasks at a machine just ends up producing more context
> switches as the kernel tries to juggle the various tasks.
>
> More context switches == more CPU wasted == more time taken overall ==
> completely consistent with your results.

Thanks to oprofile, I was able to gain an 8-9% speedup with the following
patch:

=========
diff --git a/crmd/te_utils.c b/crmd/te_utils.c
index 2167370..c612718 100644
--- a/crmd/te_utils.c
+++ b/crmd/te_utils.c
@@ -374,8 +374,6 @@ te_graph_trigger(gpointer user_data)
     graph_rc = run_graph(transition_graph);
     transition_graph->batch_limit = limit; /* Restore the configured value */
 
-    print_graph(LOG_DEBUG_3, transition_graph);
-
     if (graph_rc == transition_active) {
         crm_trace("Transition not yet complete");
         return TRUE;
diff --git a/crmd/tengine.c b/crmd/tengine.c
index 765628c..ec0e1d4 100644
--- a/crmd/tengine.c
+++ b/crmd/tengine.c
@@ -221,7 +221,6 @@ do_te_invoke(long long action,
         }
 
         trigger_graph();
-        print_graph(LOG_DEBUG_2, transition_graph);
 
         if (graph_data != input->xml) {
             free_xml(graph_data);
=========

Results this time are measured only for a clean start op, after probes are
done (add a stopped clone, wait for probes to complete and then start the
clone):

256 (vanilla):  09:51:50 - 09:53:17 =>  1:27 =   87s => 0.33984375 s per instance
1024 (vanilla): 10:17:10 - 10:34:34 => 17:24 = 1044s => 1.01953125 s per instance
1024 (patched): 11:59:26 - 12:15:12 => 15:46 =  946s => 0.92382813 s per instance

So, still not perfect, but better.

Unfortunately, my binaries are built with optimization, so I'm not able to
get call graphs yet. Also, as I run in VMs, no hardware support for oprofile
is available, so results may be a bit inaccurate.
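A possible alternative to deleting those calls outright (just a sketch, not
something I have measured; it assumes crm_log_level from
crm/common/logging.h and the LOG_DEBUG_* constants are usable at those call
sites) would be to keep the dumps but guard them with the current log
level, so print_graph()/print_synapse() only do their formatting work when
that verbosity is actually enabled:

=========
/* Sketch only: skip the per-synapse formatting in print_graph() unless
 * LOG_DEBUG_3 output would actually be emitted.  Assumes crm_log_level
 * is the variable that governs that decision here. */
if (crm_log_level >= LOG_DEBUG_3) {
    print_graph(LOG_DEBUG_3, transition_graph);
}
=========

That would keep the debugging aid available while avoiding its cost in the
common case.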
Here is the system-wide opreport top for the unpatched crmd with 1024
instances:

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name                app name                  symbol name
429963   41.3351  no-vmlinux                no-vmlinux                /no-vmlinux
129533   12.4528  libxml2.so.2.7.6          libxml2.so.2.7.6          /usr/lib64/libxml2.so.2.7.6
101326    9.7411  libc-2.12.so              libc-2.12.so              __strcmp_sse42
 42524    4.0881  libtransitioner.so.2.0.1  libtransitioner.so.2.0.1  print_synapse
 37062    3.5630  libc-2.12.so              libc-2.12.so              malloc_consolidate
 23268    2.2369  libcrmcommon.so.3.2.0     libcrmcommon.so.3.2.0     find_entity
 21416    2.0589  libc-2.12.so              libc-2.12.so              _int_malloc
 18950    1.8218  libcrmcommon.so.3.2.0     libcrmcommon.so.3.2.0     crm_element_value
 17482    1.6807  libfreebl3.so             libfreebl3.so             /lib64/libfreebl3.so
 15350    1.4757  libc-2.12.so              libc-2.12.so              vfprintf
 15016    1.4436  libqb.so.0.16.0           libqb.so.0.16.0           /usr/lib64/libqb.so.0.16.0
 13189    1.2679  bash                      bash                      /bin/bash
 11375    1.0936  libc-2.12.so              libc-2.12.so              _int_free
 10762    1.0346  libtotem_pg.so.5.0.0      libtotem_pg.so.5.0.0      /usr/lib64/libtotem_pg.so.5.0.0
 10345    0.9945  libc-2.12.so              libc-2.12.so              _IO_default_xsputn
...

And with the patch:

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name                app name                  symbol name
434810   46.2143  no-vmlinux                no-vmlinux                /no-vmlinux
125397   13.3280  libxml2.so.2.7.6          libxml2.so.2.7.6          /usr/lib64/libxml2.so.2.7.6
 85259    9.0619  libc-2.12.so              libc-2.12.so              __strcmp_sse42
 33563    3.5673  libc-2.12.so              libc-2.12.so              malloc_consolidate
 18885    2.0072  libc-2.12.so              libc-2.12.so              _int_malloc
 16714    1.7765  libcrmcommon.so.3.2.0     libcrmcommon.so.3.2.0     crm_element_value
 14966    1.5907  libfreebl3.so             libfreebl3.so             /lib64/libfreebl3.so
 14510    1.5422  libc-2.12.so              libc-2.12.so              vfprintf
 13664    1.4523  bash                      bash                      /bin/bash
 13505    1.4354  libcrmcommon.so.3.2.0     libcrmcommon.so.3.2.0     find_entity
 12605    1.3397  libqb.so.0.16.0           libqb.so.0.16.0           /usr/lib64/libqb.so.0.16.0
 10855    1.1537  libc-2.12.so              libc-2.12.so              _int_free
  9857    1.0477  libc-2.12.so              libc-2.12.so              _IO_default_xsputn
...

>
>> Otherwise we have an issue with scalability in this direction.
>>
>>>>
>>>> The main CPU consumer on the DC while the transition is running is
>>>> crmd. Its memory footprint is around 85Mb, and the resulting CIB size
>>>> together with the status section is around 2Mb,
>>>
>>> You said CPU and then listed RAM...
>>
>> Something wrong with that? :)
>> Those are just three distinct facts.
>
> I was expecting quantification of the relative CPU usage.
> I was also expecting the PE to have massive spikes whenever a new
> transition is calculated.
>
>>
>>>
>>>>
>>>> Could this use-case be optimized, in your opinion, with minimal effort?
>>>> Could it be optimized with just configuration? Or may it be some
>>>> trivial development task, e.g. replace one GList with a GHashTable
>>>> somewhere?
>>>
>>> Optimize: yes, Minimal: no
>>>
>>>>
>>>> Sure, I can look deeper and get any additional information, e.g. crmd
>>>> profiling results, if it is hard to answer just off the top of your
>>>> head.
>>>
>>> Perhaps start looking in clone_create_probe()
>>
>> Got it, thanks for the pointer!
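Given that __strcmp_sse42, find_entity and crm_element_value stay near the
top of the profile even with the patch, the GList -> GHashTable idea above
may be the more interesting direction. Here is a minimal, purely
illustrative GLib sketch (the entry type and ids are invented for the
example, nothing below is existing pacemaker code) of indexing named
entries once, so that later lookups become hash probes instead of strcmp()
walks:

=========
#include <glib.h>
#include <stdio.h>

/* Hypothetical example: replace repeated linear strcmp() scans over named
 * entries with a one-time GHashTable index.  The entry type and the ids
 * are made up for illustration only. */
typedef struct {
    const char *id;
    const char *value;
} entry_t;

int main(void)
{
    entry_t entries[] = {
        { "vm-clone:0", "started" },
        { "vm-clone:1", "stopped" },
        { "vm-clone:2", "started" },
    };

    /* Build the index once. */
    GHashTable *idx = g_hash_table_new(g_str_hash, g_str_equal);
    for (gsize i = 0; i < G_N_ELEMENTS(entries); i++) {
        g_hash_table_insert(idx, (gpointer) entries[i].id, &entries[i]);
    }

    /* Each lookup is now a hash probe rather than a walk over a list. */
    entry_t *hit = g_hash_table_lookup(idx, "vm-clone:1");
    if (hit != NULL) {
        printf("%s -> %s\n", hit->id, hit->value);
    }

    g_hash_table_destroy(idx);
    return 0;
}
=========

Whether something like this is applicable to clone_create_probe() or to the
XML lookups behind find_entity()/crm_element_value() is something I still
need to check.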
>>
>>>
>>>>
>>>> Best,
>>>> Vladislav

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org