This is the test scenario: 6 nodes in an N+1 setup where N=5.  The 5 active 
nodes each create one asynchronous checkpoint with 40k sections and about 1k of 
data per section.  The standby node opens all 5 checkpoints and uses the hot 
standby callback to read in the data from all of them.  The standby node is 
therefore reading 200k sections of 1k each.  The active nodes are constantly 
writing 1k of data into all 40k sections, so the standby is constantly getting 
updates for 200k sections.
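
As a rough check on the size of this load, the arithmetic can be written out 
as below.  This is only a sketch: the constant and function names are made up 
for illustration, and it assumes "1k" means 1024 bytes, which the ticket does 
not state.

```cpp
#include <cstdint>

// Figures taken from the scenario description above.
constexpr std::uint64_t kActiveNodes = 5;
constexpr std::uint64_t kSectionsPerCheckpoint = 40000;
// Assumption: "1k" is 1024 bytes; the ticket only says "about 1k".
constexpr std::uint64_t kBytesPerSection = 1024;

// Total number of sections the standby has to track.
constexpr std::uint64_t total_sections() {
    return kActiveNodes * kSectionsPerCheckpoint;
}

// Total payload the standby holds once every section has been read.
constexpr std::uint64_t total_bytes() {
    return total_sections() * kBytesPerSection;
}

static_assert(total_sections() == 200000, "5 checkpoints x 40k sections");
static_assert(total_bytes() == 204800000, "about 195 MiB of section data");
```

Under that assumption, one full pass over every section moves roughly 195 MiB 
to the standby.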

    Performance without this patch:

1) Under this load creating 40k sections takes anywhere from 98 seconds to 337 
seconds.  This includes timeouts returned by saCkptSectionCreate on some of the 
active nodes.

2) Writing 1k into 20k sections at once (not even the full 40k) can take up to 
92 seconds and rarely takes less than 10 seconds.

3) CPU load for ckptnd on the standby reading in the 200k sections is 100%.

4) The hot standby callback on the standby cannot keep up with all the data 
coming from the other nodes.  It is taking minutes for updates to reach the 
standby.

    Performance with the patch:

1) Creation of 40k sections under this load takes 3 or 4 seconds.  No timeouts 
at all.

2) Writing 1k into 40k sections at once takes 1 second or less.

3) CPU load for ckptnd on the standby node is now 35%.

4) The hot standby callback on the standby easily keeps up with all the data 
being checkpointed from the other nodes.  Updates reach the standby immediately, 
even under this heavy load.



---

**[tickets:#770] CKPT service performance enhancements**

**Status:** review
**Milestone:** future
**Created:** Thu Feb 06, 2014 07:01 PM UTC by Alex Jones
**Last Updated:** Tue May 20, 2014 09:29 PM UTC
**Owner:** Alex Jones

The checkpoint service has some major performance problems when using a lot of 
sections (greater than 5k).

Attached is a patch which addresses the following problems:

1) The section id database is implemented as a linked list, so searching for a 
section id takes a long time (the patch makes it a C++ STL map)
2) MAX_SYNC_TRANSFER_SIZE is too large and causes MDS timeouts
3) The SectionCreate message should be asynchronous when ACTIVE_REPLICA is 
specified
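
To illustrate the first item, here is a minimal sketch of a section id 
database built on `std::map`.  It is not the code from the attached patch: 
`Section`, `SectionDb` and `populate` are hypothetical names, and using the 
section id bytes as a `std::string` key is an assumption made for 
illustration.  The point is only that lookup, insert and erase become 
O(log n) instead of an O(n) walk of a linked list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical section record; the real CKPT structures are not shown here.
struct Section {
    std::vector<std::uint8_t> data;
};

// Hypothetical section database keyed by the section id bytes.
class SectionDb {
public:
    // Returns false if a section with this id already exists.
    bool create(const std::string& id, std::vector<std::uint8_t> data) {
        return sections_.emplace(id, Section{std::move(data)}).second;
    }

    // Returns nullptr when no section has this id.
    Section* find(const std::string& id) {
        auto it = sections_.find(id);
        return it == sections_.end() ? nullptr : &it->second;
    }

    // Returns true if a section was actually removed.
    bool remove(const std::string& id) { return sections_.erase(id) > 0; }

    std::size_t size() const { return sections_.size(); }

private:
    std::map<std::string, Section> sections_;
};

// Fills the database with `count` zero-filled sections of `bytes` bytes each,
// using made-up ids of the form "section-<i>".
void populate(SectionDb& db, std::size_t count, std::size_t bytes) {
    for (std::size_t i = 0; i < count; ++i) {
        bool created = db.create("section-" + std::to_string(i),
                                 std::vector<std::uint8_t>(bytes, 0));
        assert(created);
        (void)created;  // avoid an unused-variable warning under NDEBUG
    }
}
```

With 40k sections per checkpoint, a lookup in a balanced tree needs on the 
order of 16 key comparisons, where a list search needs 20k on average.  An 
unordered (hash) map would be another option if ordered iteration over 
section ids is not needed.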

There is still one more problem that I haven't delved into yet.

When there are 5 active checkpoints in the cluster, each with 40k sections and 
1k of data in each section, and a standby node opens all of them (200k sections 
in total) and iterates through them while they are being actively written, each 
checkpoint takes 15 seconds on average to iterate through, and the iteration 
function sometimes returns TIMEOUT.  Hopefully, this can be improved.

