This is the test scenario: 6 nodes in an N+1 setup where N=5. Each of the 5
active nodes creates one asynchronous checkpoint with 40k sections and about
1k of data per section. The standby node opens all 5 checkpoints and uses the
hot standby callback to read in the data from all of them. The standby node
is therefore reading 200k sections of 1k each. The active nodes are constantly
writing 1k of data into all 40k sections, so the standby is constantly
receiving updates for 200k sections.
Performance without this patch:
1) Under this load, creating 40k sections takes anywhere from 98 to 337
seconds. This includes timeouts returned by saCkptSectionCreate on some of the
active nodes.
2) Writing 1k into 20k sections (not even the full 40k) at once can take up
to 92 seconds, and rarely takes less than 10 seconds.
3) CPU load for ckptnd on the standby reading in the 200k sections is 100%.
4) The hot standby callback on the standby cannot keep up with all the data
coming from the other nodes. Updates take minutes to reach the standby.
Performance with the patch:
1) Creation of 40k sections under this load takes 3 or 4 seconds. No timeouts
at all.
2) Writing 1k into 40k sections at once takes 1 second or less.
3) CPU load for ckptnd on the standby node is now 35%.
4) The hot standby callback on the standby easily keeps up with all the data
being checkpointed from the other nodes. Updates reach the standby immediately,
even under this heavy load.
---
**[tickets:#770] CKPT service performance enhancements**
**Status:** review
**Milestone:** future
**Created:** Thu Feb 06, 2014 07:01 PM UTC by Alex Jones
**Last Updated:** Tue May 20, 2014 09:29 PM UTC
**Owner:** Alex Jones
The checkpoint service has major performance problems when a lot of sections
(more than 5k) are in use.
Attached is a patch which addresses the following problems:
1) The section id database is implemented as a linked list, so searching for a
section id takes a long time (the patch makes it a C++ STL map).
2) MAX_SYNC_TRANSFER_SIZE is too large and causes MDS timeouts.
3) The SectionCreate message should be asynchronous when ACTIVE_REPLICA is
specified.
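To illustrate item 1, here is a minimal sketch of the difference between the
two lookups. It is not the code from the patch: the Section struct, the
"secN" id format, and the helper names find_in_list, find_in_map and fill are
all made up for illustration, and the real section id type in ckptnd may
differ.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>

// Hypothetical stand-in for a checkpoint section; not the real ckptnd type.
struct Section {
    std::string id;
    std::string data;
};

// Linked-list database: every lookup walks the list, O(n) per search.
const Section* find_in_list(const std::list<Section>& sections,
                            const std::string& id) {
    for (const Section& s : sections) {
        if (s.id == id) return &s;
    }
    return nullptr;
}

// std::map database keyed by section id: O(log n) per search.
const Section* find_in_map(const std::map<std::string, Section>& sections,
                           const std::string& id) {
    auto it = sections.find(id);
    return it == sections.end() ? nullptr : &it->second;
}

// Fill both containers with n sections named "sec0" .. "sec<n-1>",
// each holding 1k of data (assumed id format, for illustration only).
void fill(std::size_t n, std::list<Section>& as_list,
          std::map<std::string, Section>& as_map) {
    for (std::size_t i = 0; i < n; ++i) {
        Section s{"sec" + std::to_string(i), std::string(1024, 'x')};
        as_list.push_back(s);
        as_map.emplace(s.id, s);
    }
}
```

With 40k sections, a list search for a late or missing id compares against up
to 40k entries, while the map needs on the order of 16 comparisons. If every
create or write does one such search, the list cost grows quadratically with
the number of sections.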
There is still one more problem that I haven't delved into yet.
The scenario is 5 active checkpoints in the cluster, each with 40k sections
and 1k worth of data in each section, with a standby node that opens all of
them (200k total sections) and iterates through all of them while those
checkpoints are being actively written. In that case it takes on average 15
seconds to iterate through each checkpoint, and sometimes the iteration
function returns TIMEOUT. Hopefully, this can be improved.