On 08/10/2016 07:45 AM, Michael Mol wrote:
On Tuesday, August 09, 2016 05:22:22 PM james wrote:
On 08/09/2016 01:41 PM, Michael Mol wrote:
On Tuesday, August 09, 2016 01:23:57 PM james wrote:

The exception is my storage cluster, which has dirty_bytes much higher, as
it's very solidly battery backed, so I can use its oodles of memory as a
write cache, giving its kernel time to reorder writes and flush data to
disk efficiently, and letting clients very rapidly return from write
requests.
Are these TSDB (time-series database) data, by chance?

No; my TS data is stored in a MySQL VM whose storage is host-local.


OK, so have you systematically experimented with these parameter
settings, collecting and correlating the data for domain-specific needs?

Not with these particular settings; what they *do* is fairly straightforward,
so establishing configuration constraints is a function of knowing the capacity
and behavior of the underlying hardware; there's little need to guess.

For hypothetical example, let's say you're using a single spinning rust disk
with an enabled write cache of 64MiB. (Common enough, although you should
ensure the write cache is disabled if you find yourself at risk of sudden
power loss. You should be able to script that with nut, or even acpid,
though.) That means the disk could queue up 64MiB of data to be written,
and efficiently reorder
writes to flush them to disk faster. So, in that circumstance, perhaps you'd
set dirty_background_bytes to 64MiB, so that the kernel will try to feed it a
full cache's worth of data at once, giving the drive a chance to optimize its
write ordering.
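
A minimal sketch of what applying that might look like, with the 64MiB
figure taken straight from the example rather than detected (hdparm -I
usually reports a drive's on-board cache size); writing the knob needs root:

    # Sketch: match vm.dirty_background_bytes to the drive's on-board
    # write cache. 64 MiB is the assumed size from the example above,
    # not something this script detects.

    WRITE_CACHE_BYTES = 64 * 1024 * 1024

    with open("/proc/sys/vm/dirty_background_bytes", "w") as f:
        f.write(str(WRITE_CACHE_BYTES))

    # Note: setting dirty_background_bytes zeroes dirty_background_ratio;
    # the kernel only honours one of that pair at a time.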

For another hypothetical example, let's say you're using a parity RAID array
with three data disks and two parity disks, with a strip length of 1MiB. Now,
with parity RAID, if you modify a small bit of data, when that data gets
committed to disk, the parity bits need to get updated as well. That means
that small write requires first reading the relevant portions of all three data
disks, holding them in memory, adjusting the portion you wrote to, calculating
the parity, and writing the result out to all five disks. But if you make a
*large* write that replaces all of the data in the stripe (so, a well-placed
3MiB write, in this case), you don't have to read the disks to find out what
data was already there, and can simply write out your data and parity. In this
case, perhaps you want to set dirty_background_bytes to 3MiB (or some multiple
thereof), so that the kernel doesn't try flushing data to disk until it has a
full stripe's worth of material, and can forgo a time-consuming initial read.
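
A rough sketch of that arithmetic, using the assumed geometry from the
example (three data disks, 1MiB strip); on a real md array the chunk size
and disk count come from mdadm --detail:

    # Sketch: size vm.dirty_background_bytes to whole parity-RAID stripes.
    # The geometry below is the assumed example layout (3 data + 2 parity,
    # 1 MiB strip); substitute your real chunk size and disk count.

    DATA_DISKS = 3                     # disks carrying data, parity excluded
    STRIP_BYTES = 1 * 1024 * 1024      # per-disk strip ("chunk") size
    STRIPE_MULTIPLE = 4                # flush in whole-stripe multiples

    full_stripe = DATA_DISKS * STRIP_BYTES      # 3 MiB in this example
    with open("/proc/sys/vm/dirty_background_bytes", "w") as f:
        f.write(str(full_stripe * STRIPE_MULTIPLE))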

For a final hypothetical example, consider SSDs. SSDs share one interesting
thing in common with parity RAID arrays...they have an optimum write size
that's a lot larger than 4KiB. When you write a small amount of data to an
SSD, it has to read an entire block of NAND flash, modify it in its own RAM,
and write that entire block back out to NAND flash. (All of this happens
internally to the SSD.) So, for efficiency, you want to give the SSD an entire
block's worth of data to write at a time, if you can. So you might set
dirty_background_bytes to the size of the SSD's block, because the fewer the
write cycles, the longer it will last. (Different model SSDs will have different
block sizes, ranging anywhere from 512KiB to 8MiB, currently.)
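
Same knob for the SSD case; only the target changes, and the erase-block
size has to come from the vendor's documentation (the 2MiB below is purely
a placeholder). A quick sanity check of what the kernel actually took,
since the bytes and ratio variants override each other:

    # Sketch: apply an assumed 2 MiB erase-block size, then read back the
    # related knobs to see which of the bytes/ratio pairs is in effect.

    ERASE_BLOCK_BYTES = 2 * 1024 * 1024   # placeholder; check the datasheet

    with open("/proc/sys/vm/dirty_background_bytes", "w") as f:
        f.write(str(ERASE_BLOCK_BYTES))

    for knob in ("dirty_background_bytes", "dirty_background_ratio",
                 "dirty_bytes", "dirty_ratio"):
        with open("/proc/sys/vm/" + knob) as f:
            print(knob, "=", f.read().strip())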


OK, after reading some of the docs and postings several times, I see how to focus in on the exact hardware of a specific system. The nice thing about clusters is that they are largely identical systems, or groups of identical systems, in quantity, so that helps with scaling issues. Testing specific hardware individually should lead to near-optimal default settings that can be deployed to cluster nodes later.


As unikernels collide with my work on building minimized and optimized
Linux clusters, my path forward is to use several small clusters where
the codes/frameworks can be changed, even the tweaked and tuned kernels
and DFS, and to note the performance differences for very specific
domain solutions. My examples are quite similar to the flight sim
mentioned above, but the ordinary and uncommon workloads of regular
admin (dev/ops) work are just a different domain.

Ideas on automating the exploration of these settings
(scripts/traces/keystores) are of keen interest to me, just so you know.

I think I missed some context, despite rereading what was already discussed.

Yeah, I was thinking out loud here; just ignore this...

I use OpenRC, just so you know. I also have a motherboard with an IOMMU
that currently has questionable settings in the kernel config file. I
cannot find consensus on whether/how the IOMMU affects I/O with the SATA
HD devices versus memory-mapped peripherals, in the context of 4.x
kernel options. I'm trying very hard here to avoid a deep dive on these
issues, so trendy strategies are most welcome, as workstation and
cluster node optimizations are all I'm really working on atm.

Honestly, I'd suggest you deep dive. An image formed once, with clarity,
will last you a lot longer than ongoing fuzzy and trendy images from
people whose hardware and workflow are likely to be different from yours.

The settings I provided should be absolutely fine for most use cases. The
only exception would be mobile devices with spinning rust, but those are
getting rarer and rarer...

I did a quick test with games-arcade/xgalaga. It's an old, quirky game
with sporadic lag variations. On a workstation with 32GB of RAM and
eight 4GHz 64-bit cores, very lightly loaded, there is no reason for
in-game lag. Your previous settings made it much better and quicker the
vast majority of the time, but not optimal (always responsive).
Experience tells me that if I can tweak a system so that the game stays
responsive while the application mix is concurrently running, then the
quick test plus parameter settings is reasonably well behaved. That then
becomes a baseline for further automated tests and fine tuning of a
system under study.
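
A crude, scriptable stand-in for that "stays responsive" feel might be
timer-wakeup jitter measured while the usual application mix runs; the
sketch below is only the idea, with an arbitrary 10ms tick and sample
count, not anything measured from xgalaga itself:

    # Sketch: measure how late periodic wakeups arrive while background
    # load runs. Large or spiky overshoots roughly track interactive lag.
    import time

    TICK = 0.010        # ask to wake every 10 ms (arbitrary choice)
    SAMPLES = 1000      # arbitrary sample count

    overshoots = []
    for _ in range(SAMPLES):
        start = time.monotonic()
        time.sleep(TICK)
        overshoots.append(time.monotonic() - start - TICK)

    overshoots.sort()
    print("worst wakeup overshoot: %.3f ms" % (overshoots[-1] * 1e3))
    print("99th percentile:        %.3f ms" % (overshoots[int(0.99 * SAMPLES)] * 1e3))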

What kind of storage are you running on? What filesystem? If you're still
hitting swap, are you using a swap file or a swap partition?

The system I mostly referenced rarely hits swap in days of uptime. It's the keyboard latency while playing the game that I try to tune away while other codes are running. I try very hard to keep codes from swapping out, because ultimately I'm most interested in clusters that keep everything running in memory, a.k.a. ultimate utilization of Apache Spark and other "in-memory" techniques.


The combined codes running simultaneously never hit the HD (no swapping), but there is still keyboard lag. Not that it actually affects the running codes to any appreciable degree, but it is a test I run so that the cluster nodes will benefit from staying quickly (low-latency) attentive to interactions with the cluster master processes, regardless of the workloads on the nodes. Sure, it's not totally accurate, but so far this approach is pretty darn close. It's not part of this conversation (on VM etc.), but ultimately getting this right solves one of the biggest problems in building any cluster: workload invocation, shedding, and management to optimize resource utilization, regardless of the orchestration(s) used to manage the nodes. Swapping to disk is verboten in my (ultimate) goals and target scenarios.

No worries, you have given me enough info and ideas to move forward with testing and tuning. I'm going to evolve these into more precisely controlled and monitored experiments, noting exact hardware differences; that should complete the tuning of the memory-management settings within acceptable confines. Then I'll automate it for later checking on cluster test runs with various hardware setups. Eventually these tests will be extended to a variety of memory and storage hardware, once the techniques are automated.
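
A rough sketch of that kind of sweep: try a few candidate values of
vm.dirty_background_bytes, run the same workload at each, and log the
timings for later correlation. The dd invocation and file paths below are
only placeholders for whatever the real node workload is:

    # Sketch: time a fixed write workload at each candidate setting and
    # record the results in a CSV. Needs root for the sysctl writes.
    import csv, subprocess, time

    MiB = 1024 * 1024
    CANDIDATES = [16 * MiB, 64 * MiB, 256 * MiB]   # arbitrary sweep points

    WORKLOAD = ["dd", "if=/dev/zero", "of=/tmp/sweep.dat",
                "bs=1M", "count=512", "conv=fsync"]   # placeholder workload

    with open("sweep_results.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["dirty_background_bytes", "seconds"])
        for value in CANDIDATES:
            with open("/proc/sys/vm/dirty_background_bytes", "w") as f:
                f.write(str(value))
            start = time.monotonic()
            subprocess.run(WORKLOAD, check=True,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            writer.writerow([value, round(time.monotonic() - start, 3)])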


Perhaps Zabbix plus a TSDB can get me further down the pathway.
Time-sequenced and analyzed data is overkill for this (xgalaga) test,
but those coalesced test vectors will be most useful for me as I seek a
Gentoo-centric pathway to low-latency clusters (on bare metal).

If you're looking to avoid Zabbix interfering with your performance, you'll
want the Zabbix server and web interface on a machine separate from the
machines you're trying to optimize.

Agreed.

Thanks Mike,
James

