Reviewed by: Matthew Ahrens mahr...@delphix.com
Reviewed by: George Wilson george.wil...@delphix.com
Reviewed by: Serapheim Dimitropoulos serapheim.dimi...@delphix.com
We parallelize the allocation process by creating the concept of "allocators".
There are a certain number of allocators per metaslab group, defined by the
value of a tunable at pool open time. Each allocator for a given metaslab
group has up to two active metaslabs: one "primary" and one "secondary". The
primary and secondary weights mean the same thing they did in the
pre-allocator world: primary metaslabs are used for most allocations, and
secondary metaslabs are used for ditto blocks allocated in the same metaslab group.
There is also the CLAIM weight, which has been separated out from the other
weights, but that is less important to understanding the patch. The active
metaslabs for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be selected
for use by other allocators searching for new metaslabs unless all the passive
metaslabs are unsuitable for allocations. If that does happen, the allocators
will "steal" from each other to ensure that IOs don't fail until there is truly
no space left to perform allocations.
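As a rough illustration, the per-allocator primary/secondary slots can be
sketched as below. This is a simplified assumption of the patch's data layout:
the names (`mg_primary`, `mg_secondary`, `ms_allocator`) and the fixed
allocator count are illustrative, not the actual illumos structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ALLOCATORS 4 /* assumed value of the pool-open tunable */

typedef struct metaslab {
	int ms_id;
	int ms_allocator;	/* which allocator owns it, or -1 if passive */
	int ms_primary;		/* nonzero if held as that allocator's primary */
} metaslab_t;

typedef struct metaslab_group {
	/* Each allocator holds at most one primary and one secondary. */
	metaslab_t *mg_primary[MAX_ALLOCATORS];
	metaslab_t *mg_secondary[MAX_ALLOCATORS];
} metaslab_group_t;

/*
 * Activate a metaslab for one allocator: the primary slot serves normal
 * allocations, the secondary slot serves ditto copies placed in the same
 * metaslab group. An active metaslab would also be moved to the back of
 * the group's metaslab tree so other allocators skip it while passive
 * metaslabs remain usable.
 */
static void
metaslab_activate_for(metaslab_group_t *mg, metaslab_t *ms,
    int allocator, int primary)
{
	ms->ms_allocator = allocator;
	ms->ms_primary = primary;
	if (primary)
		mg->mg_primary[allocator] = ms;
	else
		mg->mg_secondary[allocator] = ms;
}
```

"Stealing" then amounts to one allocator passivating another allocator's
active metaslab (clearing its slot) when no passive metaslab is suitable, so
allocations only fail when the group is truly out of space.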
In addition, the alloc queue for each metaslab group has been broken into a
separate queue for each allocator. We don't want to dramatically increase the
number of inflight IOs on low-end systems, because it can significantly
increase txg times. On the other hand, we want to ensure that there are enough
IOs for each allocator to allow for good coalescing before sending the IOs to
the disk. As a result, we take a compromise path: each allocator's alloc queue
max depth starts at a fixed value at the beginning of every txg, and every time
an IO completes, we increase the max depth. This should balance the two
concerns without dramatically increasing complexity.
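The ramp described above can be sketched as follows. The initial depth, the
increment of one per completed IO, and all field names here are illustrative
assumptions, not the tunables or identifiers from the actual patch.

```c
#include <assert.h>

/* Assumed starting cap; the real value is a tunable. */
#define AQ_INITIAL_MAX_DEPTH 32

typedef struct alloc_queue {
	int aq_depth;		/* allocation IOs currently in flight */
	int aq_max_depth;	/* current cap; grows as IOs complete */
} alloc_queue_t;

/* At txg open, every allocator restarts at the low, safe cap. */
static void
aq_txg_reset(alloc_queue_t *aq)
{
	aq->aq_max_depth = AQ_INITIAL_MAX_DEPTH;
}

/* May this allocator issue another allocation IO right now? */
static int
aq_can_issue(const alloc_queue_t *aq)
{
	return (aq->aq_depth < aq->aq_max_depth);
}

static void
aq_issue(alloc_queue_t *aq)
{
	aq->aq_depth++;
}

/*
 * Each completion raises the cap, so fast storage earns a deeper queue
 * (better coalescing) within a txg, while a slow low-end system stays
 * near the initial depth and avoids inflating txg times.
 */
static void
aq_io_done(alloc_queue_t *aq)
{
	aq->aq_depth--;
	aq->aq_max_depth++;
}
```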
We also parallelize the spa_alloc_tree and spa_alloc_lock, which exhibit very
similar contention when selecting IOs to allocate. This parallelization uses
the same allocator scheme as metaslab selection.
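One common way to spread IOs across per-allocator trees and locks is to hash
some stable identity of the IO to an allocator index, so related IOs tend to
land on the same (lightly contended) structure. The sketch below is a
hypothetical illustration: the hash inputs, constants, and function name are
assumptions, not the actual selection logic in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SPA_ALLOCATORS 4 /* assumed value of the pool-open tunable */

/*
 * Map an IO (identified here, for illustration, by its objset and
 * object numbers) to one of the per-allocator alloc trees/locks.
 * Multiplying by large odd constants before XOR-folding spreads
 * consecutive object numbers across allocators.
 */
static uint32_t
zio_pick_allocator(uint64_t objset, uint64_t object)
{
	uint64_t h;

	h = (objset * 2654435761ULL) ^ (object * 0x9E3779B97F4A7C15ULL);
	return ((uint32_t)(h % SPA_ALLOCATORS));
}
```

The index returned would select both the metaslab-group allocator slot and
the matching alloc tree and lock, so one busy thread never serializes all
allocating writers behind a single lock.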
## Performance results
Performance improvements from this change can vary significantly based on the
number of CPUs in the system, whether or not the system has a NUMA
architecture, the speed of the drives, the values for the various tunables, and
the workload being performed. For an fio async sequential write workload on a
24-core NUMA system with 256 GB of RAM and eight 128 GB SSDs, there is a
roughly 25% performance improvement.
## Future work
Analysis of the performance of the system with this patch applied shows that a
significant new bottleneck is the vdev disk queues, which also need to be
parallelized. A prototype of this change showed a further performance
improvement, but more work is needed to verify its stability before it is
ready to be upstreamed.
-- Commit Summary --
* 9112 Improve allocation performance on high-end systems
-- File Changes --
M usr/src/cmd/mdb/common/modules/zfs/zfs.c (11)
M usr/src/test/zfs-tests/tests/functional/slog/slog_014_pos.ksh (14)
M usr/src/uts/common/fs/zfs/metaslab.c (513)
M usr/src/uts/common/fs/zfs/spa.c (37)
M usr/src/uts/common/fs/zfs/spa_misc.c (28)
M usr/src/uts/common/fs/zfs/sys/metaslab.h (18)
M usr/src/uts/common/fs/zfs/sys/metaslab_impl.h (79)
M usr/src/uts/common/fs/zfs/sys/spa_impl.h (12)
M usr/src/uts/common/fs/zfs/sys/vdev_impl.h (3)
M usr/src/uts/common/fs/zfs/sys/zio.h (7)
M usr/src/uts/common/fs/zfs/vdev.c (4)
M usr/src/uts/common/fs/zfs/vdev_queue.c (11)
M usr/src/uts/common/fs/zfs/vdev_removal.c (9)
M usr/src/uts/common/fs/zfs/zil.c (8)
M usr/src/uts/common/fs/zfs/zio.c (86)