In my environment I have large, sequential I/O requests (e.g. 4 MB)
split into much smaller requests (e.g. 8 KB) and passed to a kernel
module, which in turn submits them to an NVMe device. This splitting
of requests seems to hurt performance badly. For architectural
reasons I cannot prevent the issuer from splitting the requests, so
I have to find a way to reconstruct the original, large request.
The only way I can see to achieve this is to implement the merging
directly in my kernel module. However, before proceeding with the
implementation, I'd like to see some proof that merging the requests
actually solves the performance problem. The trouble is that I can't
think of an easy way to test this: the NVMe device doesn't have a
block queue interface, and I can't find a virtual block layer
(e.g. dm-linear, mdadm linear, loopback) that merges requests for me.
Is there any way I can effectively merge these requests?
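
To make concrete what I mean by "merging", here is a minimal user-space
sketch in C of the coalescing step. It is only an illustration, not
kernel code: struct io_req and merge_sequential() are hypothetical names
I made up, and it assumes the small requests arrive in submission order
and are exactly back-to-back on the device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request descriptor; not an actual kernel structure. */
struct io_req {
    uint64_t offset;  /* starting byte offset on the device */
    uint32_t len;     /* length in bytes */
};

/*
 * Coalesce runs of back-to-back requests in place.
 * Assumes reqs[] is in submission order. A request is folded into the
 * previous one only if it starts exactly where that one ends and the
 * combined length stays within max_len bytes.
 * Returns the number of requests left after merging.
 */
size_t merge_sequential(struct io_req *reqs, size_t n, uint32_t max_len)
{
    if (n == 0)
        return 0;

    size_t out = 0;  /* index of the request currently being grown */
    for (size_t i = 1; i < n; i++) {
        struct io_req *cur = &reqs[out];
        int contiguous = reqs[i].offset == cur->offset + cur->len;
        /* Check cur->len first so the subtraction cannot wrap around. */
        int fits = cur->len <= max_len &&
                   reqs[i].len <= max_len - cur->len;

        if (contiguous && fits)
            cur->len += reqs[i].len;   /* extend the current request */
        else
            reqs[++out] = reqs[i];     /* start a new merged request */
    }
    return out + 1;
}
```

With 512 sequential 8 KB requests and a 4 MB cap, this would collapse
them into a single 4 MB request; any gap in the offsets starts a new one.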