On 2017-12-28 13:32, Veit Wahlich wrote:
> Hi Christoph,
>
> I do not have experience with the precise functioning of LXC disk storage,
> but I assume that every operation that could cause oos applies to every
> application running inside the LXC containers, too.
>
> A common cause that I suspect here is opening a file (or block device)
> using O_DIRECT. This flag is used to reduce I/O latency and especially to
> bypass the page cache, but it also allows buffers to be modified in-flight
> while they are processed by e.g. DRBD. So not only DRBD is affected by
> this, but also software RAID such as mdraid, dmraid or lvmraid, and I bet
> even block caching such as bcache.
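To make the in-flight scenario concrete, here is a minimal, illustrative sketch (not from the thread; the function name and return convention are invented) of how an application issues an O_DIRECT write. The comments mark the window in which a concurrent modification of the buffer could make the replicas diverge:

```python
import mmap
import os


def write_with_o_direct(path: str, data: bytes) -> str:
    """Write `data` to `path` using O_DIRECT (Linux only).

    Illustrative sketch: returns "ok" on success or "unsupported" if the
    filesystem rejects O_DIRECT (tmpfs does, for example).
    """
    # O_DIRECT I/O must be a multiple of the logical block size.
    assert len(data) % 512 == 0, "O_DIRECT I/O must be block-aligned"
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    except OSError:
        return "unsupported"
    try:
        # O_DIRECT also requires a memory-aligned buffer; an anonymous
        # mmap is page-aligned, which satisfies typical block sizes.
        buf = mmap.mmap(-1, len(data))
        buf[:] = data
        # The kernel DMAs straight out of `buf` here, bypassing the page
        # cache. If another thread modified `buf` while DRBD was still
        # shipping this write to the peer, primary and secondary could
        # end up with different data -- the in-flight modification
        # described above.
        os.write(fd, buf)
    finally:
        os.close(fd)
    return "ok"


print(write_with_o_direct("odirect-demo.bin", b"A" * 4096))
```

With buffered I/O the kernel copies the data into the page cache first, so a later change to the application buffer is harmless; O_DIRECT removes that copy, which is exactly why the buffer must stay stable until the write completes.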
Are you serious? Can someone from LINBIT please comment on this? This would
basically mean that DRBD is useless whenever an application opens files with
O_DIRECT!? How could a fast path to user space render the replication of the
underlying block device useless?

> In most cases O_DIRECT is used by applications such as some DBMS to avoid
> caching by the kernel, as they implement their own cache or do not want the
> kernel to sacrifice memory on page caching, since the data written will not
> be used again.
>
> So my recommendation is to check your logs/monitoring to see whether the
> oos has occurred repeatedly only on certain containers, and then inspect
> the configuration of the applications running inside for the use of
> O_DIRECT (which can usually be disabled).
> If it has been occurring on all your containers, I would suspect your LXC
> configuration itself as the cause, such as an overlay filesystem or
> container image.

Checking thousands of applications in hundreds of containers is NOT an option.

Regards,
Christoph

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user