On 2017-12-28 13:32, Veit Wahlich wrote:
> Hi Christoph, 
> 
> I do not have experience with the precise functioning of LXC disk storage, 
> but I assume that any operation that can cause out-of-sync (oos) blocks can 
> also be triggered by applications running inside the LXC containers.
> 
> A common cause, and the one I suspect here, is opening a file (or block 
> device) with O_DIRECT. This flag is used to reduce I/O latency, especially 
> by bypassing the page cache, but it also allows buffers to be modified 
> in-flight while they are still being processed by e.g. DRBD. So not only 
> DRBD is affected by this, but also software RAID such as mdraid, dmraid or 
> lvmraid, and I bet even block caching such as bcache.
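> 
> To make the hazard concrete, here is a minimal, untested sketch (Python 3 
> on Linux; the path below is made up, and the target filesystem must 
> support O_DIRECT):
> 
>     import mmap
>     import os
>     import threading
> 
>     PATH = "/mnt/drbd0/testfile"  # hypothetical file on a DRBD-backed fs
> 
>     # O_DIRECT requires aligned buffers; an anonymous mmap is page-aligned.
>     buf = mmap.mmap(-1, 4096)
>     buf[:] = b"A" * 4096
> 
>     fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
> 
>     def scribble():
>         # Modify the buffer while the write may still be in flight. With
>         # O_DIRECT the kernel reads straight out of this user buffer, so
>         # the local disk and the DRBD peer can end up with different bytes.
>         buf[:8] = b"B" * 8
> 
>     t = threading.Thread(target=scribble)
>     t.start()
>     os.write(fd, buf)  # direct write from the (possibly changing) buffer
>     t.join()
>     os.close(fd)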

Are you serious?

Can someone from LINBIT please comment on this?

This would basically mean that DRBD is useless whenever an application
opens files with O_DIRECT!?

How could a fast path to user space render the replication of the
underlying block device useless?


> In most cases O_DIRECT is used by applications such as some DBMSes to avoid 
> caching by the kernel, either because they implement their own cache or 
> because they do not want the kernel to sacrifice memory on page-caching 
> data that will not be read again.
> 
> So my recommendation is to check your logs/monitoring to see whether the 
> oos has occurred repeatedly only on certain containers, and then inspect 
> the configuration of the applications running inside them for the use of 
> O_DIRECT (which can usually be disabled); see the detection sketch below. 
> If it has been occurring on all your containers, I would instead suspect 
> your LXC configuration itself as the cause, such as an overlay filesystem 
> or the container image. 
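> 
> Rather than auditing every application's configuration by hand, you could 
> scan /proc for descriptors that are currently open with O_DIRECT. A rough 
> sketch (untested, Linux-only, run as root; the flags field in 
> /proc/<pid>/fdinfo/<fd> holds the open flags in octal):
> 
>     import os
> 
>     def direct_io_fds():
>         # Yield (pid, fd, target) for descriptors opened with O_DIRECT.
>         for pid in filter(str.isdigit, os.listdir("/proc")):
>             fdinfo = "/proc/%s/fdinfo" % pid
>             try:
>                 fds = os.listdir(fdinfo)
>             except OSError:  # process exited or access denied
>                 continue
>             for fd in fds:
>                 try:
>                     flags = 0
>                     with open("%s/%s" % (fdinfo, fd)) as f:
>                         for line in f:
>                             if line.startswith("flags:"):
>                                 flags = int(line.split()[1], 8)  # octal
>                                 break
>                     if flags & os.O_DIRECT:
>                         yield pid, fd, os.readlink(
>                             "/proc/%s/fd/%s" % (pid, fd))
>                 except OSError:
>                     continue
> 
>     for pid, fd, target in direct_io_fds():
>         print(pid, fd, target)
> 
> That would at least tell you which containers and processes to look at 
> first.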

Checking 1000s of applications in 100s of containers is NOT an option.


Regards, Christoph