On Thu, Jun 13, 2019 at 05:21:12PM -0400, Kent Overstreet wrote:
> On Thu, Jun 13, 2019 at 03:13:40PM -0600, Andreas Dilger wrote:
> > There are definitely workloads that require multiple threads doing
> > non-overlapping writes to a single file in HPC.  This is becoming an
> > increasingly common problem as the number of cores on a single client
> > increases, since there is typically one thread per core trying to
> > write to a shared file.  Using multiple files (one per core) is
> > possible, but that has file management issues for users when there
> > are a million cores running on the same job/file (obviously not on
> > the same client node) dumping data every hour.
> 
> Mixed buffered and O_DIRECT though? That profile looks like just
> buffered IO to me.
> 
> > We were just looking at this exact problem last week, and most of the
> > threads are spinning in grab_cache_page_nowait->add_to_page_cache_lru()
> > and set_page_dirty() when writing at 1.9GB/s when they could be
> > writing at 5.8GB/s (when threads are writing O_DIRECT instead of
> > buffered).  Flame graph is attached for the 16-thread case, but
> > high-end systems today easily have 2-4x that many cores.
> 
> Yeah I've been spending some time on buffered IO performance too - 4k page
> overhead is a killer.
> 
> bcachefs has a buffered write path that looks up multiple pages at a
> time and locks them, and then copies the data to all the pages at once
> (I stole the idea from btrfs). It was a very significant performance
> increase.
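
For illustration, here is a rough user-space model of that batched
approach - lock a run of consecutive page slots up front, do one copy
pass over all of them, then unlock. It is only a sketch; toy_page,
batched_write and the sizes are made up, this is not the bcachefs or
btrfs code:

/*
 * Toy model of a batched buffered write: take the locks for the whole
 * range first (in ascending index order), copy once, then unlock.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE	4096
#define NR_PAGES	1024

struct toy_page {
	pthread_mutex_t lock;		/* stand-in for the page lock */
	int		dirty;
	char		data[PAGE_SIZE];
};

static struct toy_page cache[NR_PAGES];

static void batched_write(const char *buf, size_t len, size_t pos)
{
	size_t first = pos / PAGE_SIZE;
	size_t last  = (pos + len - 1) / PAGE_SIZE;
	size_t i, copied = 0;

	/* Lock the whole batch first, lowest index to highest. */
	for (i = first; i <= last; i++)
		pthread_mutex_lock(&cache[i].lock);

	/* Single copy pass over all the locked pages. */
	for (i = first; i <= last; i++) {
		size_t off   = (i == first) ? pos % PAGE_SIZE : 0;
		size_t chunk = PAGE_SIZE - off;

		if (chunk > len - copied)
			chunk = len - copied;
		memcpy(cache[i].data + off, buf + copied, chunk);
		cache[i].dirty = 1;
		copied += chunk;
	}

	for (i = first; i <= last; i++)
		pthread_mutex_unlock(&cache[i].lock);
}

int main(void)
{
	static char buf[3 * PAGE_SIZE];
	size_t i;

	for (i = 0; i < NR_PAGES; i++)
		pthread_mutex_init(&cache[i].lock, NULL);

	memset(buf, 'x', sizeof(buf));
	batched_write(buf, sizeof(buf), 100);	/* spans four page slots */
	printf("copied %zu bytes under one batch of locks\n", sizeof(buf));
	return 0;
}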

Careful with that - locking multiple pages is also a deadlock vector
that triggers unexpectedly when something conspires to lock pages in
non-ascending order. e.g.

64081362e8ff mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
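
When the batch genuinely cannot be taken in ascending order, one common
way to dodge the ABBA hazard is to sleep only on the first page lock and
trylock the rest, backing off completely if any trylock fails. A toy
user-space sketch of that pattern follows (page_lock, trylock_batch etc.
are invented for illustration; this is not the fix in the commit above):

#include <pthread.h>
#include <stdio.h>

#define NR_PAGES	16

static pthread_mutex_t page_lock[NR_PAGES];

/* Lock pages [first, last]; returns 0 on success, -1 if the caller must retry. */
static int trylock_batch(int first, int last)
{
	int i;

	/* Blocking here is safe: we hold no other page locks yet. */
	pthread_mutex_lock(&page_lock[first]);
	for (i = first + 1; i <= last; i++) {
		if (pthread_mutex_trylock(&page_lock[i]) != 0) {
			/* Never sleep on a lock while holding others: drop and retry. */
			while (--i >= first)
				pthread_mutex_unlock(&page_lock[i]);
			return -1;
		}
	}
	return 0;
}

static void unlock_batch(int first, int last)
{
	int i;

	for (i = first; i <= last; i++)
		pthread_mutex_unlock(&page_lock[i]);
}

int main(void)
{
	int i;

	for (i = 0; i < NR_PAGES; i++)
		pthread_mutex_init(&page_lock[i], NULL);

	while (trylock_batch(4, 7) != 0)
		;	/* back off and retry until the whole batch is held */
	printf("locked pages 4-7 without an ABBA window\n");
	unlock_batch(4, 7);
	return 0;
}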

The fs/iomap.c code avoids this problem by mapping the IO first,
then iterating pages one at a time until the mapping is consumed,
then it gets another mapping. It also avoids needing to put a page
array on stack....
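
Roughly, that structure looks like the loop below - get one mapping for
as much of the write as possible, walk it a page at a time, and only go
back to the filesystem for another mapping once the current one is
consumed. toy_get_mapping() and toy_write_one_page() are invented
stand-ins for illustration, not the iomap API:

#include <stdio.h>
#include <stddef.h>

#define PAGE_SIZE	4096

/* Pretend the filesystem maps at most 64k of the file per call. */
static size_t toy_get_mapping(size_t pos, size_t len)
{
	size_t max = 16 * PAGE_SIZE;

	return len < max ? len : max;
}

/* "Copy" into a single page; returns how many bytes fit in this page. */
static size_t toy_write_one_page(size_t pos, size_t len)
{
	size_t chunk = PAGE_SIZE - pos % PAGE_SIZE;

	return len < chunk ? len : chunk;
}

static void toy_buffered_write(size_t pos, size_t len)
{
	while (len) {
		/* Map the IO first ... */
		size_t mapped = toy_get_mapping(pos, len);

		/* ... then iterate pages one at a time until the mapping is consumed. */
		while (mapped) {
			size_t n = toy_write_one_page(pos, mapped);

			pos    += n;
			len    -= n;
			mapped -= n;
		}
		printf("consumed one mapping, %zu bytes left\n", len);
	}
}

int main(void)
{
	/* A 200KiB write starting 100 bytes into the file. */
	toy_buffered_write(100, 200 * 1024);
	return 0;
}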

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
