On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
> On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying <ying.hu...@intel.com> wrote:
> >
> > Here it is,
>
> Thanks.
>
> Appended is a munged "after" list, with the "before" values in
> parenthesis. It actually looks fairly similar.
>
> The biggest difference is that we have "mark_page_accessed()" show up
> after, and not before. There was also a lot of LRU noise in the
> non-profile data. I wonder if that is the reason here: the old model
> of using generic_perform_write/block_page_mkwrite didn't mark the
> pages accessed, and now with iomap_file_buffered_write() they get
> marked as active and that screws up the LRU list, and makes us not
> flush out the dirty pages well (because they are seen as active and
> not good for writeback), and then you get bad memory use.
>
> I'm not seeing anything that looks like locking-related.
Not in that profile. I've been doing some local testing inside a 4-node
fake-numa 16p/16GB RAM VM to see what I can find. I'm yet to work out how
I can trigger a profile like the one that was reported (I really need to
see the event traces), but in the meantime I found this....

Doing a large sequential single-threaded buffered write using a 4k buffer
(so a single page per syscall, to make the XFS IO path allocator behave
the same way as in 4.7), I'm seeing a CPU profile that indicates we have
a potential mapping->tree_lock issue:

# xfs_io -f -c "truncate 0" -c "pwrite 0 47g" /mnt/scratch/fooey
wrote 50465865728/50465865728 bytes at offset 0
47.000 GiB, 12320768 ops; 0:01:36.00 (499.418 MiB/sec and 127850.9132 ops/sec)

....
  24.15%  [kernel]  [k] _raw_spin_unlock_irqrestore
   9.67%  [kernel]  [k] copy_user_generic_string
   5.64%  [kernel]  [k] _raw_spin_unlock_irq
   3.34%  [kernel]  [k] get_page_from_freelist
   2.57%  [kernel]  [k] mark_page_accessed
   2.45%  [kernel]  [k] do_raw_spin_lock
   1.83%  [kernel]  [k] shrink_page_list
   1.70%  [kernel]  [k] free_hot_cold_page
   1.26%  [kernel]  [k] xfs_do_writepage
   1.21%  [kernel]  [k] __radix_tree_lookup
   1.20%  [kernel]  [k] __wake_up_bit
   0.99%  [kernel]  [k] __block_write_begin_int
   0.95%  [kernel]  [k] find_get_pages_tag
   0.92%  [kernel]  [k] cancel_dirty_page
   0.89%  [kernel]  [k] unlock_page
   0.87%  [kernel]  [k] clear_page_dirty_for_io
   0.85%  [kernel]  [k] xfs_bmap_worst_indlen
   0.84%  [kernel]  [k] xfs_file_buffered_aio_write
   0.81%  [kernel]  [k] delay_tsc
   0.78%  [kernel]  [k] node_dirty_ok
   0.77%  [kernel]  [k] up_write
   0.74%  [kernel]  [k] ___might_sleep
   0.73%  [kernel]  [k] xfs_bmap_add_extent_hole_delay
   0.72%  [kernel]  [k] __fget_light
   0.67%  [kernel]  [k] add_to_page_cache_lru
   0.67%  [kernel]  [k] __slab_free
   0.63%  [kernel]  [k] drop_buffers
   0.59%  [kernel]  [k] down_write
   0.59%  [kernel]  [k] kmem_cache_alloc
   0.58%  [kernel]  [k] iomap_write_actor
   0.53%  [kernel]  [k] page_mapping
   0.52%  [kernel]  [k] entry_SYSCALL_64_fastpath
   0.52%  [kernel]  [k] __mark_inode_dirty
   0.51%  [kernel]  [k] __block_commit_write.isra.30
   0.51%  [kernel]  [k] xfs_file_write_iter
   0.49%  [kernel]  [k] mark_buffer_async_write
   0.47%  [kernel]  [k] balance_dirty_pages_ratelimited
   0.47%  [kernel]  [k] xfs_count_page_state
   0.47%  [kernel]  [k] page_evictable
   0.46%  [kernel]  [k] xfs_vm_releasepage
   0.46%  [kernel]  [k] xfs_iomap_write_delay
   0.46%  [kernel]  [k] do_raw_spin_unlock
   0.44%  [kernel]  [k] xfs_file_iomap_begin
   0.44%  [kernel]  [k] xfs_map_at_offset
   0.42%  [kernel]  [k] xfs_iext_bno_to_ext

There's very little XFS showing up near the top of the profile; it's all
page cache, writeback and spin lock traffic. This is a dead give-away as
to the lock being contended:

-   33.30%     0.01%  [kernel]  [k] kswapd
   - 4.67% kswapd
      - 4.69% shrink_node
         - 4.77% shrink_node_memcg.isra.75
            - 7.38% shrink_inactive_list
               - 6.70% shrink_page_list
                  - 20.02% __remove_mapping
                       19.90% _raw_spin_unlock_irqrestore

I don't think that this is the same as what aim7 is triggering, as there
are no XFS write()-path allocation functions near the top of the profile
to speak of. Still, I don't recall seeing this before...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com