On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote: > > had to do any blocking I/O during vmtruncate before, now we have to. > > I really suspect we don't really have to, and that it would be better to > just fix the code that does that.
I'll let you discuss with Christoph and Robin about it. The moment I heard the schedule inside ->invalidate_page() requirement I reacted the same way you did. But I don't see any other real solution for XPMEM other than spin-looping for ages halting the scheduler for ages, while the ack is received from the network device. But mm_lock is required even without XPMEM. And srcu is also required without XPMEM to allow ->release to schedule (however downgrading srcu to rcu will result in a very small patch, srcu and rcu are about the same with a kernel supporting preempt=y like 2.6.26). > I literally think that mm_lock() is an unbelievable piece of utter and > horrible CRAP. > > There's simply no excuse for code like that. I think it's a great smp scalability optimization over the global lock you're proposing below. > No, the simple solution is to just make up a whole new upper-level lock, > and get that lock *first*. You can then take all the multiple locks at a > lower level in any order you damn well please. Unfortunately the lock you're talking about would be: static spinlock_t global_lock = ... There's no way to make it more granular. So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock we'd need to take that extremely wide spinlock first (and even worse, later it would become a rwsem when XPMEM is selected making the VM even slower than it already becomes when XPMEM support is selected at compile time). > And yes, it's one more lock, and yes, it serializes stuff, but: > > - that code had better not be critical anyway, because if it was, then > the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_ mmu_notifier_register can take ages. No problem. > - parallelism is overrated: it doesn't matter one effing _whit_ if > something is a hundred times more parallel, if it's also a hundred > times *SLOWER*. mmu_notifier_register is fine to be hundred times slower (preempt-rt will turn all locks in spinlocks so no problem). > And here's an admission that I lied: it wasn't *all* clearly crap. I did > like one part, namely list_del_init_rcu(), but that one should have been > in a separate patch. I'll happily apply that one. Sure, I'll split it from the rest if the mmu-notifier-core isn't merged. My objective has been: 1) add zero overhead to the VM before anybody starts a VM with kvm and still zero overhead for all other tasks except the task where the VM runs. The only exception is the unlikely(!mm->mmu_notifier_mm) check that is optimized away too when CONFIG_KVM=n. And even for that check my invalidate_page reduces the number of branches to the absolute minimum possible. 2) avoid any new cacheline collision in the fast paths to allow numa systems not to nearly-crash (mm->mmu_notifier_mm will be shared and never written, except during the first mmu_notifier_register) 3) avoid any risk to introduce regressions in 2.6.26 (the patch must be obviously safe). Even if mm_lock would be a bad idea like you say, it's order of magnitude safer even if entirely broken then messing with the VM core locking in 2.6.26. mm_lock (or whatever name you like to give it, I admit mm_lock may not be worrysome enough for people to have an idea to call it in a fast path) is going to be the real deal for the long term to allow mmu_notifier_register to serialize against invalidate_page_start/end. If I fail in 2.6.26 I'll offer maintainership to Christoph as promised, and you'll find him pushing for mm_lock to be merged (as XPMEM/GRU aren't technologies running on cellphones where your global wide spinlocks is optimized away at compile time, and he also has to deal with XPMEM where such a spinlock would need to become a rwsem as the anon_vma->sem has to be taken after it), but let's assume you're right entirely right here that mm_lock is going to be dropped and there's a better way: it's still a fine solution for 2.6.26. And if you prefer I can move the whole mm_lock() from mmap.c/mm.h to mmu_notifier.[ch] so you don't get any pollution in the core VM, and mm_lock will be invisible to everything but anybody calling mmu_notifier_register() then and it will be trivial to remove later if you really want to add a global spinlock as there's no way to be more granular than a _global_ numa-wide spinlock taken before any i_mmap_lock/anon_vma->lock, without my mm_lock. ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone _______________________________________________ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel