On Mon, 2007-08-13 at 09:43 +0200, Nick Piggin wrote:
> On Fri, Aug 10, 2007 at 05:08:18PM -0400, Lee Schermerhorn wrote:
> > On Wed, 2007-08-08 at 16:25 -0400, Lee Schermerhorn wrote:
> > > On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > > > Hi,
> > > >
> > > > Just got a bit of time to take another look at the replicated
> > > > pagecache patch. The nopage vs invalidate race and
> > > > clear_page_dirty_for_io fixes give me more confidence in the
> > > > locking now; the new ->fault API makes MAP_SHARED write faults
> > > > much more efficient; and a few bugs were found and fixed.
> > > >
> > > > More stats were added: *repl* in /proc/vmstat. Survives some
> > > > kbuilding tests...
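[Aside: the new counters are ordinary name/value lines in /proc/vmstat,
so something as trivial as the snippet below is enough to snapshot them.
Nothing here assumes particular counter names--just the "repl"
substring.]

/*
 * Print every /proc/vmstat line containing "repl".  The exact
 * counter names are whatever the replication patch exports.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "repl"))
			fputs(line, stdout);
	fclose(f);
	return 0;
}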
<snip>

> Hi Lee,
>
> Am sick with the flu for the past few days, so I haven't done much
> more work here, but I'll just add some (not very useful) comments...
>
> The get_page_from_freelist hang is quite strange. It would be
> zone->lock, which shouldn't have too much contention...
>
> Replication may be putting more stress on some locks. It will cause
> more tlb flushing that cannot be batched well, which could cause the
> call_lock to get hotter. Then i_mmap_lock is held over tlb flushing,
> so it will inherit the latency from call_lock. (If this is the case,
> we could potentially extend the tlb flushing API slightly to cope
> better with unmapping of pages from multiple mm's, but that comes way
> down the track when/if replication proves itself!)
>
> tlb flushing AFAIKS should not do the IPI unless it is dealing with a
> multithreaded mm... does usex use threads?

Yes. Apparently, there are some tests, perhaps some of the /usr/bin
apps that get run repeatedly, that are multi-threaded. This job mix
caught a number of races in my auto-migration patches when
multi-threaded tasks race in the page fault paths. More below...

> > I should note that I was trying to unmap all mappings to the
> > file-backed pages on internode task migration, instead of just the
> > current task's ptes. However, I was only attempting this on pages
> > with mapcount <= 4. So, I don't think I was looping trying to unmap
> > pages with mapcounts of several tens--such as I see on some page
> > cache pages in my traces.
>
> Replication teardown would still have to unmap all... but that
> shouldn't particularly be any worse than, say, page reclaim (except I
> guess that it could occur more often).
>
> > Today, after rebasing to 23-rc2-mm2, I added a patch to unmap only
> > the current task's ptes for ALL !anon pages, regardless of mapcount.
> > I've started the test again and will let it run over the weekend--or
> > as long as it stays up, whichever is shorter :-).
>
> Ah, so it does eventually die? Any hints of why?

No, it doesn't die--as in panic. I was just commenting that I'd leave
it running...

However [:-(], it DID hang again. The test window said that the tests
ran for 62h:28m before the screen stopped updating. In another window,
I was running a script to snapshot the replication and file-page
vmstats, along with a timestamp, every 10 minutes. That stopped
reporting stats at about 7:30am on Saturday--about 14h:30m into the
test. It still wrote the timestamps [date command] until around 7am
this morning [Monday]--or ~62 hours into the test. So, I do have ~14
hours of replication stats that I can send you or plot up...

Re: the hang: again, the console was scrolling soft lockups
continuously. Checking the messages file, I see hangs in
copy_process(), smp_call_function [as in the previous test], vma_link
[from mmap], ...

I also see a number of NaT ["not a thing"] consumptions--an
ia64-specific error, probably an invalid pointer deref--in
swapin_readahead, which my patches hack. These might be the cause of
the fork/mmap hangs. I didn't see that in the 8-9 Aug runs, so it
might be a result of continued operation after other hangs/problems,
or a botch in the rebase to rc2-mm2. In any case, I have some work to
do there...

> > I put a tarball with the rebased series in the Replication directory
> > linked above, in case you're interested. I haven't added the patch
> > description for the new patch yet, but it's pretty simple. Maybe
> > even correct.
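For the curious, the new patch amounts to something like the sketch
below. This is from memory and simplified--migrate_unmap_one() is an
invented name, not the actual diff--but the point is that for
file-backed pages we now zap only the migrating task's ptes rather
than unmapping the page from every mm that maps it:

/*
 * Sketch only--not the actual patch.  On internode task migration,
 * drop just this task's mapping of a file-backed page.
 */
#include <linux/mm.h>

static void migrate_unmap_one(struct vm_area_struct *vma,
			      unsigned long address, struct page *page)
{
	/* anon pages still get the full unmap treatment */
	if (PageAnon(page))
		return;

	/* zap one page's worth of ptes in this mm only */
	zap_page_range(vma, address & PAGE_MASK, PAGE_SIZE, NULL);
}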
> > ----
> >
> > Unrelated to the lockups [I think]:
> >
> > I forgot to look before I rebooted, but earlier the previous
> > evening, I checked the vmstats and at that point [~1.5 hours into
> > the test] we had done ~4.88 million replications and ~4.8 million
> > "zaps" [collapse of replicated page]. That's around 98% zaps. Do we
> > need some filter in the fault path to reduce the "thrashing"--if
> > that's what I'm seeing?
>
> Yep. The current replication patch is very much only infrastructure
> at this stage (and is good for stress testing). I feel sure that
> heuristics and perhaps tunables would be needed to make the most of
> it.

Yeah. I have some ideas to try [rough sketch in the PS below]... At
the end of the 14.5 hours when it stopped dumping vmstats, we were at
~95% zaps.

> > A while back I took a look at the Virtual Iron page replication
> > patch. They had set VM_DENYWRITE when mapping shared executable
> > segments, and only replicated pages in those VMAs. Maybe DENYWRITE
> > isn't exactly what we want. Possibly set another flag for shared
> > executables, if we can detect them, and any shared mapping that has
> > no writable mappings?
>
> mapping_writably_mapped would be a good one to try. That may be too
> broad in some corner cases where we do want occasionally-written
> files or even parts of files to be replicated, but if we were ever to
> enable CONFIG_REPLICATION by default, I imagine
> mapping_writably_mapped would be the default heuristic.
>
> Still, I appreciate the testing of the "thrashing" case, because with
> the mapping_writably_mapped heuristic, it is likely that bugs could
> remain lurking even in production workloads on huge systems (because
> they will hardly ever get unreplicated).
>
> > I'll try to remember to check the replication statistics after the
> > currently running test. If the system stays up, that is. A quick
> > look < 10 minutes into the test shows that zaps are now ~84% of
> > replications. Also, ~47k replicated pages out of ~287k file pages.
>
> Yeah, I guess it can be a little misleading: as time approaches
> infinity, zaps will probably approach replications. But that doesn't
> tell you how long a replica stayed around and usefully fed CPUs with
> local memory...

I may be able to capture that info with a more invasive patch--e.g.,
add a timestamp to the page struct [second sketch in the PS]. I'll
think about it. And I'll keep you posted. Not sure how much time I'll
be able to dedicate to this patch stream. Got a few others I need to
get back to...

Later,
Lee
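PS: to make the above concrete, two rough sketches. First, the
mapping_writably_mapped filter--should_replicate() is an invented
name, and this is untested; mapping_writably_mapped() just tests
whether the file currently has any shared-writable mappings:

/*
 * Filter sketch: only consider replicating pagecache pages of files
 * with no shared-writable mappings at fault time.
 */
#include <linux/fs.h>
#include <linux/mm.h>

static inline int should_replicate(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	if (!mapping)
		return 0;	/* anon/swapcache: never replicate */
	return !mapping_writably_mapped(mapping);
}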
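Second, the timestamp idea--both the page->repl_stamp field and the
PGREPLLIFE vm event are inventions here; neither exists in the patch:

/*
 * Instrumentation sketch: stamp a replica at creation so the zap
 * path can record how long it lived.
 */
#include <linux/jiffies.h>
#include <linux/mm.h>
#include <linux/vmstat.h>

static inline void repl_stamp(struct page *page)
{
	page->repl_stamp = jiffies;		/* assumed new field */
}

static inline void repl_account_zap(struct page *page)
{
	unsigned long life = jiffies - page->repl_stamp;

	/* crude: accumulate total replica lifetime in jiffies */
	count_vm_events(PGREPLLIFE, life);	/* assumed new event */
}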