Re: preemption and rwsems (was: Re: missing madvise functionality)
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > i've attached an updated version of trace-it.c, which will turn this
> > off itself, using a sysctl. I also made WAKEUP_TIMING default-off.
>
> ok.  http://userweb.kernel.org/~akpm/to-ingo.txt is the trace of
>
>	taskset -c 0 ./jakubs-test-app
>
> while the system was doing the 150,000 context switches/sec.
>
> It isn't very interesting.

this shows an idle CPU#7: you should taskset -c 0 trace-it too - it only
traces the current CPU by default. (there's the
/proc/sys/kernel/trace_all_cpus flag to trace all cpus, but in this case
we really want the trace of CPU#0)

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: preemption and rwsems (was: Re: missing madvise functionality)
On Fri, 6 Apr 2007 11:08:22 +0200 Ingo Molnar <[EMAIL PROTECTED]> wrote:

> * Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> > > getting a good trace of it is easy: pick up the latest -rt kernel
> > > from:
> > >
> > >	http://redhat.com/~mingo/realtime-preempt/
> > >
> > > enable EVENT_TRACING in that kernel, run the workload and do:
> > >
> > >	scripts/trace-it > to-ingo.txt
> > >
> > > and send me the output.
> >
> > Did that - no output was generated. config at
> > http://userweb.kernel.org/~akpm/config-akpm2.txt
>
> sorry, i forgot to mention that you should turn off
> CONFIG_WAKEUP_TIMING.
>
> i've attached an updated version of trace-it.c, which will turn this
> off itself, using a sysctl. I also made WAKEUP_TIMING default-off.

ok.  http://userweb.kernel.org/~akpm/to-ingo.txt is the trace of

	taskset -c 0 ./jakubs-test-app

while the system was doing the 150,000 context switches/sec.

It isn't very interesting.
Re: preemption and rwsems (was: Re: missing madvise functionality)
* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > getting a good trace of it is easy: pick up the latest -rt kernel
> > from:
> >
> >	http://redhat.com/~mingo/realtime-preempt/
> >
> > enable EVENT_TRACING in that kernel, run the workload and do:
> >
> >	scripts/trace-it > to-ingo.txt
> >
> > and send me the output.
>
> Did that - no output was generated. config at
> http://userweb.kernel.org/~akpm/config-akpm2.txt

sorry, i forgot to mention that you should turn off
CONFIG_WAKEUP_TIMING.

i've attached an updated version of trace-it.c, which will turn this off
itself, using a sysctl. I also made WAKEUP_TIMING default-off.

> I did get an interesting dmesg spew:
> http://userweb.kernel.org/~akpm/dmesg-akpm2.txt

yeah, it's stack footprint measurement/instrumentation. It's
particularly effective at tracking the worst-case stack footprint if you
have FUNCTION_TRACING enabled - because in that case the kernel measures
the stack's size at every function entry point. It does a maximum search
after bootup (in search of the 'largest' stack frame), so it's a bit
verbose, but gets a lot rarer later on. If it bothers you then disable:

  CONFIG_DEBUG_STACKOVERFLOW=y

it could interfere with getting a quality scheduling trace anyway.

	Ingo

/*
 * Copyright (C) 2005, Ingo Molnar <[EMAIL PROTECTED]>
 *
 * user-triggered tracing.
 *
 * The -rt kernel has a built-in kernel tracer, which will trace
 * all kernel function calls (and a couple of special events as well),
 * by using a build-time gcc feature that instruments all kernel
 * functions.
 *
 * The tracer is highly automated for a number of latency tracing purposes,
 * but it can also be switched into 'user-triggered' mode, which is a
 * half-automatic tracing mode where userspace apps start and stop the
 * tracer. This file shows a dumb example of how to turn user-triggered
 * tracing on, and how to start/stop tracing. Note that if you do
 * multiple start/stop sequences, the kernel will do a maximum search
 * over their latencies, and will keep the trace of the largest latency
 * in /proc/latency_trace. The maximums are also reported to the kernel
 * log. (but can also be read from /proc/sys/kernel/preempt_max_latency)
 *
 * For the tracer to be activated, turn on CONFIG_EVENT_TRACING
 * in the .config, rebuild the kernel and boot into it. The trace will
 * get _a lot_ more verbose if you also turn on CONFIG_FUNCTION_TRACING,
 * every kernel function call will be put into the trace. Note that
 * CONFIG_FUNCTION_TRACING has significant runtime overhead, so you don't
 * want to use it for performance testing :)
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <linux/unistd.h>

int main (int argc, char **argv)
{
	int ret;

	if (getuid() != 0) {
		fprintf(stderr, "needs to run as root.\n");
		exit(1);
	}
	ret = system("cat /proc/sys/kernel/mcount_enabled >/dev/null 2>/dev/null");
	if (ret) {
		fprintf(stderr, "CONFIG_LATENCY_TRACING not enabled?\n");
		exit(1);
	}

	system("echo 1 > /proc/sys/kernel/trace_user_triggered");
	system("[ -e /proc/sys/kernel/wakeup_timing ] && echo 0 > /proc/sys/kernel/wakeup_timing");
	system("echo 1 > /proc/sys/kernel/trace_enabled");
	system("echo 1 > /proc/sys/kernel/mcount_enabled");
	system("echo 0 > /proc/sys/kernel/trace_freerunning");
	system("echo 0 > /proc/sys/kernel/trace_print_on_crash");
	system("echo 0 > /proc/sys/kernel/trace_verbose");
	system("echo 0 > /proc/sys/kernel/preempt_thresh 2>/dev/null");
	system("echo 0 > /proc/sys/kernel/preempt_max_latency 2>/dev/null");

	// start tracing
	if (prctl(0, 1)) {
		fprintf(stderr, "trace-it: couldn't start tracing!\n");
		return 1;
	}
	usleep(100);
	if (prctl(0, 0)) {
		fprintf(stderr, "trace-it: couldn't stop tracing!\n");
		return 1;
	}

	system("echo 0 > /proc/sys/kernel/trace_user_triggered");
	system("echo 0 > /proc/sys/kernel/trace_enabled");
	system("cat /proc/latency_trace");

	return 0;
}
Re: missing madvise functionality
Ulrich Drepper wrote:
> Nick Piggin wrote:
> > Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
> > kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
> > mmap/mprotect, which have more fundamental locking requirements, more
> > overhead and no benefits (except debugging, I suppose).
>
> It's a tiny bit faster, see
>
>   http://people.redhat.com/drepper/dontneed.png
>
> I just ran it once so the graph is not smooth. This is on a UP dual core
> machine. Maybe tomorrow I'll turn on the big 4p machine.

Hmm, I saw an improvement, but that was just on a raw syscall test with a
single page chunk. Real-world use I guess will get progressively less
dramatic as other overheads start being introduced.

Multi-thread performance probably won't get a whole lot better (it does
eliminate 1 down_write(mmap_sem), but one remains) until you use my
madvise patch.

> I would have to see dramatically different results on the big machine
> to make me change the libc code. The reason is that there is a big
> drawback. So far, when we allocate a new arena, we allocate address
> space with PROT_NONE and only when we need memory the protection is
> changed to PROT_READ|PROT_WRITE. This is the advantage of catching wild
> pointer accesses.

Sure, yes. And I guess you'd always want to keep that option around as a
debugging aid.

--
SUSE Labs, Novell Inc.
Re: missing madvise functionality
Nick Piggin wrote:
> Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
> kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
> mmap/mprotect, which have more fundamental locking requirements, more
> overhead and no benefits (except debugging, I suppose).

It's a tiny bit faster, see

  http://people.redhat.com/drepper/dontneed.png

I just ran it once so the graph is not smooth. This is on a UP dual core
machine. Maybe tomorrow I'll turn on the big 4p machine.

I would have to see dramatically different results on the big machine to
make me change the libc code. The reason is that there is a big drawback.
So far, when we allocate a new arena, we allocate address space with
PROT_NONE and only when we need memory the protection is changed to
PROT_READ|PROT_WRITE. This is the advantage of catching wild pointer
accesses.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: missing madvise functionality
Ulrich Drepper wrote:
> In case somebody wants to play around with Rik's patch or another
> madvise-based patch, I have x86-64 glibc binaries which can use it:
>
>   http://people.redhat.com/drepper/rpms
>
> These are based on the latest Fedora rawhide version. They should work
> on older systems, too, but you screw up your updates. Use them only if
> you know what you do.
>
> By default madvise(MADV_DONTNEED) is used. With the environment variable

Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
mmap/mprotect, which have more fundamental locking requirements, more
overhead and no benefits (except debugging, I suppose).

MADV_DONTNEED is twice as fast in single threaded performance, and an
order of magnitude faster for multiple threads, when MADV_DONTNEED only
takes mmap_sem for read.

Do you plan to include this change in general glibc releases? Maybe it
will make google malloc obsolete? ;)

(I don't suppose you'd be able to get any tests done, Andrew?)

--
SUSE Labs, Novell Inc.
Re: missing madvise functionality
Rik van Riel wrote:
> Nick Piggin wrote:
>
> > Oh, also: something like this patch would help out MADV_DONTNEED, as
> > it means it can run concurrently with page faults. I think the
> > locking will work (but needs forward porting).
>
> Ironically, your patch decreases throughput on my quad core
> test system, with Jakub's test case.
>
> MADV_DONTNEED, my patch, 1 loops (14k context switches/second)
>
> real	0m34.890s
> user	0m17.256s
> sys	0m29.797s
>
> MADV_DONTNEED, my patch & your patch, 1 loops (50 context
> switches/second)
>
> real	1m8.321s
> user	0m20.840s
> sys	1m55.677s
>
> I suspect it's moving the contention onto the page table lock,
> in zap_pte_range(). I guess that the thread private memory
> areas must be living right next to each other, in the same
> page table lock regions :)
>
> For more real world workloads, like the MySQL sysbench one,
> I still suspect that your patch would improve things.

I think it definitely would, because the app will be wanting to do other
things with mmap_sem as well (like futexes *grumble*).

Also, the test case is allocating and freeing 512K chunks, which I think
would be on the high side of typical. You have 32 threads for 4 CPUs, so
then it would actually make sense to context switch on mmap_sem write
lock rather than spin on ptl. But the kernel doesn't know that.

Testing with a small chunk size or threads == CPUs I think would show a
swing toward my patch.

--
SUSE Labs, Novell Inc.
Re: missing madvise functionality
Andrew Morton wrote:

> > #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
>
> I wonder which way you're using, and whether using the other way
> changes things.

I'm using the default Fedora config file, which has NR_CPUS defined to
64 and CONFIG_SPLIT_PTLOCK_CPUS to 4, so I am using the split locks.

However, I suspect that each 512kB malloced area will share one page
table lock with 4 others, so some contention is to be expected.

> > For more real world workloads, like the MySQL sysbench one,
> > I still suspect that your patch would improve things.
> >
> > Time to move back to debugging other stuff, though.
> >
> > Andrew, it would be nice if our patches could cook in -mm
> > for a while. Want me to change anything before submitting?
>
> umm. I took a quick squint at a patch from you this morning and it
> looked OK to me. Please send the finalish thing when it is fully baked
> and performance-tested in the various regions of operation, thanks.

Will do.

Ulrich has a test version of glibc available that uses MADV_DONTNEED
for free(3), that should test this thing nicely. I'll run some tests
with that when I get the time, hopefully next week.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: missing madvise functionality
On Thu, 05 Apr 2007 14:38:30 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Nick Piggin wrote:
>
> > Oh, also: something like this patch would help out MADV_DONTNEED, as
> > it means it can run concurrently with page faults. I think the
> > locking will work (but needs forward porting).
>
> Ironically, your patch decreases throughput on my quad core
> test system, with Jakub's test case.
>
> MADV_DONTNEED, my patch, 1 loops (14k context switches/second)
>
> real	0m34.890s
> user	0m17.256s
> sys	0m29.797s
>
> MADV_DONTNEED, my patch & your patch, 1 loops (50 context
> switches/second)
>
> real	1m8.321s
> user	0m20.840s
> sys	1m55.677s
>
> I suspect it's moving the contention onto the page table lock,
> in zap_pte_range(). I guess that the thread private memory
> areas must be living right next to each other, in the same
> page table lock regions :)

Remember that we have two different ways of doing that locking:

#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
/*
 * We tuck a spinlock to guard each pagetable page into its struct page,
 * at page->private, with BUILD_BUG_ON to make sure that this will not
 * overflow into the next struct page (as it might with DEBUG_SPINLOCK).
 * When freeing, reset page->mapping so free_pages_check won't complain.
 */
#define __pte_lockptr(page)	&((page)->ptl)
#define pte_lock_init(_page)	do {					\
	spin_lock_init(__pte_lockptr(_page));				\
} while (0)
#define pte_lock_deinit(page)	((page)->mapping = NULL)
#define pte_lockptr(mm, pmd)	({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
#else
/*
 * We use mm->page_table_lock to guard all pagetable pages of the mm.
 */
#define pte_lock_init(page)	do {} while (0)
#define pte_lock_deinit(page)	do {} while (0)
#define pte_lockptr(mm, pmd)	({(void)(pmd); &(mm)->page_table_lock;})
#endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */

I wonder which way you're using, and whether using the other way changes
things.

> For more real world workloads, like the MySQL sysbench one,
> I still suspect that your patch would improve things.
>
> Time to move back to debugging other stuff, though.
>
> Andrew, it would be nice if our patches could cook in -mm
> for a while. Want me to change anything before submitting?

umm. I took a quick squint at a patch from you this morning and it
looked OK to me. Please send the finalish thing when it is fully baked
and performance-tested in the various regions of operation, thanks.
Re: preemption and rwsems (was: Re: missing madvise functionality)
On Thu, 5 Apr 2007 21:11:29 +0200 Ingo Molnar <[EMAIL PROTECTED]> wrote:

> * David Howells <[EMAIL PROTECTED]> wrote:
>
> > But short of recording the lock sequence, I don't think there's any
> > way to find out for sure. printk probably won't cut it as a recording
> > mechanism because its overheads are too great.
>
> getting a good trace of it is easy: pick up the latest -rt kernel from:
>
>	http://redhat.com/~mingo/realtime-preempt/
>
> enable EVENT_TRACING in that kernel, run the workload and do:
>
>	scripts/trace-it > to-ingo.txt
>
> and send me the output.

Did that - no output was generated. config at
http://userweb.kernel.org/~akpm/config-akpm2.txt

> It will be large but interesting. That should get us a whole lot closer
> to what happens. A (much!) more finegrained result would be to also
> enable FUNCTION_TRACING and to do:
>
>	echo 1 > /proc/sys/kernel/mcount_enabled
>
> before running trace-it.

Did that - still no output.

I did get an interesting dmesg spew:
http://userweb.kernel.org/~akpm/dmesg-akpm2.txt
Re: preemption and rwsems (was: Re: missing madvise functionality)
On Thu, 05 Apr 2007 13:48:58 +0100 David Howells <[EMAIL PROTECTED]> wrote:

> Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> > What we effectively have is 32 threads on a single CPU all doing
> >
> >	for (ever) {
> >		down_write()
> >		up_write()
> >		down_read()
> >		up_read();
> >	}
>
> That's not quite so. In that test program, most loops do two d/u writes
> and then a slew of d/u reads with virtually no delay between them. One
> of the write-locked periods possibly lasts a relatively long time (it
> frees a bunch of pages), and the read-locked periods last a potentially
> long time (have to allocate a page).

Whatever. I think it is still the case that the queueing behaviour of
rwsems causes us to get into this abababababab scenario. And a single,
sole, solitary cond_resched() is sufficient to trigger the whole process
happening, and once it has started, it is sustained.

> If they weren't, you'd have to expect writer starvation in this
> situation. As it is, you're guaranteed progress on all threads.
>
> > CONFIG_PREEMPT_VOLUNTARY=y
>
> Which means the periods of lock-holding can be extended by preemption
> of the lock holder(s), making the whole situation that much worse. You
> have to remember, you *can* be preempted whilst you hold a semaphore,
> rwsem or mutex.

Of course - the same thing happens with CONFIG_PREEMPT=y.

> > I run it all on a single CPU under `taskset -c 0' on the 8-way and it
> > still causes 160,000 context switches per second and takes 9.5
> > seconds (after s/10/1000).
>
> How about if you have a UP kernel? (ie: spinlocks -> nops)

dunno.

> > the context switch rate falls to zilch and total runtime falls to 6.4
> > seconds.
>
> I presume you don't mean literally zero.

I do. At least, I was unable to discern any increase in the
context-switch column in the `vmstat 1' output.

> > If that cond_resched() was not there, none of this would ever happen
> > - each thread merrily chugs away doing its ups and downs until it
> > expires its timeslice.
>
> Interesting, in a sad sort of way.
>
> The trouble is, I think, that you spend so much more time holding (or
> attempting to hold) locks than not, and preemption just exacerbates
> things.

No. Preemption *triggers* things. We're talking about an increase in
context switch rate by a factor of at least 10,000. Something changed in
a fundamental way.

> I suspect that the reason the problem doesn't seem so obvious when
> you've got 8 CPUs crunching their way through at once is probably
> because you can make progress on several read loops simultaneously fast
> enough that the preemption is lost in the things having to stop to give
> everyone writelocks.

The context switch rate is enormous on SMP on all kernel configs.
Perhaps a better way of looking at it is to observe that the special
case of a single processor running a non-preemptible kernel simply got
lucky.

> But short of recording the lock sequence, I don't think there's any way
> to find out for sure. printk probably won't cut it as a recording
> mechanism because its overheads are too great.

I think any code sequence which does

	for ( ; ; ) {
		down_write()
		up_write()
		down_read()
		up_read()
	}

is vulnerable to the artifact which I described. I don't think we can
(or should) do anything about it at the lock implementation level. It's
more a matter of being aware of the possible failure modes of rwsems,
and being more careful to avoid that situation in the code which uses
rwsems. And, of course, being careful about when and where we use rwsems
as opposed to other types of locks.
Re: preemption and rwsems (was: Re: missing madvise functionality)
* David Howells <[EMAIL PROTECTED]> wrote:

> But short of recording the lock sequence, I don't think there's any
> way to find out for sure. printk probably won't cut it as a recording
> mechanism because its overheads are too great.

getting a good trace of it is easy: pick up the latest -rt kernel from:

	http://redhat.com/~mingo/realtime-preempt/

enable EVENT_TRACING in that kernel, run the workload and do:

	scripts/trace-it > to-ingo.txt

and send me the output. It will be large but interesting. That should
get us a whole lot closer to what happens. A (much!) more finegrained
result would be to also enable FUNCTION_TRACING and to do:

	echo 1 > /proc/sys/kernel/mcount_enabled

before running trace-it.

	Ingo
Re: missing madvise functionality
Nick Piggin wrote:

> Oh, also: something like this patch would help out MADV_DONTNEED, as it
> means it can run concurrently with page faults. I think the locking
> will work (but needs forward porting).

Ironically, your patch decreases throughput on my quad core
test system, with Jakub's test case.

MADV_DONTNEED, my patch, 1 loops (14k context switches/second)

real	0m34.890s
user	0m17.256s
sys	0m29.797s

MADV_DONTNEED, my patch & your patch, 1 loops (50 context
switches/second)

real	1m8.321s
user	0m20.840s
sys	1m55.677s

I suspect it's moving the contention onto the page table lock,
in zap_pte_range(). I guess that the thread private memory
areas must be living right next to each other, in the same
page table lock regions :)

For more real world workloads, like the MySQL sysbench one,
I still suspect that your patch would improve things.

Time to move back to debugging other stuff, though.

Andrew, it would be nice if our patches could cook in -mm
for a while. Want me to change anything before submitting?
Re: missing madvise functionality
Jakub Jelinek wrote:
> > +	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
> >  	case MADV_DONTNEED:
> > +	case MADV_FREE:
> >  		error = madvise_dontneed(vma, prev, start, end);
> >  		break;
>
> I think you should only use the new behavior for madvise MADV_FREE, not
> for MADV_DONTNEED.

I will.

However, we need to double-use MADV_DONTNEED in this patch for now, so
Ulrich's test glibc can be used easily :)
Re: missing madvise functionality
In case somebody wants to play around with Rik's patch or another
madvise-based patch, I have x86-64 glibc binaries which can use it:

  http://people.redhat.com/drepper/rpms

These are based on the latest Fedora rawhide version. They should work
on older systems, too, but you screw up your updates. Use them only if
you know what you do.

By default madvise(MADV_DONTNEED) is used. With the environment variable
MALLOC_MADVISE one can select a different hint. The value of the envvar
must be the number of that other hint.
Re: missing madvise functionality
Andrew Morton wrote:
> On Thu, 05 Apr 2007 03:39:29 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
>
> > Rik van Riel wrote:
> > > MADV_DONTNEED, unpatched, 1000 loops
> > >
> > > real	0m13.672s
> > > user	0m1.217s
> > > sys	0m45.712s
> > >
> > > MADV_DONTNEED, with patch, 1000 loops
> > >
> > > real	0m4.169s
> > > user	0m2.033s
> > > sys	0m3.224s
> >
> > I just noticed something fun with these numbers.
> >
> > Without the patch, the system (a quad core CPU) is 10% idle.
> >
> > With the patch, it is 66% idle - presumably I need Nick's mmap_sem
> > patch. However, despite being 66% idle, the test still runs over
> > 3 times as fast!
>
> Please quote the context switch rate when testing this stuff (I use
> vmstat 1). I've seen it vary by a factor of 10,000 depending upon
> what's happening.

About 14000 context switches per second.

I'll go compile in Nick's patch to see if that makes things go faster.
I expect it will.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 1  0      0 965232 250024 370848    0    0     0     0 1026 13914 13 21 67  0  0
 1  0      0 965232 250024 370848    0    0     0     0 1018 14654 12 20 68  0  0
 1  0      0 965232 250024 370848    0    0     0     0 1023 14006 12 21 67  0  0
Re: preemption and rwsems (was: Re: missing madvise functionality)
Andrew Morton <[EMAIL PROTECTED]> wrote: > > What we effectively have is 32 threads on a single CPU all doing > > for (ever) { > down_write() > up_write() > down_read() > up_read(); > } That's not quite so. In that test program, most loops do two d/u writes and then a slew of d/u reads with virtually no delay between them. One of the write-locked periods possibly lasts a relatively long time (it frees a bunch of pages), and the read-locked periods last a potentially long time (have to allocate a page). Though, to be fair, as long as you've got way more than 16MB of RAM, the memory stuff shouldn't take too long, but the locks will be being held for a long time compared to the periods when you're not holding a lock of any sort. > and rwsems are "fair". If they weren't, you'd have to expect writer starvation in this situation. As it is, you're guaranteed progress on all threads. > CONFIG_PREEMPT_VOLUNTARY=y Which means the periods of lock-holding can be extended by preemption of the lock holder(s), making the whole situation that much worse. You have to remember, you *can* be preempted whilst you hold a semaphore, rwsem or mutex. > I run it all on a single CPU under `taskset -c 0' on the 8-way and it still > causes 160,000 context switches per second and takes 9.5 seconds (after > s/10/1000). How about if you have a UP kernel? (ie: spinlocks -> nops) > the context switch rate falls to zilch and total runtime falls to 6.4 > seconds. I presume you don't mean literally zero. > If that cond_resched() was not there, none of this would ever happen - each > thread merrily chugs away doing its ups and downs until it expires its > timeslice. Interesting, in a sad sort of way. The trouble is, I think, that you spend so much more time holding (or attempting to hold) locks than not, and preemption just exacerbates things. 
I suspect that the reason the problem doesn't seem so obvious when you've got 8 CPUs crunching their way through at once is probably because you can make progress on several read loops simultaneously fast enough that the preemption is lost in the things having to stop to give everyone writelocks. But short of recording the lock sequence, I don't think there's any way to find out for sure. printk probably won't cut it as a recording mechanism because its overheads are too great. David
Re: missing madvise functionality
Eric Dumazet wrote: Could you please add this patch and see if it helps on your machine ? [PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem Avoids cache line dirtying I could, but I already know it's not going to help much. How do I know this? I already have 66% idle time when running with my patch (and without Nick Piggin's patch to take the mmap_sem for reading only). Interestingly, despite the idle time increasing from 10% to 66%, throughput triples... Saving some CPU time will probably only increase the idle time; I see no reason your patch would reduce contention and increase throughput. I'm not saying your patch doesn't make sense - it probably does. I just suspect it would have zero impact on this particular scenario, because of the already huge idle time.
Re: missing madvise functionality
On Thu, 05 Apr 2007 03:39:29 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote: > Rik van Riel wrote: > > > MADV_DONTNEED, unpatched, 1000 loops > > > > real 0m13.672s > > user 0m1.217s > > sys 0m45.712s > > > > > > MADV_DONTNEED, with patch, 1000 loops > > > > real 0m4.169s > > user 0m2.033s > > sys 0m3.224s > > I just noticed something fun with these numbers. > > Without the patch, the system (a quad core CPU) is 10% idle. > > With the patch, it is 66% idle - presumably I need Nick's > mmap_sem patch. > > However, despite being 66% idle, the test still runs over > 3 times as fast! Please quote the context switch rate when testing this stuff (I use vmstat 1). I've seen it vary by a factor of 10,000 depending upon what's happening.
Re: missing madvise functionality
On Thu, 05 Apr 2007 03:31:24 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote: > Jakub Jelinek wrote: > > > My guess is that all the page zeroing is pretty expensive as well and > > takes significant time, but I haven't profiled it. > > With the attached patch (Andrew, I'll change the details around > if you want - I just wanted something to test now), your test > case run time went down considerably. > > I modified the test case to only run 1000 loops, so it would run > a bit faster on my system. I also modified it to use MADV_DONTNEED > to zap the pages, instead of the mmap(PROT_NONE) thing you use. > Interesting... Could you please add this patch and see if it helps on your machine ? [PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem Avoids cache line dirtying : The first cache line of mm_struct is/should_be mostly read. In case find_vma() hits the cache, we don't need to access the beginning of mm_struct. Since we just dirtied mmap_sem, access to its cache line is free. In case find_vma() misses the cache, we don't need to dirty the beginning of mm_struct. Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]> --- linux-2.6.21-rc5/include/linux/sched.h +++ linux-2.6.21-rc5-ed/include/linux/sched.h @@ -310,7 +310,6 @@ typedef unsigned long mm_counter_t; struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ struct rb_root mm_rb; - struct vm_area_struct * mmap_cache; /* last find_vma result */ unsigned long (*get_unmapped_area) (struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); @@ -324,6 +323,7 @@ struct mm_struct { atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ int map_count; /* number of VMAs */ struct rw_semaphore mmap_sem; + struct vm_area_struct * mmap_cache; /* last find_vma result */ spinlock_t page_table_lock; /* Protects page tables and some counters */ struct list_head mmlist; /* List of maybe swapped mm's.
These are globally strung
Re: missing madvise functionality
Rik van Riel wrote: MADV_DONTNEED, unpatched, 1000 loops real 0m13.672s user 0m1.217s sys 0m45.712s MADV_DONTNEED, with patch, 1000 loops real 0m4.169s user 0m2.033s sys 0m3.224s I just noticed something fun with these numbers. Without the patch, the system (a quad core CPU) is 10% idle. With the patch, it is 66% idle - presumably I need Nick's mmap_sem patch. However, despite being 66% idle, the test still runs over 3 times as fast!
Re: missing madvise functionality
Jakub Jelinek wrote: My guess is that all the page zeroing is pretty expensive as well and takes significant time, but I haven't profiled it. With the attached patch (Andrew, I'll change the details around if you want - I just wanted something to test now), your test case run time went down considerably. I modified the test case to only run 1000 loops, so it would run a bit faster on my system. I also modified it to use MADV_DONTNEED to zap the pages, instead of the mmap(PROT_NONE) thing you use. MADV_DONTNEED, unpatched, 1000 loops real 0m13.672s user 0m1.217s sys 0m45.712s MADV_DONTNEED, with patch, 1000 loops real 0m4.169s user 0m2.033s sys 0m3.224s --- linux-2.6.20.noarch/include/asm-alpha/mman.h.madvise 2007-04-04 16:44:50.0 -0400 +++ linux-2.6.20.noarch/include/asm-alpha/mman.h 2007-04-04 16:56:24.0 -0400 @@ -42,6 +42,7 @@ #define MADV_WILLNEED 3 /* will need these pages */ #define MADV_SPACEAVAIL 5 /* ensure resources are available */ #define MADV_DONTNEED 6 /* don't need these pages */ +#define MADV_FREE 7 /* don't need the pages or the data */ /* common/generic parameters */ #define MADV_REMOVE 9 /* remove these pages & resources */ --- linux-2.6.20.noarch/include/asm-generic/mman.h.madvise 2007-04-04 16:44:50.0 -0400 +++ linux-2.6.20.noarch/include/asm-generic/mman.h 2007-04-04 16:56:53.0 -0400 @@ -29,6 +29,7 @@ #define MADV_SEQUENTIAL 2 /* expect sequential page references */ #define MADV_WILLNEED 3 /* will need these pages */ #define MADV_DONTNEED 4 /* don't need these pages */ +#define MADV_FREE 5 /* don't need the pages or the data */ /* common parameters: try to keep these consistent across architectures */ #define MADV_REMOVE 9 /* remove these pages & resources */ --- linux-2.6.20.noarch/include/asm-mips/mman.h.madvise 2007-04-04 16:44:50.0 -0400 +++
linux-2.6.20.noarch/include/asm-mips/mman.h 2007-04-04 16:58:02.0 -0400 @@ -65,6 +65,7 @@ #define MADV_SEQUENTIAL 2 /* expect sequential page references */ #define MADV_WILLNEED 3 /* will need these pages */ #define MADV_DONTNEED 4 /* don't need these pages */ +#define MADV_FREE 5 /* don't need the pages or the data */ /* common parameters: try to keep these consistent across architectures */ #define MADV_REMOVE 9 /* remove these pages & resources */ --- linux-2.6.20.noarch/include/asm-parisc/mman.h.madvise 2007-04-04 16:44:50.0 -0400 +++ linux-2.6.20.noarch/include/asm-parisc/mman.h 2007-04-04 16:58:40.0 -0400 @@ -38,6 +38,7 @@ #define MADV_SPACEAVAIL 5 /* insure that resources are reserved */ #define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */ #define MADV_VPS_INHERIT 7 /* Inherit parents page size */ +#define MADV_FREE 8 /* don't need the pages or the data */ /* common/generic parameters */ #define MADV_REMOVE 9 /* remove these pages & resources */ --- linux-2.6.20.noarch/include/asm-xtensa/mman.h.madvise 2007-04-04 16:44:51.0 -0400 +++ linux-2.6.20.noarch/include/asm-xtensa/mman.h 2007-04-04 16:59:14.0 -0400 @@ -72,6 +72,7 @@ #define MADV_SEQUENTIAL 2 /* expect sequential page references */ #define MADV_WILLNEED 3 /* will need these pages */ #define MADV_DONTNEED 4 /* don't need these pages */ +#define MADV_FREE 5 /* don't need the pages or the data */ /* common parameters: try to keep these consistent across architectures */ #define MADV_REMOVE 9 /* remove these pages & resources */ --- linux-2.6.20.noarch/include/linux/mm_inline.h.madvise 2007-04-03 22:53:25.0 -0400 +++ linux-2.6.20.noarch/include/linux/mm_inline.h 2007-04-04 22:19:24.0 -0400 @@ -13,6 +13,13 @@ add_page_to_inactive_list(struct zone *z } static inline void +add_page_to_inactive_list_tail(struct zone *zone, struct page *page) +{ + list_add_tail(&page->lru, &zone->inactive_list); + __inc_zone_state(zone, NR_INACTIVE); +} + +static inline void del_page_from_active_list(struct zone *zone, struct
page *page) { list_del(&page->lru); --- linux-2.6.20.noarch/include/linux/mm.h.madvise 2007-04-03 22:53:25.0 -0400 +++ linux-2.6.20.noarch/include/linux/mm.h 2007-04-04 22:06:45.0 -0400 @@ -716,6 +716,7 @@ struct zap_details { pgoff_t last_index; /* Highest page->index to unmap */ spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */ unsigned long truncate_count; /* Compare vm_truncate_count */ + short madv_free; /* MADV_FREE anonymous memory */ }; struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t); --- linux-2.6.20.noarch/include/linux/page-flags.h.madvise 2007-04-03 22:54:58.0 -0400 +++ linux-2.6.20.noarch/include/linux/page-flags.h 2007-04-05 01:27:38.0 -0400 @@ -91,6 +91,8 @@ #define PG_nosave_free 18 /* Used
Re: missing madvise functionality
Ulrich Drepper wrote: Eric Dumazet wrote: Database workload, where the user multi threaded app is constantly accessing GBytes of data, so L2 cache hit is very small. If you want to oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is in the top 5. We did have a workload with lots of Java and databases at some point when many VMAs were the issue. I brought this up here one, maybe two years ago and I think Blaisorblade went on and looked into avoiding VMA splits by having mprotect() not split VMAs and instead store the flags in the page table somewhere. I don't remember the details. Nothing came out of this but if this is possible it would be yet another way to avoid mmap_sem locking, right? I was speaking about oprofile needs, which may interfere with target process needs, since oprofile calls find_vma() on the target process mm and thus zaps its mmap_cache. oprofile is yet another mmap_sem user, but also a mmap_cache destroyer. We could at least have a separate cache, only for oprofile. If done correctly we might avoid taking mmap_sem when the same vm_area_struct contains EIP/RIP snapshots.
Re: missing madvise functionality
On Thu, Apr 05, 2007 at 03:31:24AM -0400, Rik van Riel wrote: > >My guess is that all the page zeroing is pretty expensive as well and > >takes significant time, but I haven't profiled it. > > With the attached patch (Andrew, I'll change the details around > if you want - I just wanted something to test now), your test > case run time went down considerably. Thanks. --- linux-2.6.20.noarch/mm/madvise.c.madvise 2007-04-03 21:53:47.0 -0400 +++ linux-2.6.20.noarch/mm/madvise.c 2007-04-04 23:48:34.0 -0400 @@ -142,8 +142,12 @@ static long madvise_dontneed(struct vm_a .last_index = ULONG_MAX, }; zap_page_range(vma, start, end - start, &details); - } else - zap_page_range(vma, start, end - start, NULL); + } else { + struct zap_details details = { + .madv_free = 1, + }; + zap_page_range(vma, start, end - start, &details); + } return 0; } @@ -209,7 +213,9 @@ madvise_vma(struct vm_area_struct *vma, error = madvise_willneed(vma, prev, start, end); break; + /* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */ case MADV_DONTNEED: + case MADV_FREE: error = madvise_dontneed(vma, prev, start, end); break; I think you should only use the new behavior for MADV_FREE, not for MADV_DONTNEED. The current MADV_DONTNEED behavior (which conflicts with POSIX's POSIX_MADV_DONTNEED, but that doesn't matter, since whatever glibc maps posix_madvise(POSIX_MADV_DONTNEED) to in the madvise call, if anything, doesn't have to be MADV_DONTNEED and can be anything else) is apparently documented in the Linux man pages: MADV_DONTNEED Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in re-loading of the memory contents from the underlying mapped file (see mmap()) or zero-fill-on-demand pages for mappings without an underlying file. so it wouldn't surprise me if something relied on zero filling.
So IMHO madv_free in details should only be set for MADV_FREE. Also, I think MADV_FREE shouldn't do anything at all (i.e. not call zap_page_range, but not fail either) for shared or file-backed vmas; it should only do something for private anon memory. After all, it is just an optimization and it makes sense only for private anon mappings. Jakub
Re: missing madvise functionality
On Thu, 05 Apr 2007 04:31:55 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote: > Eric Dumazet wrote: > > > Could you please add this patch and see if it helps on your machine ? > > > > [PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem > > > > Avoids cache line dirtying > > I could, but I already know it's not going to help much. > > How do I know this? I already have 66% idle time when running > with my patch (and without Nick Piggin's patch to take the > mmap_sem for reading only). Interestingly, despite the idle > time increasing from 10% to 66%, throughput triples... > > Saving some CPU time will probably only increase the idle time, > I see no reason your patch would reduce contention and increase > throughput. > > I'm not saying your patch doesn't make sense - it probably does. > I just suspect it would have zero impact on this particular > scenario, because of the already huge idle time. I know your cpus have idle time, that's not the question. But *when* your cpus are not idle, they might be slowed down because of cache line transfers between them. This patch doesn't reduce contention, just latencies (and improves overall performance). I don't currently have an SMP test machine, so I couldn't test it myself. On x86_64, I am pretty sure the patch would help, because offsetof(mmap_sem) = 0x60. On i386, offsetof(mmap_sem) = 0x34, so this patch won't help. As you said, throughput can rise and idle time rise too.
Re: missing madvise functionality
Eric Dumazet wrote: > Database workload, where the user multi threaded app is constantly > accessing GBytes of data, so L2 cache hit is very small. If you want to > oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is > in the top 5. We did have a workload with lots of Java and databases at some point when many VMAs were the issue. I brought this up here one, maybe two years ago and I think Blaisorblade went on and looked into avoiding VMA splits by having mprotect() not split VMAs and instead store the flags in the page table somewhere. I don't remember the details. Nothing came out of this but if this is possible it would be yet another way to avoid mmap_sem locking, right?
Re: missing madvise functionality
Nick Piggin wrote: Eric Dumazet wrote: >> This was not a working patch, just to throw the idea, since the answers I got showed I was not understood. In this case, find_extend_vma() should of course have one struct vm_area_cache * argument, like find_vma() One single cache on one mm is not scalable. oprofile badly hits it on a dual cpu config. Oh, what sort of workload are you using to show this? The only reason that I didn't submit my thread cache patches was that I didn't show a big enough improvement. Database workload, where the user multi threaded app is constantly accessing GBytes of data, so L2 cache hit is very small. If you want to oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is in the top 5. Each time oprofile has an NMI, it calls find_vma(EIP/RIP) and blows out the target process cache (usually plugged on the data vma containing user land futexes). Even with private futexes, it will probably be plugged on the brk() vma.
Re: missing madvise functionality
Nick Piggin wrote: Oh, also: something like this patch would help out MADV_DONTNEED, as it means it can run concurrently with page faults. I think the locking will work (but needs forward porting). Ironically, your patch decreases throughput on my quad core test system, with Jakub's test case. MADV_DONTNEED, my patch, 1 loops (14k context switches/second) real 0m34.890s user 0m17.256s sys 0m29.797s MADV_DONTNEED, my patch + your patch, 1 loops (50 context switches/second) real 1m8.321s user 0m20.840s sys 1m55.677s I suspect it's moving the contention onto the page table lock, in zap_pte_range(). I guess that the thread private memory areas must be living right next to each other, in the same page table lock regions :) For more real world workloads, like the MySQL sysbench one, I still suspect that your patch would improve things. Time to move back to debugging other stuff, though. Andrew, it would be nice if our patches could cook in -mm for a while. Want me to change anything before submitting?
Re: preemption and rwsems (was: Re: missing madvise functionality)
* David Howells <[EMAIL PROTECTED]> wrote: But short of recording the lock sequence, I don't think there's any way to find out for sure. printk probably won't cut it as a recording mechanism because its overheads are too great. getting a good trace of it is easy: pick up the latest -rt kernel from: http://redhat.com/~mingo/realtime-preempt/ enable EVENT_TRACING in that kernel, run the workload and do: scripts/trace-it > to-ingo.txt and send me the output. It will be large but interesting. That should get us a whole lot closer to what happens. A (much!) more finegrained result would be to also enable FUNCTION_TRACING and to do: echo 1 > /proc/sys/kernel/mcount_enabled before running trace-it. Ingo
Re: preemption and rwsems (was: Re: missing madvise functionality)
On Thu, 05 Apr 2007 13:48:58 +0100 David Howells <[EMAIL PROTECTED]> wrote: Andrew Morton <[EMAIL PROTECTED]> wrote: What we effectively have is 32 threads on a single CPU all doing for (ever) { down_write() up_write() down_read() up_read(); } That's not quite so. In that test program, most loops do two d/u writes and then a slew of d/u reads with virtually no delay between them. One of the write-locked periods possibly lasts a relatively long time (it frees a bunch of pages), and the read-locked periods last a potentially long time (have to allocate a page). Whatever. I think it is still the case that the queueing behaviour of rwsems causes us to get into this abababababab scenario. And a single, sole, solitary cond_resched() is sufficient to trigger the whole process happening, and once it has started, it is sustained. If they weren't, you'd have to expect writer starvation in this situation. As it is, you're guaranteed progress on all threads. CONFIG_PREEMPT_VOLUNTARY=y Which means the periods of lock-holding can be extended by preemption of the lock holder(s), making the whole situation that much worse. You have to remember, you *can* be preempted whilst you hold a semaphore, rwsem or mutex. Of course - the same thing happens with CONFIG_PREEMPT=y. I run it all on a single CPU under `taskset -c 0' on the 8-way and it still causes 160,000 context switches per second and takes 9.5 seconds (after s/10/1000). How about if you have a UP kernel? (ie: spinlocks -> nops) dunno. the context switch rate falls to zilch and total runtime falls to 6.4 seconds. I presume you don't mean literally zero. I do. At least, I was unable to discern any increase in the context-switch column in the `vmstat 1' output. If that cond_resched() was not there, none of this would ever happen - each thread merrily chugs away doing its ups and downs until it expires its timeslice. Interesting, in a sad sort of way.
The trouble is, I think, that you spend so much more time holding (or attempting to hold) locks than not, and preemption just exacerbates things. No. Preemption *triggers* things. We're talking about an increase in context switch rate by a factor of at least 10,000. Something changed in a fundamental way. I suspect that the reason the problem doesn't seem so obvious when you've got 8 CPUs crunching their way through at once is probably because you can make progress on several read loops simultaneously fast enough that the preemption is lost in the things having to stop to give everyone writelocks. The context switch rate is enormous on SMP on all kernel configs. Perhaps a better way of looking at it is to observe that the special case of a single processor running a non-preemptible kernel simply got lucky. But short of recording the lock sequence, I don't think there's any way to find out for sure. printk probably won't cut it as a recording mechanism because its overheads are too great. I think any code sequence which does for ( ; ; ) { down_write() up_write() down_read() up_read() } is vulnerable to the artifact which I described. I don't think we can (or should) do anything about it at the lock implementation level. It's more a matter of being aware of the possible failure modes of rwsems, and being more careful to avoid that situation in the code which uses rwsems. And, of course, being careful about when and where we use rwsems as opposed to other types of locks.
Re: preemption and rwsems (was: Re: missing madvise functionality)
On Thu, 5 Apr 2007 21:11:29 +0200 Ingo Molnar <[EMAIL PROTECTED]> wrote: * David Howells <[EMAIL PROTECTED]> wrote: But short of recording the lock sequence, I don't think there's any way to find out for sure. printk probably won't cut it as a recording mechanism because its overheads are too great. getting a good trace of it is easy: pick up the latest -rt kernel from: http://redhat.com/~mingo/realtime-preempt/ enable EVENT_TRACING in that kernel, run the workload and do: scripts/trace-it > to-ingo.txt and send me the output. Did that - no output was generated. config at http://userweb.kernel.org/~akpm/config-akpm2.txt It will be large but interesting. That should get us a whole lot closer to what happens. A (much!) more finegrained result would be to also enable FUNCTION_TRACING and to do: echo 1 > /proc/sys/kernel/mcount_enabled before running trace-it. Did that - still no output. I did get an interesting dmesg spew: http://userweb.kernel.org/~akpm/dmesg-akpm2.txt
Re: missing madvise functionality
On Thu, 05 Apr 2007 14:38:30 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote: Nick Piggin wrote: Oh, also: something like this patch would help out MADV_DONTNEED, as it means it can run concurrently with page faults. I think the locking will work (but needs forward porting). Ironically, your patch decreases throughput on my quad core test system, with Jakub's test case. MADV_DONTNEED, my patch, 1 loops (14k context switches/second) real 0m34.890s user 0m17.256s sys 0m29.797s MADV_DONTNEED, my patch + your patch, 1 loops (50 context switches/second) real 1m8.321s user 0m20.840s sys 1m55.677s I suspect it's moving the contention onto the page table lock, in zap_pte_range(). I guess that the thread private memory areas must be living right next to each other, in the same page table lock regions :) Remember that we have two different ways of doing that locking:

#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
/*
 * We tuck a spinlock to guard each pagetable page into its struct page,
 * at page->private, with BUILD_BUG_ON to make sure that this will not
 * overflow into the next struct page (as it might with DEBUG_SPINLOCK).
 * When freeing, reset page->mapping so free_pages_check won't complain.
 */
#define __pte_lockptr(page)	&((page)->ptl)
#define pte_lock_init(_page)	do {				\
	spin_lock_init(__pte_lockptr(_page));			\
} while (0)
#define pte_lock_deinit(page)	((page)->mapping = NULL)
#define pte_lockptr(mm, pmd)	({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
#else
/*
 * We use mm->page_table_lock to guard all pagetable pages of the mm.
 */
#define pte_lock_init(page)	do {} while (0)
#define pte_lock_deinit(page)	do {} while (0)
#define pte_lockptr(mm, pmd)	({(void)(pmd); &(mm)->page_table_lock;})
#endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */

I wonder which way you're using, and whether using the other way changes things. For more real world workloads, like the MySQL sysbench one, I still suspect that your patch would improve things. Time to move back to debugging other stuff, though.
Andrew, it would be nice if our patches could cook in -mm for a while. Want me to change anything before submitting? umm. I took a quick squint at a patch from you this morning and it looked OK to me. Please send the finalish thing when it is fully baked and performance-tested in the various regions of operation, thanks.
Re: missing madvise functionality
Andrew Morton wrote: #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS I wonder which way you're using, and whether using the other way changes things. I'm using the default Fedora config file, which has NR_CPUS defined to 64 and CONFIG_SPLIT_PTLOCK_CPUS to 4, so I am using the split locks. However, I suspect that each 512kB malloced area will share one page table lock with 4 others, so some contention is to be expected. For more real world workloads, like the MySQL sysbench one, I still suspect that your patch would improve things. Time to move back to debugging other stuff, though. Andrew, it would be nice if our patches could cook in -mm for a while. Want me to change anything before submitting? umm. I took a quick squint at a patch from you this morning and it looked OK to me. Please send the finalish thing when it is fully baked and performance-tested in the various regions of operation, thanks. Will do. Ulrich has a test version of glibc available that uses MADV_DONTNEED for free(3), that should test this thing nicely. I'll run some tests with that when I get the time, hopefully next week.
Re: missing madvise functionality
Rik van Riel wrote: Nick Piggin wrote: Oh, also: something like this patch would help out MADV_DONTNEED, as it means it can run concurrently with page faults. I think the locking will work (but needs forward porting). Ironically, your patch decreases throughput on my quad core test system, with Jakub's test case.

MADV_DONTNEED, my patch, 1 loops (14k context switches/second)
real 0m34.890s
user 0m17.256s
sys  0m29.797s

MADV_DONTNEED, my patch + your patch, 1 loops (50 context switches/second)
real 1m8.321s
user 0m20.840s
sys  1m55.677s

I suspect it's moving the contention onto the page table lock, in zap_pte_range(). I guess that the thread private memory areas must be living right next to each other, in the same page table lock regions :) For more real world workloads, like the MySQL sysbench one, I still suspect that your patch would improve things. I think it definitely would, because the app will be wanting to do other things with mmap_sem as well (like futexes *grumble*). Also, the test case is allocating and freeing 512K chunks, which I think would be on the high side of typical. You have 32 threads for 4 CPUs, so then it would actually make sense to context switch on mmap_sem write lock rather than spin on ptl. But the kernel doesn't know that. Testing with a small chunk size or threads == CPUs I think would show a swing toward my patch. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
Ulrich Drepper wrote: In case somebody wants to play around with Rik patch or another madvise-based patch, I have x86-64 glibc binaries which can use it: http://people.redhat.com/drepper/rpms These are based on the latest Fedora rawhide version. They should work on older systems, too, but you screw up your updates. Use them only if you know what you do. By default madvise(MADV_DONTNEED) is used. With the environment variable Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's kernels using down_write(mmap_sem) for MADV_DONTNEED is better than mmap/mprotect, which have more fundamental locking requirements, more overhead and no benefits (except debugging, I suppose). MADV_DONTNEED is twice as fast in single threaded performance, and an order of magnitude faster for multiple threads, when MADV_DONTNEED only takes mmap_sem for read. Do you plan to include this change in general glibc releases? Maybe it will make google malloc obsolete? ;) (I don't suppose you'd be able to get any tests done, Andrew?) -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
Nick Piggin wrote: Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's kernels using down_write(mmap_sem) for MADV_DONTNEED is better than mmap/mprotect, which have more fundamental locking requirements, more overhead and no benefits (except debugging, I suppose). It's a tiny bit faster, see http://people.redhat.com/drepper/dontneed.png I just ran it once so the graph is not smooth. This is on a UP dual core machine. Maybe tomorrow I'll turn on the big 4p machine. I would have to see dramatically different results on the big machine to make me change the libc code. The reason is that there is a big drawback. So far, when we allocate a new arena, we allocate address space with PROT_NONE and only when we need memory the protection is changed to PROT_READ|PROT_WRITE. This is the advantage of catching wild pointer accesses. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA
Re: missing madvise functionality
Ulrich Drepper wrote: Nick Piggin wrote: Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's kernels using down_write(mmap_sem) for MADV_DONTNEED is better than mmap/mprotect, which have more fundamental locking requirements, more overhead and no benefits (except debugging, I suppose). It's a tiny bit faster, see http://people.redhat.com/drepper/dontneed.png I just ran it once so the graph is not smooth. This is on a UP dual core machine. Maybe tomorrow I'll turn on the big 4p machine. Hmm, I saw an improvement, but that was just on a raw syscall test with a single page chunk. Real-world use I guess will get progressively less dramatic as other overheads start being introduced. Multi-thread performance probably won't get a whole lot better (it does eliminate 1 down_write(mmap_sem), but one remains) until you use my madvise patch. I would have to see dramatically different results on the big machine to make me change the libc code. The reason is that there is a big drawback. So far, when we allocate a new arena, we allocate address space with PROT_NONE and only when we need memory the protection is changed to PROT_READ|PROT_WRITE. This is the advantage of catching wild pointer accesses. Sure, yes. And I guess you'd always want to keep that options around as a debugging aid. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
Eric Dumazet wrote: Database workload, where the user multi threaded app is constantly accessing GBytes of data, so L2 cache hit is very small. If you want to oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is in the top 5. We did have a workload with lots of Java and databases at some point when many VMAs were the issue. I brought this up here one, maybe two years ago and I think Blaisorblade went on and looked into avoiding VMA splits by having mprotect() not split VMAs and instead store the flags in the page table somewhere. I don't remember the details. Nothing came out of this but if this is possible it would be yet another way to avoid mmap_sem locking, right? -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA
Re: missing madvise functionality
Nick Piggin wrote: Eric Dumazet wrote: This was not a working patch, just to throw the idea, since the answers I got showed I was not understood. In this case, find_extend_vma() should of course have one struct vm_area_cache * argument, like find_vma(). One single cache on one mm is not scalable. oprofile badly hits it on a dual cpu config. Oh, what sort of workload are you using to show this? The only reason that I didn't submit my thread cache patches was that I didn't show a big enough improvement. Database workload, where the user multi threaded app is constantly accessing GBytes of data, so L2 cache hit is very small. If you want to oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is in the top 5. Each time oprofile has an NMI, it calls find_vma(EIP/RIP) and blows out the target process cache (usually plugged on the data vma containing user land futexes). Even with private futexes, it will probably be plugged on the brk() vma. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
On Thu, Apr 05, 2007 at 03:31:24AM -0400, Rik van Riel wrote: My guess is that all the page zeroing is pretty expensive as well and takes significant time, but I haven't profiled it. With the attached patch (Andrew, I'll change the details around if you want - I just wanted something to test now), your test case run time went down considerably. Thanks.

--- linux-2.6.20.noarch/mm/madvise.c.madvise	2007-04-03 21:53:47.0 -0400
+++ linux-2.6.20.noarch/mm/madvise.c	2007-04-04 23:48:34.0 -0400
@@ -142,8 +142,12 @@ static long madvise_dontneed(struct vm_a
 			.last_index = ULONG_MAX,
 		};
 		zap_page_range(vma, start, end - start, &details);
-	} else
-		zap_page_range(vma, start, end - start, NULL);
+	} else {
+		struct zap_details details = {
+			.madv_free = 1,
+		};
+		zap_page_range(vma, start, end - start, &details);
+	}
 	return 0;
 }
@@ -209,7 +213,9 @@ madvise_vma(struct vm_area_struct *vma,
 		error = madvise_willneed(vma, prev, start, end);
 		break;
 
+	/* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		error = madvise_dontneed(vma, prev, start, end);
 		break;

I think you should only use the new behavior for MADV_FREE, not for MADV_DONTNEED. The current MADV_DONTNEED behavior (which conflicts with POSIX's POSIX_MADV_DONTNEED, but that doesn't matter, since whatever glibc maps posix_madvise(POSIX_MADV_DONTNEED) to in the madvise call doesn't have to be MADV_DONTNEED, but can be anything else) is apparently documented in Linux man pages:

MADV_DONTNEED
	Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in re-loading of the memory contents from the underlying mapped file (see mmap()) or zero-fill-on-demand pages for mappings without an underlying file.

so it wouldn't surprise me if something relied on zero filling. 
So IMHO madv_free in details should be only set if MADV_FREE. Also, I think MADV_FREE shouldn't do anything at all (i.e. don't call zap_page_range, but don't fail either) for shared or file backed vmas, only for private anon memory it should do something. After all, it is just an optimization and it makes sense only for private anon mappings. Jakub - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
On Thu, 05 Apr 2007 04:31:55 -0400 Rik van Riel [EMAIL PROTECTED] wrote: Eric Dumazet wrote: Could you please add this patch and see if it helps on your machine ? [PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem Avoids cache line dirtying I could, but I already know it's not going to help much. How do I know this? I already have 66% idle time when running with my patch (and without Nick Piggin's patch to take the mmap_sem for reading only). Interestingly, despite the idle time increasing from 10% to 66%, throughput triples... Saving some CPU time will probably only increase the idle time, I see no reason your patch would reduce contention and increase throughput. I'm not saying your patch doesn't make sense - it probably does. I just suspect it would have zero impact on this particular scenario, because of the already huge idle time. I know your cpus have idle time, that's not the question. But *when* your cpus are not idle, they might be slowed down because of cache line transfers between them. This patch doesn't reduce contention, just latencies (and overall performance). I don't currently have an SMP test machine, so I couldn't test it myself. On x86_64, I am pretty sure the patch would help, because offsetof(mmap_sem) = 0x60. On i386, offsetof(mmap_sem) = 0x34, so this patch won't help. As you said, throughput can rise and idle time rise too. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
Jakub Jelinek wrote: My guess is that all the page zeroing is pretty expensive as well and takes significant time, but I haven't profiled it. With the attached patch (Andrew, I'll change the details around if you want - I just wanted something to test now), your test case run time went down considerably. I modified the test case to only run 1000 loops, so it would run a bit faster on my system. I also modified it to use MADV_DONTNEED to zap the pages, instead of the mmap(PROT_NONE) thing you use.

MADV_DONTNEED, unpatched, 1000 loops
real 0m13.672s
user 0m1.217s
sys  0m45.712s

MADV_DONTNEED, with patch, 1000 loops
real 0m4.169s
user 0m2.033s
sys  0m3.224s

-- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.

--- linux-2.6.20.noarch/include/asm-alpha/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-alpha/mman.h	2007-04-04 16:56:24.0 -0400
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED	3	/* will need these pages */
 #define MADV_SPACEAVAIL	5	/* ensure resources are available */
 #define MADV_DONTNEED	6	/* don't need these pages */
+#define MADV_FREE	7	/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9	/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-generic/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-generic/mman.h	2007-04-04 16:56:53.0 -0400
@@ -29,6 +29,7 @@
 #define MADV_SEQUENTIAL	2	/* expect sequential page references */
 #define MADV_WILLNEED	3	/* will need these pages */
 #define MADV_DONTNEED	4	/* don't need these pages */
+#define MADV_FREE	5	/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9	/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-mips/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-mips/mman.h	2007-04-04 16:58:02.0 -0400
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	2	/* expect sequential page references */
 #define MADV_WILLNEED	3	/* will need these pages */
 #define MADV_DONTNEED	4	/* don't need these pages */
+#define MADV_FREE	5	/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9	/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-parisc/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-parisc/mman.h	2007-04-04 16:58:40.0 -0400
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL	5	/* insure that resources are reserved */
 #define MADV_VPS_PURGE	6	/* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7	/* Inherit parents page size */
+#define MADV_FREE	8	/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9	/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-xtensa/mman.h.madvise	2007-04-04 16:44:51.0 -0400
+++ linux-2.6.20.noarch/include/asm-xtensa/mman.h	2007-04-04 16:59:14.0 -0400
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	2	/* expect sequential page references */
 #define MADV_WILLNEED	3	/* will need these pages */
 #define MADV_DONTNEED	4	/* don't need these pages */
+#define MADV_FREE	5	/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9	/* remove these pages & resources */
--- linux-2.6.20.noarch/include/linux/mm_inline.h.madvise	2007-04-03 22:53:25.0 -0400
+++ linux-2.6.20.noarch/include/linux/mm_inline.h	2007-04-04 22:19:24.0 -0400
@@ -13,6 +13,13 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+	list_add_tail(&page->lru, &zone->inactive_list);
+	__inc_zone_state(zone, NR_INACTIVE);
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
--- linux-2.6.20.noarch/include/linux/mm.h.madvise	2007-04-03 22:53:25.0 -0400
+++ linux-2.6.20.noarch/include/linux/mm.h	2007-04-04 22:06:45.0 -0400
@@ -716,6 +716,7 @@ struct zap_details {
 	pgoff_t last_index;		/* Highest page->index to unmap */
 	spinlock_t *i_mmap_lock;	/* For unmap_mapping_range: */
 	unsigned long truncate_count;	/* Compare vm_truncate_count */
+	short madv_free;		/* MADV_FREE anonymous memory */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.20.noarch/include/linux/page-flags.h.madvise	2007-04-03 22:54:58.0 -0400
+++ linux-2.6.20.noarch/include/linux/page-flags.h	2007-04-05 01:27:38.0 -0400
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/*
Re: missing madvise functionality
Eric Dumazet wrote: Could you please add this patch and see if it helps on your machine ? [PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem Avoids cache line dirtying I could, but I already know it's not going to help much. How do I know this? I already have 66% idle time when running with my patch (and without Nick Piggin's patch to take the mmap_sem for reading only). Interestingly, despite the idle time increasing from 10% to 66%, throughput triples... Saving some CPU time will probably only increase the idle time, I see no reason your patch would reduce contention and increase throughput. I'm not saying your patch doesn't make sense - it probably does. I just suspect it would have zero impact on this particular scenario, because of the already huge idle time. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
Ulrich Drepper wrote: Eric Dumazet wrote: Database workload, where the user multi threaded app is constantly accessing GBytes of data, so L2 cache hit is very small. If you want to oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is in the top 5. We did have a workload with lots of Java and databases at some point when many VMAs were the issue. I brought this up here one, maybe two years ago and I think Blaisorblade went on and looked into avoiding VMA splits by having mprotect() not split VMAs and instead store the flags in the page table somewhere. I don't remember the details. Nothing came out of this but if this is possible it would be yet another way to avoid mmap_sem locking, right? I was speaking about oprofile needs, that may interfere with target process needs, since oprofile calls find_vma() on the target process mm and thus zaps its mmap_cache. oprofile is yet another mmap_sem user, but also a mmap_cache destroyer. We could at least have a separate cache, only for oprofile. If done correctly we might avoid taking mmap_sem when the same vm_area_struct contains EIP/RIP snapshots. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
On Thu, 05 Apr 2007 03:39:29 -0400 Rik van Riel [EMAIL PROTECTED] wrote: Rik van Riel wrote:

MADV_DONTNEED, unpatched, 1000 loops
real 0m13.672s
user 0m1.217s
sys  0m45.712s

MADV_DONTNEED, with patch, 1000 loops
real 0m4.169s
user 0m2.033s
sys  0m3.224s

I just noticed something fun with these numbers. Without the patch, the system (a quad core CPU) is 10% idle. With the patch, it is 66% idle - presumably I need Nick's mmap_sem patch. However, despite being 66% idle, the test still runs over 3 times as fast! Please quote the context switch rate when testing this stuff (I use vmstat 1). I've seen it vary by a factor of 10,000 depending upon what's happening. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
On Thu, 05 Apr 2007 03:31:24 -0400 Rik van Riel [EMAIL PROTECTED] wrote: Jakub Jelinek wrote: My guess is that all the page zeroing is pretty expensive as well and takes significant time, but I haven't profiled it. With the attached patch (Andrew, I'll change the details around if you want - I just wanted something to test now), your test case run time went down considerably. I modified the test case to only run 1000 loops, so it would run a bit faster on my system. I also modified it to use MADV_DONTNEED to zap the pages, instead of the mmap(PROT_NONE) thing you use. Interesting... Could you please add this patch and see if it helps on your machine ?

[PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem

Avoids cache line dirtying: The first cache line of mm_struct is/should_be mostly read. In case find_vma() hits the cache, we don't need to access the beginning of mm_struct. Since we just dirtied mmap_sem, access to its cache line is free. In case find_vma() misses the cache, we don't need to dirty the beginning of mm_struct.

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

--- linux-2.6.21-rc5/include/linux/sched.h
+++ linux-2.6.21-rc5-ed/include/linux/sched.h
@@ -310,7 +310,6 @@ typedef unsigned long mm_counter_t;
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
-	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 	unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
@@ -324,6 +323,7 @@ struct mm_struct {
 	atomic_t mm_count;	/* How many references to "struct mm_struct" (users count as 1) */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
+	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
 
 	struct list_head mmlist;	/* List of maybe swapped mm's.  These are globally strung

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
Rik van Riel wrote:

MADV_DONTNEED, unpatched, 1000 loops
real 0m13.672s
user 0m1.217s
sys  0m45.712s

MADV_DONTNEED, with patch, 1000 loops
real 0m4.169s
user 0m2.033s
sys  0m3.224s

I just noticed something fun with these numbers. Without the patch, the system (a quad core CPU) is 10% idle. With the patch, it is 66% idle - presumably I need Nick's mmap_sem patch. However, despite being 66% idle, the test still runs over 3 times as fast! -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: preemption and rwsems (was: Re: missing madvise functionality)
Andrew Morton [EMAIL PROTECTED] wrote: What we effectively have is 32 threads on a single CPU all doing for (ever) { down_write() up_write() down_read() up_read(); } That's not quite so. In that test program, most loops do two d/u writes and then a slew of d/u reads with virtually no delay between them. One of the write-locked periods possibly lasts a relatively long time (it frees a bunch of pages), and the read-locked periods last a potentially long time (have to allocate a page). Though, to be fair, as long as you've got way more than 16MB of RAM, the memory stuff shouldn't take too long, but the locks will be being held for a long time compared to the periods when you're not holding a lock of any sort. and rwsems are fair. If they weren't, you'd have to expect writer starvation in this situation. As it is, you're guaranteed progress on all threads. CONFIG_PREEMPT_VOLUNTARY=y Which means the periods of lock-holding can be extended by preemption of the lock holder(s), making the whole situation that much worse. You have to remember, you *can* be preempted whilst you hold a semaphore, rwsem or mutex. I run it all on a single CPU under `taskset -c 0' on the 8-way and it still causes 160,000 context switches per second and takes 9.5 seconds (after s/10/1000). How about if you have a UP kernel? (ie: spinlocks -> nops) the context switch rate falls to zilch and total runtime falls to 6.4 seconds. I presume you don't mean literally zero. If that cond_resched() was not there, none of this would ever happen - each thread merrily chugs away doing its ups and downs until it expires its timeslice. Interesting, in a sad sort of way. The trouble is, I think, that you spend so much more time holding (or attempting to hold) locks than not, and preemption just exacerbates things. 
I suspect that the reason the problem doesn't seem so obvious when you've got 8 CPUs crunching their way through at once is probably because you can make progress on several read loops simultaneously fast enough that the preemption is lost in the things having to stop to give everyone writelocks. But short of recording the lock sequence, I don't think there's any way to find out for sure. printk probably won't cut it as a recording mechanism because its overheads are too great. David - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
Andrew Morton wrote: On Thu, 05 Apr 2007 03:39:29 -0400 Rik van Riel [EMAIL PROTECTED] wrote: Rik van Riel wrote:

MADV_DONTNEED, unpatched, 1000 loops
real 0m13.672s
user 0m1.217s
sys  0m45.712s

MADV_DONTNEED, with patch, 1000 loops
real 0m4.169s
user 0m2.033s
sys  0m3.224s

I just noticed something fun with these numbers. Without the patch, the system (a quad core CPU) is 10% idle. With the patch, it is 66% idle - presumably I need Nick's mmap_sem patch. However, despite being 66% idle, the test still runs over 3 times as fast! Please quote the context switch rate when testing this stuff (I use vmstat 1). I've seen it vary by a factor of 10,000 depending upon what's happening. About 14000 context switches per second. I'll go compile in Nick's patch to see if that makes things go faster. I expect it will.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 1  0      0 965232 250024 370848    0    0     0     0 1026 13914 13 21 67  0  0
 1  0      0 965232 250024 370848    0    0     0     0 1018 14654 12 20 68  0  0
 1  0      0 965232 250024 370848    0    0     0     0 1023 14006 12 21 67  0  0

-- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
In case somebody wants to play around with Rik patch or another madvise-based patch, I have x86-64 glibc binaries which can use it: http://people.redhat.com/drepper/rpms These are based on the latest Fedora rawhide version. They should work on older systems, too, but you screw up your updates. Use them only if you know what you do. By default madvise(MADV_DONTNEED) is used. With the environment variable MALLOC_MADVISE one can select a different hint. The value of the envvar must be the number of that other hint. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA
Re: missing madvise functionality
Jakub Jelinek wrote: + /* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */ case MADV_DONTNEED: + case MADV_FREE: error = madvise_dontneed(vma, prev, start, end); break; I think you should only use the new behavior for madvise MADV_FREE, not for MADV_DONTNEED. I will. However, we need to double-use MADV_DONTNEED in this patch for now, so Ulrich's test glibc can be used easily :) -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
On Wed, 4 Apr 2007 06:09:18 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote: >> Oh dear. On Wed, Apr 04, 2007 at 11:51:05AM -0700, Andrew Morton wrote: > what's all this about? I rewrote Jakub's testcase and included it as a MIME attachment. Current working version inline below. Also at http://holomorphy.com/~wli/jakub.c The basic idea was that I wanted a few more niceties, such as specifying the number of iterations and other things of that nature on the cmdline. I threw in a little code reorganization and error checking, too. -- wli

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/mman.h>

enum thread_return {
	tr_success	= 0,
	tr_mmap_init	= -1,
	tr_mmap_free	= -2,
	tr_mprotect	= -3,
	tr_madvise	= -4,
	tr_unknown	= -5,
	tr_munmap	= -6,
};

enum release_method {
	release_by_mmap		= 0,
	release_by_madvise	= 1,
	release_by_max		= 2,
};

struct thread_argument {
	size_t page_size;
	int iterations, pages_per_thread, nr_threads;
	enum release_method method;
};

static enum thread_return mmap_release(void *p, size_t n)
{
	void *q;

	q = mmap(p, n, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
	if (p != q) {
		perror("thread_function: mmap release failed");
		return tr_mmap_free;
	}
	if (mprotect(p, n, PROT_READ | PROT_WRITE)) {
		perror("thread_function: mprotect failed");
		return tr_mprotect;
	}
	return tr_success;
}

static enum thread_return madvise_release(void *p, size_t n)
{
	if (madvise(p, n, MADV_DONTNEED)) {
		perror("thread_function: madvise failed");
		return tr_madvise;
	}
	return tr_success;
}

static enum thread_return (*release_methods[])(void *, size_t) = {
	mmap_release,
	madvise_release,
};

static void *thread_function(void *__arg)
{
	char *p;
	int i;
	struct thread_argument *arg = __arg;
	size_t arena_size = arg->pages_per_thread * arg->page_size;

	p = (char *)mmap(NULL, arena_size, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("thread_function: arena allocation failed");
		return (void *)tr_mmap_init;
	}
	for (i = 0; i < arg->iterations; i++) {
		size_t s;
		char *q, *r;
		enum thread_return ret;

		/* Pretend to use the buffer. */
		r = p + arena_size;
		for (q = p; q < r; q += arg->page_size)
			*q = 1;
		for (s = 0, q = p; q < r; q += arg->page_size)
			s += *q;
		if (arg->method >= release_by_max) {
			perror("thread_function: "
				"unknown freeing method specified");
			return (void *)tr_unknown;
		}
		ret = (*release_methods[arg->method])(p, arena_size);
		if (ret != tr_success)
			return (void *)ret;
	}
	if (munmap(p, arena_size)) {
		perror("thread_function: munmap() failed");
		return (void *)tr_munmap;
	}
	return (void *)tr_success;
}

static int configure(struct thread_argument *arg, int argc, char *argv[])
{
	char optstring[] = "t:m:i:p:";
	int c, tmp, ret = 0;
	long n;

	n = sysconf(_SC_PAGE_SIZE);
	if (n < 0) {
		perror("configure: sysconf(_SC_PAGE_SIZE) failed");
		ret = -1;
	}
	arg->nr_threads = 32, arg->page_size = (size_t)n;
	arg->method = release_by_mmap;
	arg->iterations = 10;
	arg->pages_per_thread = 128;
	while ((c = getopt(argc, argv, optstring)) != -1) {
		switch (c) {
		case 't':
			if (sscanf(optarg, "%d", &tmp) == 1)
				arg->nr_threads = tmp;
			else {
				perror("configure: non-numeric thread count");
				ret = -1;
			}
			break;
		case 'm':
			if (!strcmp(optarg, "mmap"))
				arg->method = release_by_mmap;
			else if (!strcmp(optarg, "madvise"))
				arg->method = release_by_madvise;
			else {
				perror("configure: unrecognised release method");
Re: missing madvise functionality
Nick Piggin wrote:
> Jakub Jelinek wrote:
>> On Wed, Apr 04, 2007 at 05:46:12PM +1000, Nick Piggin wrote:
>>> Does mmap(PROT_NONE) actually free the memory?
>>
>> Yes.
>>
>> 	/* Clear old maps */
>> 	error = -ENOMEM;
>> munmap_back:
>> 	vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
>> 	if (vma && vma->vm_start < addr + len) {
>> 		if (do_munmap(mm, addr, len))
>> 			return -ENOMEM;
>> 		goto munmap_back;
>> 	}
>
> Thanks, I overlooked the mmap vs mprotect detail. So how are the
> subsequent access faults avoided?

AFAIKS, the faults are not avoided. Not for single page allocations, not
for multi-page allocations.

So what glibc currently does to allocate, use, then deallocate a page is
this:

	mprotect(PROT_READ|PROT_WRITE)	-> down_write(mmap_sem)
	touch page -> page fault	-> down_read(mmap_sem)
	mmap(PROT_NONE)			-> down_write(mmap_sem)

What it could be doing is:

	touch page -> page fault	-> down_read(mmap_sem)
	madvise(MADV_DONTNEED)		-> down_read(mmap_sem)

So after my previously posted patch (attached again) to only take
down_read in madvise where possible...

With 2 threads/2 CPUs, the attached test.c ends up doing about 140,000
context switches per second, takes a little over 2 million faults, and
about 80 seconds to complete, when running the old_test() function (ie.
mprotect,touch,mmap). When running new_test() (ie. touch,madvise),
context switches stay well under 100, it takes slightly fewer faults,
and it completes in about 8 seconds.

With 1 thread, new_test() actually completes in under half the time as
well (4.55 vs 9.88 seconds). This result won't have been altered by my
madvise patch, because the down_write fastpath is no slower than
down_read.

Any comments?

--
SUSE Labs, Novell Inc.

Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -12,6 +12,25 @@
 #include
 
 /*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_sem for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static int madvise_need_mmap_write(int behavior)
+{
+	switch (behavior) {
+	case MADV_DOFORK:
+	case MADV_DONTFORK:
+	case MADV_NORMAL:
+	case MADV_SEQUENTIAL:
+	case MADV_RANDOM:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+/*
  * We can potentially split a vm area into separate
  * areas, each area with its own behavior.
  */
@@ -264,7 +283,10 @@ asmlinkage long sys_madvise(unsigned lon
 	int error = -EINVAL;
 	size_t len;
 
-	down_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		down_write(&current->mm->mmap_sem);
+	else
+		down_read(&current->mm->mmap_sem);
 
 	if (start & ~PAGE_MASK)
 		goto out;
@@ -323,6 +345,10 @@ asmlinkage long sys_madvise(unsigned lon
 		vma = prev->vm_next;
 	}
 out:
-	up_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		up_write(&current->mm->mmap_sem);
+	else
+		up_read(&current->mm->mmap_sem);
+
 	return error;
 }

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NR_THREADS 1
#define ITERS 100
#define HEAPSIZE (4*1024)

static void *old_thread(void *heap)
{
	int i;

	for (i = 0; i < ITERS; i++) {
		char *mem = heap;

		if (mprotect(heap, HEAPSIZE, PROT_READ|PROT_WRITE) == -1)
			perror("mprotect"), exit(1);
		*mem = i;
		if (mmap(heap, HEAPSIZE, PROT_NONE,
			 MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0) == MAP_FAILED)
			perror("mmap"), exit(1);
	}
	return NULL;
}

static void old_test(void)
{
	void *heap;
	pthread_t pt[NR_THREADS];
	int i;

	heap = mmap(NULL, NR_THREADS*HEAPSIZE, PROT_NONE,
		    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		perror("mmap"), exit(1);
	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_create(&pt[i], NULL, old_thread, heap + i*HEAPSIZE) == -1)
			perror("pthread_create"), exit(1);
	}
	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_join(pt[i], NULL) == -1)
			perror("pthread_join"), exit(1);
	}
	if (munmap(heap, NR_THREADS*HEAPSIZE) == -1)
		perror("munmap"), exit(1);
}

static void *new_thread(void *heap)
{
	int i;

	for (i = 0; i < ITERS; i++) {
		char *mem = heap;

		*mem = i;
		if (madvise(heap, HEAPSIZE, MADV_DONTNEED) == -1)
			perror("madvise"), exit(1);
	}
	return NULL;
}

static void new_test(void)
{
	void *heap;
	pthread_t pt[NR_THREADS];
	int i;

	heap = mmap(NULL, HEAPSIZE, PROT_READ|PROT_WRITE,
		    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		perror("mmap"), exit(1);
	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_create(&pt[i], NULL, new_thread, heap + i*HEAPSIZE) == -1)
			perror("pthread_create"), exit(1);
	}
	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_join(pt[i], NULL) == -1)
			perror("pthread_join"), exit(1);
	}
	if (munmap(heap, HEAPSIZE) == -1)
		perror("munmap"), exit(1);
}
Re: missing madvise functionality
Eric Dumazet wrote:
> On Wed, 04 Apr 2007 20:05:54 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>>> @@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
>>> 	unsigned long start;
>>>
>>> 	addr &= PAGE_MASK;
>>> -	vma = find_vma(mm,addr);
>>> +	vma = find_vma(mm,addr,&current->vmacache);
>>> 	if (!vma)
>>> 		return NULL;
>>> 	if (vma->vm_start <= addr)
>>
>> So now you can have current calling find_extend_vma on someone else's mm
>> but using their cache. So you're going to return current's vma, or
>> current is going to get one of mm's vmas in its cache :P
>
> This was not a working patch, just to throw the idea, since the answers
> I got showed I was not understood. In this case, find_extend_vma()
> should of course have one struct vm_area_cache * argument, like
> find_vma()
>
> One single cache on one mm is not scalable. oprofile badly hits it on a
> dual cpu config.

Oh, what sort of workload are you using to show this? The only reason
that I didn't submit my thread cache patches was that I didn't show a big
enough improvement.

--
SUSE Labs, Novell Inc.
Re: missing madvise functionality
Hugh Dickins wrote:
> On Wed, 4 Apr 2007, Rik van Riel wrote:
>> Hugh Dickins wrote:
>>> (I didn't understand how Rik would achieve his point 5, _no_ lock
>>> contention while repeatedly re-marking these pages, but never mind.)
>>
>> The CPU marks them accessed when they are reused.
>>
>> The VM only moves the reused pages back to the active list on memory
>> pressure. This means that when the system is not under memory pressure,
>> the same page can simply stay PG_lazyfree for multiple malloc/free
>> rounds.
>
> Sure, there's no need for repetitious locking at the LRU end of it; but
> you said "if the system has lots of free memory, pages can go through
> multiple free/malloc cycles while sitting on the dontneed list, very
> lazily with no lock contention". I took that to mean, with userspace
> repeatedly madvising on the ranges they fall in, which will involve
> mmap_sem and ptl each time - just in order to check that no LRU movement
> is required each time.
>
> (Of course, there's also the problem that we don't leave our systems
> with lots of free memory: some LRU balancing decisions.)

I don't agree this approach is the best one anyway. I'd rather just the
simple MADV_DONTNEED/MADV_DONEED.

Once you go through the trouble of protecting the memory and flushing
TLBs, unprotecting them afterwards and taking a trap (even if it is a
pure hardware trap), I doubt you've saved much. You may have saved the
cost of zeroing out the page, but that has to be weighed against the fact
that you have left a possibly cache hot page sitting there to get cold,
and your accesses to initialise the malloced memory might have more cache
misses.

If you just free the page, it goes onto a nice LIFO cache hot list, and
when you want to allocate another one, you'll probably get a cache hot
one.

The problem is down_write(mmap_sem) isn't it? We can and should easily
fix that problem now. If we subsequently want to look at micro
optimisations to avoid zeroing using MMU tricks, then we have a good base
to compare with.

--
SUSE Labs, Novell Inc.
preemption and rwsems (was: Re: missing madvise functionality)
On Tue, 3 Apr 2007 16:29:37 -0400 Jakub Jelinek <[EMAIL PROTECTED]> wrote:

> #include <pthread.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> void *
> tf (void *arg)
> {
>   (void) arg;
>   size_t ps = sysconf (_SC_PAGE_SIZE);
>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>   if (p == MAP_FAILED)
>     exit (1);
>   int i;
>   for (i = 0; i < 10; i++)
>     {
>       /* Pretend to use the buffer. */
>       char *q, *r = (char *) p + 128 * ps;
>       size_t s;
>       for (q = (char *) p; q < r; q += ps)
>         *q = 1;
>       for (s = 0, q = (char *) p; q < r; q += ps)
>         s += *q;
>       /* Free it. Replace this mmap with
>          madvise (p, 128 * ps, MADV_THROWAWAY) when implemented. */
>       if (mmap (p, 128 * ps, PROT_NONE,
>                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != p)
>         exit (2);
>       /* And immediately malloc again. This would then be deleted. */
>       if (mprotect (p, 128 * ps, PROT_READ | PROT_WRITE))
>         exit (3);
>     }
>   return NULL;
> }
>
> int
> main (void)
> {
>   pthread_t th[32];
>   int i;
>   for (i = 0; i < 32; i++)
>     if (pthread_create (&th[i], NULL, tf, NULL))
>       exit (4);
>   for (i = 0; i < 32; i++)
>     pthread_join (th[i], NULL);
>   return 0;
> }

This little test app is fun.

I run it all on a single CPU under `taskset -c 0' on the 8-way and it
still causes 160,000 context switches per second and takes 9.5 seconds
(after s/10/1000).

The kernel has

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

and when I switch that to

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

the context switch rate falls to zilch and total runtime falls to 6.4
seconds. Presumably the same problem will occur with
CONFIG_PREEMPT_VOLUNTARY on uniprocessor kernels.

What we effectively have is 32 threads on a single CPU all doing

	for (ever) {
		down_write()
		up_write()
		down_read()
		up_read();
	}

and rwsems are "fair".
So the two threads (thread A vs thread B) interleave like this:

	down_write()
	cond_resched() -> schedule()
	down_read()  -> blocks
	up_write()
	down_read()
	up_read()
	down_write() -> there's a reader: block
	down_read()  -> succeeds
	up_read()
	down_write() -> there's another down_writer: block
	down_write() -> succeeds
	up_write()
	down_read()  -> there's a down_writer: block
	down_write() -> succeeds
	up_write()
	down_read()  -> succeeds
	up_read()
	down_write() -> there's a down_reader: block
	down_read()  -> succeeds

ad nauseam.

If that cond_resched() was not there, none of this would ever happen -
each thread merrily chugs away doing its ups and downs until it expires
its timeslice.

Interesting, in a sad sort of way.

Setting CONFIG_PREEMPT_NONE doesn't appear to make any difference to
context switch rate or runtime when all eight CPUs are used, so this
phenomenon is unlikely to be involved in the mysql problem.

I wonder why a similar thing doesn't happen when more than one CPU is
used.
Re: missing madvise functionality
On Wed, 04 Apr 2007 14:08:47 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > > There are other ways of doing it - I guess we could use a new page flag to > > indicate that this is one-of-those-pages, and add new code to handle it in > > all the right places. > > That's what I did. I'm currently working on the > zap_page_range() side of things. Let's try to avoid consuming another page flag if poss, please. Perhaps use PAGE_MAPPING_ANON's neighbouring bit? > > One thing which we haven't sorted out with all this stuff: once the > > application has marked an address range (and some pages) as > > whatever-were-going-call-this-feature, how does the application undo that > > change? > > It doesn't have to do anything. Just access the page and the > MMU will mark it dirty/accessed and the VM will not reclaim > it. um, OK. I suspect it would be good to clear the page's PageWhateverWereGoingToCallThisThing() state when this happens. Otherwise when the page gets clean again (ie: added to swapcache then written out) then it will look awfully similar to one of these new types of pages and things might get confusing. We'll see. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
On Wed, 4 Apr 2007 06:09:18 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:

> > On Tue, Apr 03, 2007 at 04:29:37PM -0400, Jakub Jelinek wrote:
> > void *
> > tf (void *arg)
> > {
> >   (void) arg;
> >   size_t ps = sysconf (_SC_PAGE_SIZE);
> >   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
> >                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >   if (p == MAP_FAILED)
> >     exit (1);
> >   int i;
>
> Oh dear.

what's all this about?
Re: missing madvise functionality
Hi,

> Oh. I was assuming that we'd want to unmap these pages from pagetables
> and mark them super-easily-reclaimable. So a later touch would incur a
> minor fault.
>
> But you think that we should leave them mapped into pagetables so no
> such fault occurs.

That would be very nice. The issues are not limited to threaded apps, we
have seen performance problems with single threaded HPC applications that
do a lot of large malloc/frees. It turns out the continual set up and
tear down of pagetables when malloc uses mmap/free is a problem.

At the moment the workaround is:

export MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1

which forces glibc malloc to use brk instead of mmap/free. Of course brk
is good for keeping pagetables around but bad for keeping memory usage
down.

Anton
Re: missing madvise functionality
On Wed, 4 Apr 2007, Andrew Morton wrote: > > The treatment is identical to clean swapcache pages, with the sole > exception that they don't actually consume any swap space - hence the fake > swapcache entry thing. I see, sneaking through try_to_unmap's anon PageSwapCache assumptions as simply as possible - thanks. (Coincidentally, Andrea pointed to precisely the same issue in the no PAGE_ZERO thread, when we were toying with writable but clean.) > One thing which we haven't sorted out with all this stuff: once the > application has marked an address range (and some pages) as > whatever-were-going-call-this-feature, how does the application undo > that change? By re-referencing the pages. (Hmm, so an incorrect app which accesses "free"d areas, will undo it: well, okay, nothing terrible about that.) > What effect will things like mremap, madvise and mlock have upon > these pages? mlock will undo the state in its make_pages_present: I guess that should happen in or near follow_page's mark_page_accessed. mremap? Other madvises? Nothing much at all: mremap can move them around, and the madvises do whatever they do - I don't notice any problem in that direction, but it'll be easier when we have an implementation to poke at. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
Andrew Morton wrote:
> There are other ways of doing it - I guess we could use a new page flag
> to indicate that this is one-of-those-pages, and add new code to handle
> it in all the right places.

That's what I did. I'm currently working on the zap_page_range() side of
things.

> One thing which we haven't sorted out with all this stuff: once the
> application has marked an address range (and some pages) as
> whatever-were-going-call-this-feature, how does the application undo
> that change?

It doesn't have to do anything. Just access the page and the MMU will
mark it dirty/accessed and the VM will not reclaim it.

> What effect will things like mremap, madvise and mlock have upon these
> pages?

Good point. I had not thought about these. Would you mind if I sent an
initial proof of concept patch that does not take these into account,
before we decide on what should happen in these cases? :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: missing madvise functionality
On Wed, 4 Apr 2007 10:15:41 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote: > On Tue, 3 Apr 2007, Andrew Morton wrote: > > > > All of which indicates that if we can remove the down_write(mmap_sem) from > > this glibc operation, things should get a lot better - there will be no > > additional context switches at all. > > > > And we can surely do that if all we're doing is looking up pageframes, > > putting pages into fake-swapcache and moving them around on the page LRUs. > > > > Hugh? Sanity check? > > Setting aside the fake-swapcache part, yes, Rik should be able to do what > Ulrich wants (operating on ptes and pages) without down_write(mmap_sem): > just needing down_read(mmap_sem) to keep the whole vma/pagetable structure > stable, and page table lock (literal or per-page-table) for each contents. > > (I didn't understand how Rik would achieve his point 5, _no_ lock > contention while repeatedly re-marking these pages, but never mind.) > > (Some mails in this thread overlook that we also use down_write(mmap_sem) > to guard simple things like vma->vm_flags: of course that in itself could > be manipulated with atomics, or spinlock; but like many of the vma fields, > changing it goes hand in hand with the chance that we have to split vma, > which does require the heavy-handed down_write(mmap_sem). I expect that > splitting those uses apart would be harder than first appears, and better > to go for a more radical redesign - I don't know what.) > > But you lose me with the fake-swapcache part of it: that came, I think, > from your initial idea that it would be okay to refault on these ptes. > Don't we all agree now that we'd prefer not to refault on those ptes, > unless some memory pressure has actually decided to pull them out? > (Hmm, yet more list balancing...) The way in which we want to treat these pages is (I believe) to keep them if there's not a lot of memory pressure, but to reclaim them "easily" if there is some memory pressure. 
A simple way to do that is to move them onto the inactive list. But how do we handle these pages when the vm scanner encounters them? The treatment is identical to clean swapcache pages, with the sole exception that they don't actually consume any swap space - hence the fake swapcache entry thing. There are other ways of doing it - I guess we could use a new page flag to indicate that this is one-of-those-pages, and add new code to handle it in all the right places. One thing which we haven't sorted out with all this stuff: once the application has marked an address range (and some pages) as whatever-were-going-call-this-feature, how does the application undo that change? What effect will things like mremap, madvise and mlock have upon these pages? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
On Wed, 4 Apr 2007, Rik van Riel wrote: > Hugh Dickins wrote: > > > (I didn't understand how Rik would achieve his point 5, _no_ lock > > contention while repeatedly re-marking these pages, but never mind.) > > The CPU marks them accessed when they are reused. > > The VM only moves the reused pages back to the active list > on memory pressure. This means that when the system is > not under memory pressure, the same page can simply stay > PG_lazyfree for multiple malloc/free rounds. Sure, there's no need for repetitious locking at the LRU end of it; but you said "if the system has lots of free memory, pages can go through multiple free/malloc cycles while sitting on the dontneed list, very lazily with no lock contention". I took that to mean, with userspace repeatedly madvising on the ranges they fall in, which will involve mmap_sem and ptl each time - just in order to check that no LRU movement is required each time. (Of course, there's also the problem that we don't leave our systems with lots of free memory: some LRU balancing decisions.) Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
Hugh Dickins wrote:
> (I didn't understand how Rik would achieve his point 5, _no_ lock
> contention while repeatedly re-marking these pages, but never mind.)

The CPU marks them accessed when they are reused.

The VM only moves the reused pages back to the active list on memory
pressure. This means that when the system is not under memory pressure,
the same page can simply stay PG_lazyfree for multiple malloc/free
rounds.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: missing madvise functionality
On Wed, 4 Apr 2007, Marko Macek wrote: > Ulrich Drepper wrote: > > A solution for this problem is a madvise() operation with the following > > property: > > > > - the content of the address range can be discarded > > > > - if an access to a page in the range happens in the future it must > > succeed. The old page content can be provided or a new, empty page > > can be provided > > Doesn't this conflict with disabling overcommit? > > If the page is guaranteed to be available, obviously it must count as > being commited, so this is not equivalent to real freeing. No, there's no conflict with disabled overcommit here: Committed_AS accounting is done on the whole vma size (at mmap or brk time), no matter how many pages may or may not be faulted in later. Rather like RLIMIT_AS. The proposed madvise operation won't affect it. (But I take Ulrich's "must succeed" with one pinch of salt: Out-Of-Memory killing remains a possibility, of course.) Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: missing madvise functionality
On Wed, Apr 04, 2007 at 06:09:18AM -0700, William Lee Irwin III wrote:
> 	for (--i; i >= 0; --i) {
> 		if (pthread_join(th[i], NULL)) {
> 			perror("main: pthread_join failed");
> 			ret = EXIT_FAILURE;
> 		}
> 	}

Obligatory brown paper bag patch:

--- ./jakub.c.orig	2007-04-04 05:57:23.409493248 -0700
+++ ./jakub.c	2007-04-04 06:35:34.296043432 -0700
@@ -232,10 +232,14 @@ int main(int argc, char *argv[])
 		}
 	}
 	for (--i; i >= 0; --i) {
-		if (pthread_join(th[i], NULL)) {
+		void *status;
+
+		if (pthread_join(th[i], &status)) {
 			perror("main: pthread_join failed");
 			ret = EXIT_FAILURE;
 		}
+		if (status != (void *)tr_success)
+			ret = EXIT_FAILURE;
 	}
 	free(th);
 	getrusage(RUSAGE_SELF, &ru);

-- wli
Re: missing madvise functionality
On Tue, Apr 03, 2007 at 04:29:37PM -0400, Jakub Jelinek wrote:
> void *
> tf (void *arg)
> {
>   (void) arg;
>   size_t ps = sysconf (_SC_PAGE_SIZE);
>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>   if (p == MAP_FAILED)
>     exit (1);
>   int i;

Oh dear.

-- wli

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/resource.h>

enum thread_return {
	tr_success	= 0,
	tr_mmap_init	= -1,
	tr_mmap_free	= -2,
	tr_mprotect	= -3,
	tr_madvise	= -4,
	tr_unknown	= -5,
	tr_munmap	= -6,
};

enum release_method {
	release_by_mmap		= 0,
	release_by_madvise	= 1,
	release_by_max		= 2,
};

struct thread_argument {
	size_t page_size;
	int iterations, pages_per_thread, nr_threads;
	enum release_method method;
};

static enum thread_return mmap_release(void *p, size_t n)
{
	void *q;

	q = mmap(p, n, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
	if (p != q) {
		perror("thread_function: mmap release failed");
		return tr_mmap_free;
	}
	if (mprotect(p, n, PROT_READ | PROT_WRITE)) {
		perror("thread_function: mprotect failed");
		return tr_mprotect;
	}
	return tr_success;
}

static enum thread_return madvise_release(void *p, size_t n)
{
	if (madvise(p, n, MADV_DONTNEED)) {
		perror("thread_function: madvise failed");
		return tr_madvise;
	}
	return tr_success;
}

static enum thread_return (*release_methods[])(void *, size_t) = {
	mmap_release,
	madvise_release,
};

static void *thread_function(void *__arg)
{
	char *p;
	int i;
	struct thread_argument *arg = __arg;
	size_t arena_size = arg->pages_per_thread * arg->page_size;

	p = (char *)mmap(NULL, arena_size, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("thread_function: arena allocation failed");
		return (void *)tr_mmap_init;
	}
	for (i = 0; i < arg->iterations; i++) {
		size_t s;
		char *q, *r;
		enum thread_return ret;

		/* Pretend to use the buffer. */
		r = p + arena_size;
		for (q = p; q < r; q += arg->page_size)
			*q = 1;
		for (s = 0, q = p; q < r; q += arg->page_size)
			s += *q;
		if (arg->method >= release_by_max) {
			perror("thread_function: "
				"unknown freeing method specified");
			return (void *)tr_unknown;
		}
		ret = (*release_methods[arg->method])(p, arena_size);
		if (ret != tr_success)
			return (void *)ret;
	}
	if (munmap(p, arena_size)) {
		perror("thread_function: munmap() failed");
		return (void *)tr_munmap;
	}
	return (void *)tr_success;
}

static int configure(struct thread_argument *arg, int argc, char *argv[])
{
	char optstring[] = "t:m:i:p:";
	int c, tmp, ret = 0;
	long n;

	n = sysconf(_SC_PAGE_SIZE);
	if (n < 0) {
		perror("configure: sysconf(_SC_PAGE_SIZE) failed");
		ret = -1;
	}
	arg->nr_threads = 32,
	arg->page_size = (size_t)n;
	arg->method = release_by_mmap;
	arg->iterations = 10;
	arg->pages_per_thread = 128;
	while ((c = getopt(argc, argv, optstring)) != -1) {
		switch (c) {
		case 't':
			if (sscanf(optarg, "%d", &tmp) == 1)
				arg->nr_threads = tmp;
			else {
				perror("configure: non-numeric thread count");
				ret = -1;
			}
			break;
		case 'm':
			if (!strcmp(optarg, "mmap"))
				arg->method = release_by_mmap;
			else if (!strcmp(optarg, "madvise"))
				arg->method = release_by_madvise;
			else {
				perror("configure: unrecognised release method");
				ret = -1;
			}
			break;
		case 'i':
			if (sscanf(optarg, "%d", &tmp) == 1)
				arg->iterations = tmp;
			else {
				perror("configure: non-numeric iteration count");
				ret = -1;
			}
			break;
		case 'p':
			if (sscanf(optarg, "%d", &tmp) == 1)
				arg->pages_per_thread = tmp;
			else {
				perror("configure: non-numeric pages per thread count");
				ret = -1;
			}
			break;
		default:
			perror("unrecognized argument");
			ret = -1;
		}
	}
	if (arg->nr_threads <= 0) {
		perror("configure: zero or negative thread count");
		ret = -1;
	}
	if (arg->iterations < 0) {
		perror("configure: negative iteration count");
		ret = -1;
	}
	if (arg->pages_per_thread <= 0) {
		perror("configure: zero or negative arena size");
		ret = -1;
	}
	if (SIZE_MAX/arg->page_size < (size_t)arg->pages_per_thread) {
		perror("configure: arena size overflow");
		ret = -1;
	}
	return ret;
}

static unsigned long long timeval_to_usec(struct timeval *tv)
{
	return 1000000ULL*tv->tv_sec + tv->tv_usec;
}

static unsigned long long elapsed_usec(struct timeval *tv1, struct timeval *tv2)
{
	return timeval_to_usec(tv2) - timeval_to_usec(tv1);
}

#define user_usec(ru)		timeval_to_usec(&(ru)->ru_utime)
#define sys_usec(ru)		timeval_to_usec(&(ru)->ru_stime)
#define user_sec(ru)		((user_usec(ru) % 60000000ULL)/1000000.0)
#define sys_sec(ru)		((sys_usec(ru) % 60000000ULL)/1000000.0)
#define elapsed_sec(tv1, tv2) \
	((elapsed_usec(tv1, tv2) % 60000000ULL)/1000000.0)
#define user_min(ru)	((unsigned long)((user_usec(ru)/60000000ULL) % 60))
#define sys_min(ru)	((unsigned long)((sys_usec(ru)/60000000ULL) % 60))
Re: missing madvise functionality
On Wed, 04 Apr 2007 20:05:54 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> > @@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
> > 	unsigned long start;
> >
> > 	addr &= PAGE_MASK;
> > -	vma = find_vma(mm,addr);
> > +	vma = find_vma(mm,addr,&current->vmacache);
> > 	if (!vma)
> > 		return NULL;
> > 	if (vma->vm_start <= addr)
>
> So now you can have current calling find_extend_vma on someone else's mm
> but using their cache. So you're going to return current's vma, or
> current is going to get one of mm's vmas in its cache :P

This was not a working patch, just to throw the idea, since the answers I
got showed I was not understood. In this case, find_extend_vma() should
of course have one struct vm_area_cache * argument, like find_vma()

One single cache on one mm is not scalable. oprofile badly hits it on a
dual cpu config.
Re: missing madvise functionality
Eric Dumazet wrote:
> Well, I believe this one is too expensive. I was thinking of a light
> one :

This one seems worse. Passing your vm_area_cache around everywhere, which
is just intrusive and dangerous because it becomes decoupled from the mm
struct you are passing around. Watch this:

@@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
 	unsigned long start;
 
 	addr &= PAGE_MASK;
-	vma = find_vma(mm,addr);
+	vma = find_vma(mm,addr,&current->vmacache);
 	if (!vma)
 		return NULL;
 	if (vma->vm_start <= addr)

So now you can have current calling find_extend_vma on someone else's mm
but using their cache. So you're going to return current's vma, or
current is going to get one of mm's vmas in its cache :P

--
SUSE Labs, Novell Inc.
Re: missing madvise functionality
Eric Dumazet wrote:
> On Wed, 04 Apr 2007 18:55:18 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>> Peter Zijlstra wrote:
>>> On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote:
>>>> Eric Dumazet wrote:
>>>>> I do think such workloads might benefit from a vma_cache not shared
>>>>> by all threads but private to each thread. A sequence could
>>>>> invalidate the cache(s).
>>>>>
>>>>> ie instead of a mm->mmap_cache, having a mm->sequence, and each
>>>>> thread having a current->mmap_cache and current->mm_sequence
>>>>
>>>> I have a patchset to do exactly this, btw.
>>>
>>> /me too
>>>
>>> However, I decided against pushing it because when it does happen that
>>> a task is not involved with a vma lookup for longer than it takes the
>>> seq count to wrap we have a stale pointer...
>>>
>>> We could go and walk the tasks once in a while to reset the pointer,
>>> but it all got a tad involved.
>>
>> Well here is my core patch (against I think 2.6.16 + a set of vma cache
>> cleanups and abstractions). I didn't think the wrapping aspect was
>> terribly involved.
>
> Well, I believe this one is too expensive. I was thinking of a light
> one :
>
> I am not deleting mmap_sem, but adding a sequence number to mm_struct,
> that is incremented each time a vma is added/deleted, not each time
> mmap_sem is taken (read or write)

That's exactly what mine does (except IIRC it doesn't invalidate when you
add a vma).

> Each thread has its own copy of the sequence, taken at the time
> find_vma() had to do a full lookup.
>
> I believe some optimized paths could call check_vma_cache() without
> mmap_sem read lock taken, and if it fails, take the mmap_sem lock and do
> the slow path.

The mmap_sem for read does not only protect the mm_rb rbtree structure,
but the vmas themselves as well as their page tables, so you can't do
that.

You could do it if you had a lock-per-vma to synchronise against write
operations, and rcu-freed vmas or some such... but I don't think we
should go down a road like that until we first remove mmap_sem from low
hanging things (like private futexes!) and then see who's complaining.

--
SUSE Labs, Novell Inc.
Re: missing madvise functionality
On Wed, 04 Apr 2007 18:55:18 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > Peter Zijlstra wrote: > > On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote: > > > >>Eric Dumazet wrote: > > > > > >>>I do think such workloads might benefit from a vma_cache not shared by > >>>all threads but private to each thread. A sequence could invalidate the > >>>cache(s). > >>> > >>>ie instead of a mm->mmap_cache, having a mm->sequence, and each thread > >>>having a current->mmap_cache and current->mm_sequence > >> > >>I have a patchset to do exactly this, btw. > > > > > > /me too > > > > However, I decided against pushing it because when it does happen that a > > task is not involved with a vma lookup for longer than it takes the seq > > count to wrap we have a stale pointer... > > > > We could go and walk the tasks once in a while to reset the pointer, but > > it all got a tad involved. > > Well here is my core patch (against I think 2.6.16 + a set of vma cache > cleanups and abstractions). I didn't think the wrapping aspect was > terribly involved. Well, I believe this one is too expensive. I was thinking of a light one : I am not deleting mmap_sem, but adding a sequence number to mm_struct, that is incremented each time a vma is added/deleted, not each time mmap_sem is taken (read or write) Each thread has its own copy of the sequence, taken at the time find_vma() had to do a full lookup. I believe some optimized paths could call check_vma_cache() without mmap_sem read lock taken, and if it fails, take the mmap_sem lock and do the slow path. 
--- linux-2.6.21-rc5/include/linux/sched.h
+++ linux-2.6.21-rc5-ed/include/linux/sched.h
@@ -319,10 +319,14 @@ typedef unsigned long mm_counter_t;
 	(mm)->hiwater_vm = (mm)->total_vm;	\
 } while (0)
 
+struct vm_area_cache {
+	struct vm_area_struct * mmap_cache;	/* last find_vma result */
+	unsigned int sequence;
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
-	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 	unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
@@ -336,6 +340,7 @@ struct mm_struct {
 	atomic_t mm_count;	/* How many references to "struct mm_struct" (users count as 1) */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
+	unsigned int mm_sequence;
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
 
 	struct list_head mmlist;	/* List of maybe swapped mm's.  These are globally strung */
@@ -875,7 +880,7 @@ struct task_struct {
 	struct list_head tasks;
 
 	struct mm_struct *mm, *active_mm;
-
+	struct vm_area_cache vmacache;
 	/* task state */
 	struct linux_binfmt *binfmt;
 	int exit_state;
--- linux-2.6.21-rc5/include/linux/mm.h
+++ linux-2.6.21-rc5-ed/include/linux/mm.h
@@ -1176,15 +1176,18 @@ extern int expand_upwards(struct vm_area
 #endif
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
+extern struct vm_area_struct * find_vma(struct mm_struct * mm,
+					unsigned long addr,
+					struct vm_area_cache *cache);
 extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
 					     struct vm_area_struct **pprev);
 
 /* Look up the first VMA which intersects the interval start_addr..end_addr-1,
    NULL if none.  Assume start_addr < end_addr. */
-static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr)
+static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm,
+	unsigned long start_addr, unsigned long end_addr, struct vm_area_cache *cache)
 {
-	struct vm_area_struct * vma = find_vma(mm,start_addr);
+	struct vm_area_struct * vma = find_vma(mm,start_addr,cache);
 
 	if (vma && end_addr <= vma->vm_start)
 		vma = NULL;
--- linux-2.6.21-rc5/mm/mmap.c
+++ linux-2.6.21-rc5-ed/mm/mmap.c
@@ -267,7 +267,7 @@ asmlinkage unsigned long sys_brk(unsigne
 	}
 
 	/* Check against existing mmap mappings. */
-	if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
+	if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE, &current->vmacache))
 		goto out;
 
 	/* Ok, looks good - let it rip. */
@@ -447,6 +447,7 @@ static void vma_link(struct mm_struct *m
 		spin_unlock(&mapping->i_mmap_lock);
 
 	mm->map_count++;
+	mm->mm_sequence++;
 	validate_mm(mm);
 }
Re: missing madvise functionality
William Lee Irwin III wrote:
> On Wed, Apr 04, 2007 at 06:55:18PM +1000, Nick Piggin wrote:
> > +	rcu_read_lock();
> > +	do {
> > +		t->vma_cache_sequence = -1;
> > +		t = next_thread(t);
> > +	} while (t != curr);
> > +	rcu_read_unlock();
> LD_ASSUME_KERNEL=2.4.18 anyone?

Meaning?

-- SUSE Labs, Novell Inc.
Re: missing madvise functionality
On Tue, 3 Apr 2007, Andrew Morton wrote: > > All of which indicates that if we can remove the down_write(mmap_sem) from > this glibc operation, things should get a lot better - there will be no > additional context switches at all. > > And we can surely do that if all we're doing is looking up pageframes, > putting pages into fake-swapcache and moving them around on the page LRUs. > > Hugh? Sanity check? Setting aside the fake-swapcache part, yes, Rik should be able to do what Ulrich wants (operating on ptes and pages) without down_write(mmap_sem): just needing down_read(mmap_sem) to keep the whole vma/pagetable structure stable, and page table lock (literal or per-page-table) for each contents. (I didn't understand how Rik would achieve his point 5, _no_ lock contention while repeatedly re-marking these pages, but never mind.) (Some mails in this thread overlook that we also use down_write(mmap_sem) to guard simple things like vma->vm_flags: of course that in itself could be manipulated with atomics, or spinlock; but like many of the vma fields, changing it goes hand in hand with the chance that we have to split vma, which does require the heavy-handed down_write(mmap_sem). I expect that splitting those uses apart would be harder than first appears, and better to go for a more radical redesign - I don't know what.) But you lose me with the fake-swapcache part of it: that came, I think, from your initial idea that it would be okay to refault on these ptes. Don't we all agree now that we'd prefer not to refault on those ptes, unless some memory pressure has actually decided to pull them out? (Hmm, yet more list balancing...) Hugh
Re: missing madvise functionality
On Wed, Apr 04, 2007 at 06:55:18PM +1000, Nick Piggin wrote:
> +	rcu_read_lock();
> +	do {
> +		t->vma_cache_sequence = -1;
> +		t = next_thread(t);
> +	} while (t != curr);
> +	rcu_read_unlock();

LD_ASSUME_KERNEL=2.4.18 anyone?

-- wli
Re: missing madvise functionality
Peter Zijlstra wrote:
> On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote:
> > Eric Dumazet wrote:
> > > I do think such workloads might benefit from a vma_cache not shared by all threads but private to each thread. A sequence could invalidate the cache(s).
> > > ie instead of a mm->mmap_cache, having a mm->sequence, and each thread having a current->mmap_cache and current->mm_sequence
> > I have a patchset to do exactly this, btw.
> /me too
> However, I decided against pushing it because when it does happen that a task is not involved with a vma lookup for longer than it takes the seq count to wrap we have a stale pointer...
> We could go and walk the tasks once in a while to reset the pointer, but it all got a tad involved.

Well here is my core patch (against I think 2.6.16 + a set of vma cache cleanups and abstractions). I didn't think the wrapping aspect was terribly involved.

-- SUSE Labs, Novell Inc.

Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -296,6 +296,8 @@ struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
 	struct vm_area_struct *vma_cache;	/* find_vma cache */
+	unsigned long vma_sequence;
+
 	unsigned long (*get_unmapped_area) (struct file *filp,
 		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags);
@@ -693,6 +695,8 @@ enum sleep_type {
 	SLEEP_INTERRUPTED,
 };
 
+#define VMA_CACHE_SIZE	4
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -734,6 +738,8 @@ struct task_struct {
 	struct list_head ptrace_list;
 
 	struct mm_struct *mm, *active_mm;
+	struct vm_area_struct *vma_cache[VMA_CACHE_SIZE];
+	unsigned long vma_cache_sequence;
 
 	/* task state */
 	struct linux_binfmt *binfmt;
Index: linux-2.6/mm/mmap.c
===
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -32,6 +32,40 @@
 
 static void vma_cache_touch(struct mm_struct *mm, struct vm_area_struct *vma)
 {
+	struct task_struct *curr = current;
+
+	if (mm == curr->mm) {
+		int i;
+
+		if (curr->vma_cache_sequence != mm->vma_sequence) {
+			curr->vma_cache_sequence = mm->vma_sequence;
+			curr->vma_cache[0] = vma;
+			for (i = 1; i < VMA_CACHE_SIZE; i++)
+				curr->vma_cache[i] = NULL;
+		} else {
+			int update_mm;
+
+			if (curr->vma_cache[0] == vma)
+				return;
+
+			for (i = 1; i < VMA_CACHE_SIZE; i++) {
+				if (curr->vma_cache[i] == vma)
+					break;
+			}
+			update_mm = 0;
+			if (i == VMA_CACHE_SIZE) {
+				update_mm = 1;
+				i = VMA_CACHE_SIZE-1;
+			}
+			while (i != 0) {
+				curr->vma_cache[i] = curr->vma_cache[i-1];
+				i--;
+			}
+			curr->vma_cache[0] = vma;
+
+			if (!update_mm)
+				return;
+		}
+	}
+
 	if (mm->vma_cache != vma) /* prevent cacheline bouncing */
 		mm->vma_cache = vma;
 }
@@ -39,27 +73,56 @@ static void vma_cache_touch(struct mm_st
 static void vma_cache_replace(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct vm_area_struct *repl)
 {
+	mm->vma_sequence++;
+	if (unlikely(mm->vma_sequence == 0)) {
+		struct task_struct *curr = current, *t;
+
+		t = curr;
+		rcu_read_lock();
+		do {
+			t->vma_cache_sequence = -1;
+			t = next_thread(t);
+		} while (t != curr);
+		rcu_read_unlock();
+	}
+
 	if (mm->vma_cache == vma)
 		mm->vma_cache = repl;
 }
 
 static void vma_cache_invalidate(struct mm_struct *mm,
 		struct vm_area_struct *vma)
 {
-	if (mm->vma_cache == vma)
-		mm->vma_cache = NULL;
+	vma_cache_replace(mm, vma, NULL);
 }
 
 static struct vm_area_struct *vma_cache_find(struct mm_struct *mm,
 		unsigned long addr)
 {
-	struct vm_area_struct *vma = mm->vma_cache;
+	struct task_struct *curr;
+	struct vm_area_struct *vma;
 
 	preempt_disable();
 	__inc_page_state(vma_cache_query);
-	if (vma
Re: missing madvise functionality
Jakub Jelinek wrote:
> On Wed, Apr 04, 2007 at 05:46:12PM +1000, Nick Piggin wrote:
> > Does mmap(PROT_NONE) actually free the memory?
> Yes.
> 	/* Clear old maps */
> 	error = -ENOMEM;
> munmap_back:
> 	vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
> 	if (vma && vma->vm_start < addr + len) {
> 		if (do_munmap(mm, addr, len))
> 			return -ENOMEM;
> 		goto munmap_back;
> 	}

Thanks, I overlooked the mmap vs mprotect detail. So how are the subsequent access faults avoided?

> > In the case of pages being unused then almost immediately reused, why is it a bad solution to avoid freeing? Is it that you want to avoid heuristics because in some cases they could fail and end up using memory?
> free(3) doesn't know if the memory will be reused soon, late or never. So avoiding trimming could substantially increase memory consumption with certain malloc/free patterns, especially in threaded programs that use multiple arenas. Implementing some sort of deferred memory trimming in malloc is "solving" the problem in a wrong place, each app really has no idea (and should not have) what the current system memory pressure is.

Thanks for the clarification.

> > Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault than a syscall? (including the cost of the TLB fill for the memory access after the syscall, of course).
> That's page fault per page rather than a syscall for the whole chunk, furthermore zeroing is expensive.

Ah, for big allocations. OK, we could make a MADV_POPULATE to prefault pages (like mmap's MAP_POPULATE, but without the down_write(mmap_sem)). If you're just about to use the pages anyway, how much of a win would it be to avoid zeroing? We allocate cache hot pages for these guys...

-- SUSE Labs, Novell Inc.
Re: missing madvise functionality
On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote: > Eric Dumazet wrote: > > I do think such workloads might benefit from a vma_cache not shared by > > all threads but private to each thread. A sequence could invalidate the > > cache(s). > > > > ie instead of a mm->mmap_cache, having a mm->sequence, and each thread > > having a current->mmap_cache and current->mm_sequence > > I have a patchset to do exactly this, btw. /me too However, I decided against pushing it because when it does happen that a task is not involved with a vma lookup for longer than it takes the seq count to wrap we have a stale pointer... We could go and walk the tasks once in a while to reset the pointer, but it all got a tad involved.
Re: missing madvise functionality
On Wed, Apr 04, 2007 at 05:46:12PM +1000, Nick Piggin wrote:
> Does mmap(PROT_NONE) actually free the memory?

Yes.

	/* Clear old maps */
	error = -ENOMEM;
munmap_back:
	vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
	if (vma && vma->vm_start < addr + len) {
		if (do_munmap(mm, addr, len))
			return -ENOMEM;
		goto munmap_back;
	}

> In the case of pages being unused then almost immediately reused, why is
> it a bad solution to avoid freeing? Is it that you want to avoid
> heuristics because in some cases they could fail and end up using memory?

free(3) doesn't know if the memory will be reused soon, late or never. So avoiding trimming could substantially increase memory consumption with certain malloc/free patterns, especially in threaded programs that use multiple arenas. Implementing some sort of deferred memory trimming in malloc is "solving" the problem in a wrong place, each app really has no idea (and should not have) what the current system memory pressure is.

> Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
> than a syscall? (including the cost of the TLB fill for the memory access
> after the syscall, of course).

That's a page fault per page rather than a syscall for the whole chunk, furthermore zeroing is expensive. We really want something like FreeBSD MADV_FREE in Linux, see e.g. http://mail.nl.linux.org/linux-mm/2000-03/msg00059.html for some details. Apparently FreeBSD malloc has been using MADV_FREE for years (according to their CVS, for 10 years already).

Jakub
Re: missing madvise functionality
Nick Piggin wrote:
> Ulrich Drepper wrote:
> > People might remember the thread about mysql not scaling and pointing the finger quite happily at glibc. Well, the situation is not like that. The problem is glibc has to work around kernel limitations. If the malloc implementation detects that a large chunk of previously allocated memory is now free and unused it wants to return the memory to the system. What we currently have to do is this:
> > to free: mmap(PROT_NONE) over the area
> > to reuse: mprotect(PROT_READ|PROT_WRITE)
> > Yep, that's expensive, both operations need to get locks preventing other threads from doing the same. Some people were quick to suggest that we simply avoid the freeing in many situations (that's what the patch submitted by Yanmin Zhang basically does). That's no solution. One of the very good properties of the current allocator is that it does not use much memory.
> Does mmap(PROT_NONE) actually free the memory?
> > A solution for this problem is a madvise() operation with the following property:
> > - the content of the address range can be discarded
> > - if an access to a page in the range happens in the future it must succeed. The old page content can be provided or a new, empty page can be provided
> > That's it. The current MADV_DONTNEED doesn't cut it because it zaps the pages, causing *all* future reuses to create page faults. This is what I guess happens in the mysql test case where the pages were unused and freed but then almost immediately reused. The page faults erased all the benefits of using one mprotect() call vs a pair of mmap()/mprotect() calls.
> Two questions. In the case of pages being unused then almost immediately reused, why is it a bad solution to avoid freeing? Is it that you want to avoid heuristics because in some cases they could fail and end up using memory?
> Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault than a syscall? (including the cost of the TLB fill for the memory access after the syscall, of course).
> zapping the pages puts them on a nice LIFO cache hot list of pages that can be quickly used when the next fault comes in, or used for any other allocation in the kernel. Putting them on some sort of reclaim list seems a bit pointless.
> Oh, also: something like this patch would help out MADV_DONTNEED, as it means it can run concurrently with page faults. I think the locking will work (but needs forward porting).

BTW, this way it becomes much more attractive than using mmap/mprotect can ever be, because they must take mmap_sem for writing always.

You don't actually need to protect the ranges unless running with use after free debugging turned on, do you?

-- SUSE Labs, Novell Inc.
Re: missing madvise functionality
Ulrich Drepper wrote:
> People might remember the thread about mysql not scaling and pointing the finger quite happily at glibc. Well, the situation is not like that. The problem is glibc has to work around kernel limitations. If the malloc implementation detects that a large chunk of previously allocated memory is now free and unused it wants to return the memory to the system. What we currently have to do is this:
> to free: mmap(PROT_NONE) over the area
> to reuse: mprotect(PROT_READ|PROT_WRITE)
> Yep, that's expensive, both operations need to get locks preventing other threads from doing the same. Some people were quick to suggest that we simply avoid the freeing in many situations (that's what the patch submitted by Yanmin Zhang basically does). That's no solution. One of the very good properties of the current allocator is that it does not use much memory.

Does mmap(PROT_NONE) actually free the memory?

> A solution for this problem is a madvise() operation with the following property:
> - the content of the address range can be discarded
> - if an access to a page in the range happens in the future it must succeed. The old page content can be provided or a new, empty page can be provided
> That's it. The current MADV_DONTNEED doesn't cut it because it zaps the pages, causing *all* future reuses to create page faults. This is what I guess happens in the mysql test case where the pages were unused and freed but then almost immediately reused. The page faults erased all the benefits of using one mprotect() call vs a pair of mmap()/mprotect() calls.

Two questions. In the case of pages being unused then almost immediately reused, why is it a bad solution to avoid freeing? Is it that you want to avoid heuristics because in some cases they could fail and end up using memory?

Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault than a syscall? (including the cost of the TLB fill for the memory access after the syscall, of course).

zapping the pages puts them on a nice LIFO cache hot list of pages that can be quickly used when the next fault comes in, or used for any other allocation in the kernel. Putting them on some sort of reclaim list seems a bit pointless.

Oh, also: something like this patch would help out MADV_DONTNEED, as it means it can run concurrently with page faults. I think the locking will work (but needs forward porting).

-- SUSE Labs, Novell Inc.

Index: linux-2.6/mm/madvise.c
===
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -12,6 +12,25 @@
 #include
 
 /*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_sem for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static int madvise_need_mmap_write(int behavior)
+{
+	switch (behavior) {
+	case MADV_DOFORK:
+	case MADV_DONTFORK:
+	case MADV_NORMAL:
+	case MADV_SEQUENTIAL:
+	case MADV_RANDOM:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+/*
  * We can potentially split a vm area into separate
  * areas, each area with its own behavior.
  */
@@ -264,7 +283,10 @@ asmlinkage long sys_madvise(unsigned lon
 	int error = -EINVAL;
 	size_t len;
 
-	down_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		down_write(&current->mm->mmap_sem);
+	else
+		down_read(&current->mm->mmap_sem);
 
 	if (start & ~PAGE_MASK)
 		goto out;
@@ -323,6 +345,10 @@ asmlinkage long sys_madvise(unsigned lon
 		vma = prev->vm_next;
 	}
 out:
-	up_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		up_write(&current->mm->mmap_sem);
+	else
+		up_read(&current->mm->mmap_sem);
+
 	return error;
 }
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
On Tue, 03 Apr 2007 23:54:42 -0700 Ulrich Drepper <[EMAIL PROTECTED]> wrote: > Eric Dumazet wrote: > > You were CC on this one, you can find an archive here : > > You cc:ed my gmail account. I don't pick out mails sent to me there. > If you want me to look at something you have to send it to my > @redhat.com address. What I meant is : You got the mails and even replied to one of them :) http://lkml.org/lkml/2007/3/15/303 I will try to remember your email address, thanks.
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
Eric Dumazet wrote: > You were CC on this one, you can find an archive here : You cc:ed my gmail account. I don't pick out mails sent to me there. If you want me to look at something you have to send it to my @redhat.com address. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ signature.asc Description: OpenPGP digital signature
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
Ulrich Drepper wrote:
> Nick Piggin wrote:
> > Sad. Although Ulrich did seem interested at one point I think? Ulrich, do you agree at least with the interface that Eric is proposing?
> I have no idea what you're talking about.

You were CC on this one, you can find an archive here : http://lkml.org/lkml/2007/3/15/230

This avoids mmap_sem for private futexes (PTHREAD_PROCESS_PRIVATE semantic)
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
Ulrich Drepper wrote:
> Nick Piggin wrote:
> > Sad. Although Ulrich did seem interested at one point I think? Ulrich, do you agree at least with the interface that Eric is proposing?
> I have no idea what you're talking about.

Private futexes.

-- SUSE Labs, Novell Inc.
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
Nick Piggin wrote: > Sad. Although Ulrich did seem interested at one point I think? Ulrich, > do you agree at least with the interface that Eric is proposing? I have no idea what you're talking about. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ signature.asc Description: OpenPGP digital signature
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
(sorry to change the subject, I was initially going to send the threaded vma cache patches on list, but then decided they didn't have enough changelog!)

Andrew Morton wrote:
> On Wed, 04 Apr 2007 16:09:40 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Andrew, do you have any objections to putting Eric's fairly important patch at least into -mm?
> you know what to do ;)

Well I did review them when he last posted, but simply didn't have much to say (that happened in a much older discussion about the private futex problem, and I ended up agreeing with this approach). Anyway I'll have another look when they get posted again.

-- SUSE Labs, Novell Inc.
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
On Wed, 04 Apr 2007 16:09:40 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > Andrew, do you have any objections to putting Eric's fairly > important patch at least into -mm? you know what to do ;)
[patches] threaded vma patches (was Re: missing madvise functionality)
Eric Dumazet wrote:
> Nick Piggin wrote:
> > Eric Dumazet wrote:
> > > I do think such workloads might benefit from a vma_cache not shared by all threads but private to each thread. A sequence could invalidate the cache(s).
> > > ie instead of a mm->mmap_cache, having a mm->sequence, and each thread having a current->mmap_cache and current->mm_sequence
> > I have a patchset to do exactly this, btw.
> Could you repost it please ?

Sure. I'll send you them privately because they're against an older kernel.

> > Anyway what is the status of the private futex work. I don't think that is very intrusive or complicated, so it should get merged ASAP (so then at least we have the interface there).
> It seems nobody but you and me cared.

Sad. Although Ulrich did seem interested at one point I think? Ulrich, do you agree at least with the interface that Eric is proposing?

If yes, then Andrew, do you have any objections to putting Eric's fairly important patch at least into -mm?

-- SUSE Labs, Novell Inc.
[patches] threaded vma patches (was Re: missing madvise functionality)
Eric Dumazet wrote: Nick Piggin a écrit : Eric Dumazet wrote: I do think such workloads might benefit from a vma_cache not shared by all threads but private to each thread. A sequence could invalidate the cache(s). ie instead of a mm-mmap_cache, having a mm-sequence, and each thread having a current-mmap_cache and current-mm_sequence I have a patchset to do exactly this, btw. Could you repost it please ? Sure. I'll send you them privately because they're against an older kernel. Anyway what is the status of the private futex work. I don't think that is very intrusive or complicated, so it should get merged ASAP (so then at least we have the interface there). It seems nobody but you and me cared. Sad. Although Ulrich did seem interested at one point I think? Ulrich, do you agree at least with the interface that Eric is proposing? If yes, then Andrew, do you have any objections to putting Eric's fairly important patch at least into -mm? -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
On Wed, 04 Apr 2007 16:09:40 +1000 Nick Piggin [EMAIL PROTECTED] wrote: Andrew, do you have any objections to putting Eric's fairly important patch at least into -mm? you know what to do ;) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
(sorry to change the subjet, I was initially going to send the threaded vma cache patches on list, but then decided they didn't have enough changelog!) Andrew Morton wrote: On Wed, 04 Apr 2007 16:09:40 +1000 Nick Piggin [EMAIL PROTECTED] wrote: Andrew, do you have any objections to putting Eric's fairly important patch at least into -mm? you know what to do ;) Well I did review them when he last posted, but simply didn't have much to say (that happened in a much older discussion about the private futex problem, and I ended up agreeing with this approach). Anyway I'll have another look when they get posted again. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
Nick Piggin wrote: Sad. Although Ulrich did seem interested at one point I think? Ulrich, do you agree at least with the interface that Eric is proposing? I have no idea what you're talking about. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ signature.asc Description: OpenPGP digital signature
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
Ulrich Drepper wrote: Nick Piggin wrote: Sad. Although Ulrich did seem interested at one point I think? Ulrich, do you agree at least with the interface that Eric is proposing? I have no idea what you're talking about. Private futexes. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
Ulrich Drepper a écrit : Nick Piggin wrote: Sad. Although Ulrich did seem interested at one point I think? Ulrich, do you agree at least with the interface that Eric is proposing? I have no idea what you're talking about. You were CC on this one, you can find an archive here : http://lkml.org/lkml/2007/3/15/230 This avoids mmap_sem for private futexes (PTHREAD_PROCESS_PRIVATE semantic) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
Eric Dumazet wrote:
> You were CCed on this one; you can find an archive here:

You cc:ed my gmail account. I don't pick out mails sent to me there. If you want me to look at something you have to send it to my @redhat.com address.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA
Re: [patches] threaded vma patches (was Re: missing madvise functionality)
On Tue, 03 Apr 2007 23:54:42 -0700 Ulrich Drepper <[EMAIL PROTECTED]> wrote:
> Eric Dumazet wrote:
> > You were CCed on this one; you can find an archive here:
>
> You cc:ed my gmail account. I don't pick out mails sent to me there.
> If you want me to look at something you have to send it to my
> @redhat.com address.

What I meant is: you got the mails and even replied to one of them :)

http://lkml.org/lkml/2007/3/15/303

I will try to remember your email address, thanks.
Re: missing madvise functionality
Ulrich Drepper wrote:
> People might remember the thread about mysql not scaling and pointing
> the finger quite happily at glibc. Well, the situation is not like
> that. The problem is that glibc has to work around kernel limitations.
>
> If the malloc implementation detects that a large chunk of previously
> allocated memory is now free and unused, it wants to return the memory
> to the system. What we currently have to do is this:
>
>   to free:  mmap(PROT_NONE) over the area
>   to reuse: mprotect(PROT_READ|PROT_WRITE)
>
> Yep, that's expensive; both operations need to take locks, preventing
> other threads from doing the same.
>
> Some people were quick to suggest that we simply avoid the freeing in
> many situations (that's what the patch submitted by Yanmin Zhang
> basically does). That's no solution. One of the very good properties
> of the current allocator is that it does not use much memory.

Does mmap(PROT_NONE) actually free the memory?

> A solution for this problem is a madvise() operation with the
> following properties:
>
>  - the content of the address range can be discarded
>  - if an access to a page in the range happens in the future, it must
>    succeed. The old page content can be provided, or a new, empty page
>    can be provided
>
> That's it. The current MADV_DONTNEED doesn't cut it because it zaps
> the pages, causing *all* future reuses to create page faults. This is
> what I guess happens in the mysql test case, where the pages were
> unused and freed but then almost immediately reused. The page faults
> erased all the benefits of using one mprotect() call vs a pair of
> mmap()/mprotect() calls.

Two questions. In the case of pages being unused then almost immediately reused, why is it a bad solution to avoid freeing? Is it that you want to avoid heuristics because in some cases they could fail and end up using memory?

Secondly, why is MADV_DONTNEED bad? How much more expensive is a page fault than a syscall (including the cost of the TLB fill for the memory access after the syscall, of course)?
Zapping the pages puts them on a nice LIFO cache-hot list of pages that can be quickly used when the next fault comes in, or used for any other allocation in the kernel. Putting them on some sort of reclaim list seems a bit pointless.

Oh, also: something like this patch would help out MADV_DONTNEED, as it means it can run concurrently with page faults. I think the locking will work (but needs forward porting).

--
SUSE Labs, Novell Inc.

Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -12,6 +12,25 @@
 #include <linux/hugetlb.h>
 
 /*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_sem for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static int madvise_need_mmap_write(int behavior)
+{
+	switch (behavior) {
+	case MADV_DOFORK:
+	case MADV_DONTFORK:
+	case MADV_NORMAL:
+	case MADV_SEQUENTIAL:
+	case MADV_RANDOM:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+/*
  * We can potentially split a vm area into separate
  * areas, each area with its own behavior.
  */
@@ -264,7 +283,10 @@ asmlinkage long sys_madvise(unsigned lon
 	int error = -EINVAL;
 	size_t len;
 
-	down_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		down_write(&current->mm->mmap_sem);
+	else
+		down_read(&current->mm->mmap_sem);
 
 	if (start & ~PAGE_MASK)
 		goto out;
@@ -323,6 +345,10 @@ asmlinkage long sys_madvise(unsigned lon
 		vma = prev->vm_next;
 	}
 out:
-	up_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		up_write(&current->mm->mmap_sem);
+	else
+		up_read(&current->mm->mmap_sem);
+
 	return error;
 }