Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-06 Thread Ingo Molnar

* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > i've attached an updated version of trace-it.c, which will turn this 
> > off itself, using a sysctl. I also made WAKEUP_TIMING default-off.
> 
> ok.  http://userweb.kernel.org/~akpm/to-ingo.txt is the trace of
> 
>   taskset -c 0 ./jakubs-test-app
> 
> while the system was doing the 150,000 context switches/sec.
> 
> It isn't very interesting.

this shows an idle CPU#7: you should taskset -c 0 trace-it too - it only 
traces the current CPU by default. (there's the 
/proc/sys/kernel/trace_all_cpus flag to trace all cpus, but in this case 
we really want the trace of CPU#0)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-06 Thread Andrew Morton
On Fri, 6 Apr 2007 11:08:22 +0200
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> * Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > > getting a good trace of it is easy: pick up the latest -rt kernel 
> > > from:
> > > 
> > >   http://redhat.com/~mingo/realtime-preempt/
> > > 
> > > enable EVENT_TRACING in that kernel, run the workload and do:
> > > 
> > >   scripts/trace-it > to-ingo.txt
> > > 
> > > and send me the output.
> > 
> > Did that - no output was generated.  config at
> > http://userweb.kernel.org/~akpm/config-akpm2.txt
> 
> sorry, i forgot to mention that you should turn off 
> CONFIG_WAKEUP_TIMING.
> 
> i've attached an updated version of trace-it.c, which will turn this off 
> itself, using a sysctl. I also made WAKEUP_TIMING default-off.

ok.  http://userweb.kernel.org/~akpm/to-ingo.txt is the trace of

taskset -c 0 ./jakubs-test-app

while the system was doing the 150,000 context switches/sec.

It isn't very interesting.


Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-06 Thread Ingo Molnar

* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > getting a good trace of it is easy: pick up the latest -rt kernel 
> > from:
> > 
> > http://redhat.com/~mingo/realtime-preempt/
> > 
> > enable EVENT_TRACING in that kernel, run the workload and do:
> > 
> > scripts/trace-it > to-ingo.txt
> > 
> > and send me the output.
> 
> Did that - no output was generated.  config at
> http://userweb.kernel.org/~akpm/config-akpm2.txt

sorry, i forgot to mention that you should turn off 
CONFIG_WAKEUP_TIMING.

i've attached an updated version of trace-it.c, which will turn this off 
itself, using a sysctl. I also made WAKEUP_TIMING default-off.

> I did get an interesting dmesg spew:
> http://userweb.kernel.org/~akpm/dmesg-akpm2.txt

yeah, it's stack footprint measurement/instrumentation. It's 
particularly effective at tracking the worst-case stack footprint if you 
have FUNCTION_TRACING enabled - because in that case the kernel measures 
the stack's size at every function entry point. It does a maximum search 
after bootup (in search of the 'largest' stack frame), so it's a bit 
verbose, but it gets a lot rarer later on. If it bothers you then disable:

  CONFIG_DEBUG_STACKOVERFLOW=y

it could interfere with getting a quality scheduling trace anyway.

Ingo

/*
 * Copyright (C) 2005, Ingo Molnar <[EMAIL PROTECTED]>
 *
 * user-triggered tracing.
 *
 * The -rt kernel has a built-in kernel tracer, which will trace
 * all kernel function calls (and a couple of special events as well),
 * by using a build-time gcc feature that instruments all kernel
 * functions.
 *
 * The tracer is highly automated for a number of latency tracing purposes,
 * but it can also be switched into 'user-triggered' mode, which is a
 * half-automatic tracing mode where userspace apps start and stop the
 * tracer. This file shows a dumb example how to turn user-triggered
 * tracing on, and how to start/stop tracing. Note that if you do
 * multiple start/stop sequences, the kernel will do a maximum search
 * over their latencies, and will keep the trace of the largest latency
 * in /proc/latency_trace. The maximums are also reported to the kernel
 * log. (but can also be read from /proc/sys/kernel/preempt_max_latency)
 *
 * For the tracer to be activated, turn on CONFIG_EVENT_TRACING
 * in the .config, rebuild the kernel and boot into it. The trace will
 * get _a lot_ more verbose if you also turn on CONFIG_FUNCTION_TRACING,
 * every kernel function call will be put into the trace. Note that
 * CONFIG_FUNCTION_TRACING has significant runtime overhead, so you don't
 * want to use it for performance testing :)
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <linux/unistd.h>

int main (int argc, char **argv)
{
	int ret;

	if (getuid() != 0) {
		fprintf(stderr, "needs to run as root.\n");
		exit(1);
	}
	ret = system("cat /proc/sys/kernel/mcount_enabled >/dev/null 2>/dev/null");
	if (ret) {
		fprintf(stderr, "CONFIG_LATENCY_TRACING not enabled?\n");
		exit(1);
	}
	system("echo 1 > /proc/sys/kernel/trace_user_triggered");
	system("[ -e /proc/sys/kernel/wakeup_timing ] && echo 0 > /proc/sys/kernel/wakeup_timing");
	system("echo 1 > /proc/sys/kernel/trace_enabled");
	system("echo 1 > /proc/sys/kernel/mcount_enabled");
	system("echo 0 > /proc/sys/kernel/trace_freerunning");
	system("echo 0 > /proc/sys/kernel/trace_print_on_crash");
	system("echo 0 > /proc/sys/kernel/trace_verbose");
	system("echo 0 > /proc/sys/kernel/preempt_thresh 2>/dev/null");
	system("echo 0 > /proc/sys/kernel/preempt_max_latency 2>/dev/null");

	// start tracing
	if (prctl(0, 1)) {
		fprintf(stderr, "trace-it: couldnt start tracing!\n");
		return 1;
	}
	usleep(100);
	// stop tracing
	if (prctl(0, 0)) {
		fprintf(stderr, "trace-it: couldnt stop tracing!\n");
		return 1;
	}

	system("echo 0 > /proc/sys/kernel/trace_user_triggered");
	system("echo 0 > /proc/sys/kernel/trace_enabled");
	system("cat /proc/latency_trace");

	return 0;
}





Re: missing madvise functionality

2007-04-05 Thread Nick Piggin

Ulrich Drepper wrote:

Nick Piggin wrote:


Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
mmap/mprotect, which have more fundamental locking requirements, more
overhead and no benefits (except debugging, I suppose).



It's a tiny bit faster, see

  http://people.redhat.com/drepper/dontneed.png

I just ran it once so the graph is not smooth.  This is on a UP dual
core machine.  Maybe tomorrow I'll turn on the big 4p machine.


Hmm, I saw an improvement, but that was just on a raw syscall test
with a single page chunk. Real-world use I guess will get progressively
less dramatic as other overheads start being introduced.

Multi-thread performance probably won't get a whole lot better (it does
eliminate 1 down_write(mmap_sem), but one remains) until you use my
madvise patch.



I would have to see dramatically different results on the big machine to
make me change the libc code.  The reason is that there is a big drawback.

So far, when we allocate a new arena, we allocate address space with
PROT_NONE and only when we need memory the protection is changed to
PROT_READ|PROT_WRITE.  This is the advantage of catching wild pointer
accesses.


Sure, yes. And I guess you'd always want to keep that option around as
a debugging aid.

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-05 Thread Ulrich Drepper
Nick Piggin wrote:
> Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
> kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
> mmap/mprotect, which have more fundamental locking requirements, more
> overhead and no benefits (except debugging, I suppose).

It's a tiny bit faster, see

  http://people.redhat.com/drepper/dontneed.png

I just ran it once so the graph is not smooth.  This is on a UP dual
core machine.  Maybe tomorrow I'll turn on the big 4p machine.

I would have to see dramatically different results on the big machine to
make me change the libc code.  The reason is that there is a big drawback.

So far, when we allocate a new arena, we allocate address space with
PROT_NONE and only when we need memory the protection is changed to
PROT_READ|PROT_WRITE.  This is the advantage of catching wild pointer
accesses.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



signature.asc
Description: OpenPGP digital signature


Re: missing madvise functionality

2007-04-05 Thread Nick Piggin

Ulrich Drepper wrote:

In case somebody wants to play around with Rik patch or another
madvise-based patch, I have x86-64 glibc binaries which can use it:

  http://people.redhat.com/drepper/rpms

These are based on the latest Fedora rawhide version.  They should work
on older systems, too, but you screw up your updates.  Use them only if
you know what you do.

By default madvise(MADV_DONTNEED) is used.  With the environment variable


Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
mmap/mprotect, which have more fundamental locking requirements, more
overhead and no benefits (except debugging, I suppose).

MADV_DONTNEED is twice as fast in single threaded performance, and an
order of magnitude faster for multiple threads, when MADV_DONTNEED only
takes mmap_sem for read.

Do you plan to include this change in general glibc releases? Maybe it
will make google malloc obsolete? ;) (I don't suppose you'd be able to
get any tests done, Andrew?)

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-05 Thread Nick Piggin

Rik van Riel wrote:

Nick Piggin wrote:


Oh, also: something like this patch would help out MADV_DONTNEED, as it
means it can run concurrently with page faults. I think the locking will
work (but needs forward porting).



Ironically, your patch decreases throughput on my quad core
test system, with Jakub's test case.

MADV_DONTNEED, my patch, 1 loops  (14k context switches/second)

real0m34.890s
user0m17.256s
sys 0m29.797s


MADV_DONTNEED, my patch & your patch, 1 loops  (50 context switches/second)


real1m8.321s
user0m20.840s
sys 1m55.677s

I suspect it's moving the contention onto the page table lock,
in zap_pte_range().  I guess that the thread private memory
areas must be living right next to each other, in the same
page table lock regions :)

For more real world workloads, like the MySQL sysbench one,
I still suspect that your patch would improve things.


I think it definitely would, because the app will be wanting to
do other things with mmap_sem as well (like futexes *grumble*).

Also, the test case is allocating and freeing 512K chunks, which
I think would be on the high side of typical.

You have 32 threads for 4 CPUs, so then it would actually make
sense to context switch on mmap_sem write lock rather than spin
on ptl. But the kernel doesn't know that.

Testing with a small chunk size or thread == CPUs I think would
show a swing toward my patch.

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Andrew Morton wrote:


#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS



I wonder which way you're using, and whether using the other way changes
things.


I'm using the default Fedora config file, which has
NR_CPUS defined to 64 and CONFIG_SPLIT_PTLOCK_CPUS
to 4, so I am using the split locks.

However, I suspect that each 512kB malloced area
will share one page table lock with 4 others, so
some contention is to be expected.


For more real world workloads, like the MySQL sysbench one,
I still suspect that your patch would improve things.

Time to move back to debugging other stuff, though.

Andrew, it would be nice if our patches could cook in -mm
for a while.  Want me to change anything before submitting?


umm.  I took a quick squint at a patch from you this morning and it looked
OK to me.  Please send the finalish thing when it is fully baked and
performance-tested in the various regions of operation, thanks.


Will do.

Ulrich has a test version of glibc available that
uses MADV_DONTNEED for free(3), that should test
this thing nicely.

I'll run some tests with that when I get the
time, hopefully next week.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: missing madvise functionality

2007-04-05 Thread Andrew Morton
On Thu, 05 Apr 2007 14:38:30 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

> Nick Piggin wrote:
> 
> > Oh, also: something like this patch would help out MADV_DONTNEED, as it
> > means it can run concurrently with page faults. I think the locking will
> > work (but needs forward porting).
> 
> Ironically, your patch decreases throughput on my quad core
> test system, with Jakub's test case.
> 
> MADV_DONTNEED, my patch, 1 loops  (14k context switches/second)
> 
> real0m34.890s
> user0m17.256s
> sys 0m29.797s
> 
> 
> MADV_DONTNEED, my patch & your patch, 1 loops  (50 context switches/second)
> 
> real1m8.321s
> user0m20.840s
> sys 1m55.677s
> 
> I suspect it's moving the contention onto the page table lock,
> in zap_pte_range().  I guess that the thread private memory
> areas must be living right next to each other, in the same
> page table lock regions :)

Remember that we have two different ways of doing that locking:


#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
/*
 * We tuck a spinlock to guard each pagetable page into its struct page,
 * at page->private, with BUILD_BUG_ON to make sure that this will not
 * overflow into the next struct page (as it might with DEBUG_SPINLOCK).
 * When freeing, reset page->mapping so free_pages_check won't complain.
 */
#define __pte_lockptr(page) &((page)->ptl)
#define pte_lock_init(_page)do {\
spin_lock_init(__pte_lockptr(_page));   \
} while (0)
#define pte_lock_deinit(page)   ((page)->mapping = NULL)
#define pte_lockptr(mm, pmd)({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
#else
/*
 * We use mm->page_table_lock to guard all pagetable pages of the mm.
 */
#define pte_lock_init(page) do {} while (0)
#define pte_lock_deinit(page)   do {} while (0)
#define pte_lockptr(mm, pmd)({(void)(pmd); &(mm)->page_table_lock;})
#endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */


I wonder which way you're using, and whether using the other way changes
things.


> For more real world workloads, like the MySQL sysbench one,
> I still suspect that your patch would improve things.
> 
> Time to move back to debugging other stuff, though.
> 
> Andrew, it would be nice if our patches could cook in -mm
> for a while.  Want me to change anything before submitting?

umm.  I took a quick squint at a patch from you this morning and it looked
OK to me.  Please send the finalish thing when it is fully baked and
performance-tested in the various regions of operation, thanks.



Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-05 Thread Andrew Morton
On Thu, 5 Apr 2007 21:11:29 +0200
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> 
> * David Howells <[EMAIL PROTECTED]> wrote:
> 
> > But short of recording the lock sequence, I don't think there's anyway 
> > to find out for sure.  printk probably won't cut it as a recording 
> > mechanism because its overheads are too great.
> 
> getting a good trace of it is easy: pick up the latest -rt kernel from:
> 
>   http://redhat.com/~mingo/realtime-preempt/
> 
> enable EVENT_TRACING in that kernel, run the workload 
> and do:
> 
>   scripts/trace-it > to-ingo.txt
> 
> and send me the output.

Did that - no output was generated.  config at
http://userweb.kernel.org/~akpm/config-akpm2.txt

> It will be large but interesting. That should 
> get us a whole lot closer to what happens. A (much!) more finegrained 
> result would be to also enable FUNCTION_TRACING and to do:
> 
>   echo 1 > /proc/sys/kernel/mcount_enabled
> 
> before running trace-it.

Did that - still no output.

I did get an interesting dmesg spew:
http://userweb.kernel.org/~akpm/dmesg-akpm2.txt


Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-05 Thread Andrew Morton
On Thu, 05 Apr 2007 13:48:58 +0100
David Howells <[EMAIL PROTECTED]> wrote:

> Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > 
> > What we effectively have is 32 threads on a single CPU all doing
> > 
> > for (ever) {
> > down_write()
> > up_write()
> > down_read()
> > up_read();
> > }
> 
> That's not quite so.  In that test program, most loops do two d/u writes and
> then a slew of d/u reads with virtually no delay between them.  One of the
> write-locked periods possibly lasts a relatively long time (it frees a bunch
> of pages), and the read-locked periods last a potentially long time (have to
> allocate a page).

Whatever.  I think it is still the case that the queueing behaviour of
rwsems causes us to get into this abababababab scenario.  And a single,
sole, solitary cond_resched() is sufficient to trigger the whole process
happening, and once it has started, it is sustained.

> If they weren't, you'd have to expect writer starvation in this situation.  As
> it is, you're guaranteed progress on all threads.
> 
> > CONFIG_PREEMPT_VOLUNTARY=y
> 
> Which means the periods of lock-holding can be extended by preemption of the
> lock holder(s), making the whole situation that much worse.  You have to
> remember, you *can* be preempted whilst you hold a semaphore, rwsem or mutex.

Of course - the same thing happens with CONFIG_PREEMPT=y.

> > I run it all on a single CPU under `taskset -c 0' on the 8-way and it still
> > causes 160,000 context switches per second and takes 9.5 seconds (after
> > s/10/1000).
> 
> How about if you have a UP kernel?  (ie: spinlocks -> nops)

dunno.

> > the context switch rate falls to zilch and total runtime falls to 6.4
> > seconds.
> 
> I presume you don't mean literally zero.

I do.  At least, I was unable to discern any increase in the context-switch
column in the `vmstat 1' output.

> > If that cond_resched() was not there, none of this would ever happen - each
> > thread merrily chugs away doing its ups and downs until it expires its
> > timeslice.  Interesting, in a sad sort of way.
> 
> The trouble is, I think, that you spend so much more time holding (or
> attempting to hold) locks than not, and preemption just exacerbates things.

No.  Preemption *triggers* things.  We're talking about an increase in
context switch rate by a factor of at least 10,000.  Something changed in a
fundamental way.

> I suspect that the reason the problem doesn't seem so obvious when you've got
> 8 CPUs crunching their way through at once is probably because you can make
> progress on several read loops simultaneously fast enough that the preemption
> is lost in the things having to stop to give everyone writelocks.

The context switch rate is enormous on SMP on all kernel configs.  Perhaps
a better way of looking at it is to observe that the special case of a
single processor running a non-preemptible kernel simply got lucky.

> But short of recording the lock sequence, I don't think there's anyway to find
> out for sure.  printk probably won't cut it as a recording mechanism because
> its overheads are too great.

I think any code sequence which does

for ( ; ; ) {
down_write()
up_write()
down_read()
up_read()
}

is vulnerable to the artifact which I described.


I don't think we can (or should) do anything about it at the lock
implementation level.  It's more a matter of being aware of the possible
failure modes of rwsems, and being more careful to avoid that situation in
the code which uses rwsems.  And, of course, being careful about when and
where we use rwsems as opposed to other types of locks.



Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-05 Thread Ingo Molnar

* David Howells <[EMAIL PROTECTED]> wrote:

> But short of recording the lock sequence, I don't think there's anyway 
> to find out for sure.  printk probably won't cut it as a recording 
> mechanism because its overheads are too great.

getting a good trace of it is easy: pick up the latest -rt kernel from:

http://redhat.com/~mingo/realtime-preempt/

enable EVENT_TRACING in that kernel, run the workload 
and do:

scripts/trace-it > to-ingo.txt

and send me the output. It will be large but interesting. That should 
get us a whole lot closer to what happens. A (much!) more finegrained 
result would be to also enable FUNCTION_TRACING and to do:

echo 1 > /proc/sys/kernel/mcount_enabled

before running trace-it.

Ingo



Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Nick Piggin wrote:


Oh, also: something like this patch would help out MADV_DONTNEED, as it
means it can run concurrently with page faults. I think the locking will
work (but needs forward porting).


Ironically, your patch decreases throughput on my quad core
test system, with Jakub's test case.

MADV_DONTNEED, my patch, 1 loops  (14k context switches/second)

real0m34.890s
user0m17.256s
sys 0m29.797s


MADV_DONTNEED, my patch & your patch, 1 loops  (50 context switches/second)


real1m8.321s
user0m20.840s
sys 1m55.677s

I suspect it's moving the contention onto the page table lock,
in zap_pte_range().  I guess that the thread private memory
areas must be living right next to each other, in the same
page table lock regions :)

For more real world workloads, like the MySQL sysbench one,
I still suspect that your patch would improve things.

Time to move back to debugging other stuff, though.

Andrew, it would be nice if our patches could cook in -mm
for a while.  Want me to change anything before submitting?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Jakub Jelinek wrote:


+   /* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
case MADV_DONTNEED:
+   case MADV_FREE:
error = madvise_dontneed(vma, prev, start, end);
break;
 
I think you should only use the new behavior for madvise MADV_FREE, not for
MADV_DONTNEED. 


I will.  However, we need to double-use MADV_DONTNEED in this
patch for now, so Ulrich's test glibc can be used easily :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: missing madvise functionality

2007-04-05 Thread Ulrich Drepper
In case somebody wants to play around with Rik patch or another
madvise-based patch, I have x86-64 glibc binaries which can use it:

  http://people.redhat.com/drepper/rpms

These are based on the latest Fedora rawhide version.  They should work
on older systems, too, but you screw up your updates.  Use them only if
you know what you do.

By default madvise(MADV_DONTNEED) is used.  With the environment variable

  MALLOC_MADVISE

one can select a different hint.  The value of the envvar must be the
number of that other hint.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Andrew Morton wrote:
> On Thu, 05 Apr 2007 03:39:29 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:
> 
> > Rik van Riel wrote:
> > 
> > > MADV_DONTNEED, unpatched, 1000 loops
> > > 
> > > real0m13.672s
> > > user0m1.217s
> > > sys 0m45.712s
> > > 
> > > MADV_DONTNEED, with patch, 1000 loops
> > > 
> > > real0m4.169s
> > > user0m2.033s
> > > sys 0m3.224s
> > 
> > I just noticed something fun with these numbers.
> > 
> > Without the patch, the system (a quad core CPU) is 10% idle.
> > 
> > With the patch, it is 66% idle - presumably I need Nick's
> > mmap_sem patch.
> > 
> > However, despite being 66% idle, the test still runs over
> > 3 times as fast!
> 
> Please quote the context switch rate when testing this stuff (I use vmstat 1).
> I've seen it vary by a factor of 10,000 depending upon what's happening.


About context switches 14000 per second.

I'll go compile in Nick's patch to see if that makes
things go faster.  I expect it will.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 1  0      0 965232 250024 370848    0    0     0     0 1026 13914 13 21 67  0  0
 1  0      0 965232 250024 370848    0    0     0     0 1018 14654 12 20 68  0  0
 1  0      0 965232 250024 370848    0    0     0     0 1023 14006 12 21 67  0  0





Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-05 Thread David Howells
Andrew Morton <[EMAIL PROTECTED]> wrote:

> 
> What we effectively have is 32 threads on a single CPU all doing
> 
>   for (ever) {
>   down_write()
>   up_write()
>   down_read()
>   up_read();
>   }

That's not quite so.  In that test program, most loops do two d/u writes and
then a slew of d/u reads with virtually no delay between them.  One of the
write-locked periods possibly lasts a relatively long time (it frees a bunch
of pages), and the read-locked periods last a potentially long time (have to
allocate a page).

Though, to be fair, as long as you've got way more than 16MB of RAM, the
memory stuff shouldn't take too long, but the locks will be being held for a
long time compared to the periods when you're not holding a lock of any sort.

> and rwsems are "fair".

If they weren't, you'd have to expect writer starvation in this situation.  As
it is, you're guaranteed progress on all threads.

> CONFIG_PREEMPT_VOLUNTARY=y

Which means the periods of lock-holding can be extended by preemption of the
lock holder(s), making the whole situation that much worse.  You have to
remember, you *can* be preempted whilst you hold a semaphore, rwsem or mutex.

> I run it all on a single CPU under `taskset -c 0' on the 8-way and it still
> causes 160,000 context switches per second and takes 9.5 seconds (after
> s/10/1000).

How about if you have a UP kernel?  (ie: spinlocks -> nops)

> the context switch rate falls to zilch and total runtime falls to 6.4
> seconds.

I presume you don't mean literally zero.

> If that cond_resched() was not there, none of this would ever happen - each
> thread merrily chugs away doing its ups and downs until it expires its
> timeslice.  Interesting, in a sad sort of way.

The trouble is, I think, that you spend so much more time holding (or
attempting to hold) locks than not, and preemption just exacerbates things.

I suspect that the reason the problem doesn't seem so obvious when you've got
8 CPUs crunching their way through at once is probably because you can make
progress on several read loops simultaneously fast enough that the preemption
is lost in the things having to stop to give everyone writelocks.

But short of recording the lock sequence, I don't think there's anyway to find
out for sure.  printk probably won't cut it as a recording mechanism because
its overheads are too great.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Eric Dumazet wrote:
> Could you please add this patch and see if it helps on your machine ?
> 
> [PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem
> 
> Avoids cache line dirtying

I could, but I already know it's not going to help much.

How do I know this?  I already have 66% idle time when running
with my patch (and without Nick Piggin's patch to take the
mmap_sem for reading only).  Interestingly, despite the idle
time increasing from 10% to 66%, throughput triples...

Saving some CPU time will probably only increase the idle time,
I see no reason your patch would reduce contention and increase
throughput.

I'm not saying your patch doesn't make sense - it probably does.
I just suspect it would have zero impact on this particular
scenario, because of the already huge idle time.



Re: missing madvise functionality

2007-04-05 Thread Andrew Morton
On Thu, 05 Apr 2007 03:39:29 -0400 Rik van Riel <[EMAIL PROTECTED]> wrote:

> Rik van Riel wrote:
> 
> > MADV_DONTNEED, unpatched, 1000 loops
> > 
> > real0m13.672s
> > user0m1.217s
> > sys 0m45.712s
> > 
> > 
> > MADV_DONTNEED, with patch, 1000 loops
> > 
> > real0m4.169s
> > user0m2.033s
> > sys 0m3.224s
> 
> I just noticed something fun with these numbers.
> 
> Without the patch, the system (a quad core CPU) is 10% idle.
> 
> With the patch, it is 66% idle - presumably I need Nick's
> mmap_sem patch.
> 
> However, despite being 66% idle, the test still runs over
> 3 times as fast!

Please quote the context switch rate when testing this stuff (I use vmstat 1).
I've seen it vary by a factor of 10,000 depending upon what's happening.



Re: missing madvise functionality

2007-04-05 Thread Eric Dumazet
On Thu, 05 Apr 2007 03:31:24 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

> Jakub Jelinek wrote:
> 
> > My guess is that all the page zeroing is pretty expensive as well and
> > takes significant time, but I haven't profiled it.
> 
> With the attached patch (Andrew, I'll change the details around
> if you want - I just wanted something to test now), your test
> case run time went down considerably.
> 
> I modified the test case to only run 1000 loops, so it would run
> a bit faster on my system.  I also modified it to use MADV_DONTNEED
> to zap the pages, instead of the mmap(PROT_NONE) thing you use.
> 

Interesting...

Could you please add this patch and see if it helps on your machine ?

[PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem

Avoids cache line dirtying: the first cache line of mm_struct is (or should be)
mostly read.

In case find_vma() hits the cache, we don't need to access the beginning of
mm_struct.
Since we just dirtied mmap_sem, access to its cache line is free.

In case find_vma() misses the cache, we don't need to dirty the beginning of
mm_struct.


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

--- linux-2.6.21-rc5/include/linux/sched.h
+++ linux-2.6.21-rc5-ed/include/linux/sched.h
@@ -310,7 +310,6 @@ typedef unsigned long mm_counter_t;
 struct mm_struct {
struct vm_area_struct * mmap;   /* list of VMAs */
struct rb_root mm_rb;
-   struct vm_area_struct * mmap_cache; /* last find_vma result */
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
@@ -324,6 +323,7 @@ struct mm_struct {
atomic_t mm_count;  /* How many references to 
"struct mm_struct" (users count as 1) */
int map_count;  /* number of VMAs */
struct rw_semaphore mmap_sem;
+   struct vm_area_struct * mmap_cache; /* last find_vma result */
spinlock_t page_table_lock; /* Protects page tables and 
some counters */
 
struct list_head mmlist;/* List of maybe swapped mm's.  
These are globally strung





Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Rik van Riel wrote:
> MADV_DONTNEED, unpatched, 1000 loops
> 
> real0m13.672s
> user0m1.217s
> sys 0m45.712s
> 
> 
> MADV_DONTNEED, with patch, 1000 loops
> 
> real0m4.169s
> user0m2.033s
> sys 0m3.224s


I just noticed something fun with these numbers.

Without the patch, the system (a quad core CPU) is 10% idle.

With the patch, it is 66% idle - presumably I need Nick's
mmap_sem patch.

However, despite being 66% idle, the test still runs over
3 times as fast!



Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Jakub Jelinek wrote:
> My guess is that all the page zeroing is pretty expensive as well and
> takes significant time, but I haven't profiled it.


With the attached patch (Andrew, I'll change the details around
if you want - I just wanted something to test now), your test
case run time went down considerably.

I modified the test case to only run 1000 loops, so it would run
a bit faster on my system.  I also modified it to use MADV_DONTNEED
to zap the pages, instead of the mmap(PROT_NONE) thing you use.


MADV_DONTNEED, unpatched, 1000 loops

real0m13.672s
user0m1.217s
sys 0m45.712s


MADV_DONTNEED, with patch, 1000 loops

real0m4.169s
user0m2.033s
sys 0m3.224s


--- linux-2.6.20.noarch/include/asm-alpha/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-alpha/mman.h	2007-04-04 16:56:24.0 -0400
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_FREE	7		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-generic/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-generic/mman.h	2007-04-04 16:56:53.0 -0400
@@ -29,6 +29,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-mips/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-mips/mman.h	2007-04-04 16:58:02.0 -0400
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-parisc/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-parisc/mman.h	2007-04-04 16:58:40.0 -0400
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5   /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6   /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7  /* Inherit parents page size */
+#define MADV_FREE	8		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-xtensa/mman.h.madvise	2007-04-04 16:44:51.0 -0400
+++ linux-2.6.20.noarch/include/asm-xtensa/mman.h	2007-04-04 16:59:14.0 -0400
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/linux/mm_inline.h.madvise	2007-04-03 22:53:25.0 -0400
+++ linux-2.6.20.noarch/include/linux/mm_inline.h	2007-04-04 22:19:24.0 -0400
@@ -13,6 +13,13 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+	list_add_tail(&page->lru, &zone->inactive_list);
+	__inc_zone_state(zone, NR_INACTIVE);
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
--- linux-2.6.20.noarch/include/linux/mm.h.madvise	2007-04-03 22:53:25.0 -0400
+++ linux-2.6.20.noarch/include/linux/mm.h	2007-04-04 22:06:45.0 -0400
@@ -716,6 +716,7 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
+	short madv_free;			/* MADV_FREE anonymous memory */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.20.noarch/include/linux/page-flags.h.madvise	2007-04-03 22:54:58.0 -0400
+++ linux-2.6.20.noarch/include/linux/page-flags.h	2007-04-05 01:27:38.0 -0400
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used 

Re: missing madvise functionality

2007-04-05 Thread Eric Dumazet

Ulrich Drepper a écrit :
> Eric Dumazet wrote:
> > Database workload, where the user multi threaded app is constantly
> > accessing GBytes of data, so L2 cache hit is very small. If you want to
> > oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is
> > in the top 5.
> 
> We did have a workload with lots of Java and databases at some point
> when many VMAs were the issue.  I brought this up here one, maybe two
> years ago and I think Blaisorblade went on and looked into avoiding VMA
> splits by having mprotect() not split VMAs and instead store the flags
> in the page table somewhere.  I don't remember the details.
> 
> Nothing came out of this but if this is possible it would be yet another
> way to avoid mmap_sem locking, right?



I was speaking about oprofile's needs, which may interfere with the target 
process's needs, since oprofile calls find_vma() on the target process mm and 
thus zaps its mmap_cache.


oprofile is yet another mmap_sem user, but also a mmap_cache destroyer.

We could at least have a separate cache, only for oprofile.

If done correctly we might avoid taking mmap_sem when the same vm_area_struct 
contains EIP/RIP snapshots.





Re: missing madvise functionality

2007-04-05 Thread Jakub Jelinek
On Thu, Apr 05, 2007 at 03:31:24AM -0400, Rik van Riel wrote:
> > My guess is that all the page zeroing is pretty expensive as well and
> > takes significant time, but I haven't profiled it.
> 
> With the attached patch (Andrew, I'll change the details around
> if you want - I just wanted something to test now), your test
> case run time went down considerably.

Thanks.

--- linux-2.6.20.noarch/mm/madvise.c.madvise	2007-04-03 21:53:47.0 -0400
+++ linux-2.6.20.noarch/mm/madvise.c	2007-04-04 23:48:34.0 -0400
@@ -142,8 +142,12 @@ static long madvise_dontneed(struct vm_a
.last_index = ULONG_MAX,
};
zap_page_range(vma, start, end - start, &details);
-   } else
-   zap_page_range(vma, start, end - start, NULL);
+   } else {
+   struct zap_details details = {
+   .madv_free = 1,
+   };
+   zap_page_range(vma, start, end - start, &details);
+   }
return 0;
 }
 
@@ -209,7 +213,9 @@ madvise_vma(struct vm_area_struct *vma, 
error = madvise_willneed(vma, prev, start, end);
break;
 
+   /* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
case MADV_DONTNEED:
+   case MADV_FREE:
error = madvise_dontneed(vma, prev, start, end);
break;
 
I think you should only use the new behavior for MADV_FREE, not for
MADV_DONTNEED.  The current MADV_DONTNEED behavior conflicts with POSIX's
POSIX_MADV_DONTNEED, but that doesn't matter: whatever glibc maps
posix_madvise(POSIX_MADV_DONTNEED) to in its madvise call doesn't have to
be MADV_DONTNEED and can be anything else.  The current behavior is,
however, documented in the Linux man pages:
       MADV_DONTNEED
              Do not expect access in the near future.  (For the time being,
              the application is finished with the given range, so the kernel
              can free resources associated with it.)  Subsequent accesses of
              pages in this range will succeed, but will result either in
              re-loading of the memory contents from the underlying mapped
              file (see mmap()) or zero-fill-on-demand pages for mappings
              without an underlying file.
so it wouldn't surprise me if something relied on zero filling.
So IMHO madv_free in details should be only set if MADV_FREE.

Also, I think MADV_FREE shouldn't do anything at all (i.e. don't call
zap_page_range, but don't fail either) for shared or file backed vmas,
only for private anon memory it should do something.  After all, it
is just an optimization and it makes sense only for private anon mappings.

Jakub


Re: missing madvise functionality

2007-04-05 Thread Eric Dumazet
On Thu, 05 Apr 2007 04:31:55 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

> Eric Dumazet wrote:
> 
> > Could you please add this patch and see if it helps on your machine ?
> > 
> > [PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem
> > 
> > Avoids cache line dirtying
> 
> I could, but I already know it's not going to help much.
> 
> How do I know this?  I already have 66% idle time when running
> with my patch (and without Nick Piggin's patch to take the
> mmap_sem for reading only).  Interestingly, despite the idle
> time increasing from 10% to 66%, throughput triples...
> 
> Saving some CPU time will probably only increase the idle time,
> I see no reason your patch would reduce contention and increase
> throughput.
> 
> I'm not saying your patch doesn't make sense - it probably does.
> I just suspect it would have zero impact on this particular
> scenario, because of the already huge idle time.

I know your cpus have idle time, that's not the question.

But *when* your cpus are not idle, they might be slowed down because of cache 
line transfers between them. This patch doesn't reduce contention, just 
latencies (and improves overall performance).

I don't currently have an SMP test machine, so I couldn't test it myself.

On x86_64, I am pretty sure the patch would help, because offsetof(mmap_sem) = 
0x60
On i386, offsetof(mmap_sem)=0x34, so this patch won't help.

As you said, throughput can rise and idle time rise too.



Re: missing madvise functionality

2007-04-05 Thread Ulrich Drepper
Eric Dumazet wrote:
> Database workload, where the user multi threaded app is constantly
> accessing GBytes of data, so L2 cache hit is very small. If you want to
> oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is
> in the top 5.

We did have a workload with lots of Java and databases at some point
when many VMAs were the issue.  I brought this up here one, maybe two
years ago and I think Blaisorblade went on and looked into avoiding VMA
splits by having mprotect() not split VMAs and instead store the flags
in the page table somewhere.  I don't remember the details.

Nothing came out of this but if this is possible it would be yet another
way to avoid mmap_sem locking, right?

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: missing madvise functionality

2007-04-05 Thread Eric Dumazet

Nick Piggin a écrit :
> Eric Dumazet wrote:
> > This was not a working patch, just to throw the idea, since the
> > answers I got showed I was not understood.
> >
> > In this case, find_extend_vma() should of course have one struct
> > vm_area_cache * argument, like find_vma()
> >
> > One single cache on one mm is not scalable. oprofile badly hits it on
> > a dual cpu config.
> 
> Oh, what sort of workload are you using to show this? The only reason
> that I didn't submit my thread cache patches was that I didn't show a
> big enough improvement.



Database workload, where the user multi threaded app is constantly accessing 
GBytes of data, so L2 cache hit is very small. If you want to oprofile it, 
with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is in the top 5.


Each time oprofile has an NMI, it calls find_vma(EIP/RIP) and blows out the 
target process cache (usually plugged on the data vma containing user land 
futexes). Even with private futexes, it will probably be plugged on the brk() 
vma.




Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Nick Piggin wrote:
> Oh, also: something like this patch would help out MADV_DONTNEED, as it
> means it can run concurrently with page faults. I think the locking will
> work (but needs forward porting).

Ironically, your patch decreases throughput on my quad core
test system, with Jakub's test case.

MADV_DONTNEED, my patch, 1 loops  (14k context switches/second)

real0m34.890s
user0m17.256s
sys 0m29.797s


MADV_DONTNEED, my patch & your patch, 1 loops  (50 context switches/second)


real1m8.321s
user0m20.840s
sys 1m55.677s

I suspect it's moving the contention onto the page table lock,
in zap_pte_range().  I guess that the thread private memory
areas must be living right next to each other, in the same
page table lock regions :)

For more real world workloads, like the MySQL sysbench one,
I still suspect that your patch would improve things.

Time to move back to debugging other stuff, though.

Andrew, it would be nice if our patches could cook in -mm
for a while.  Want me to change anything before submitting?



Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-05 Thread Ingo Molnar

* David Howells <[EMAIL PROTECTED]> wrote:

> But short of recording the lock sequence, I don't think there's anyway 
> to find out for sure.  printk probably won't cut it as a recording 
> mechanism because its overheads are too great.

getting a good trace of it is easy: pick up the latest -rt kernel from:

http://redhat.com/~mingo/realtime-preempt/

enable EVENT_TRACING in that kernel, run the workload 
and do:

scripts/trace-it > to-ingo.txt

and send me the output. It will be large but interesting. That should 
get us a whole lot closer to what happens. A (much!) more finegrained 
result would be to also enable FUNCTION_TRACING and to do:

echo 1 > /proc/sys/kernel/mcount_enabled

before running trace-it.

Ingo



Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-05 Thread Andrew Morton
On Thu, 05 Apr 2007 13:48:58 +0100
David Howells <[EMAIL PROTECTED]> wrote:

> Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > 
> > What we effectively have is 32 threads on a single CPU all doing
> > 
> >   for (ever) {
> >   down_write()
> >   up_write()
> >   down_read()
> >   up_read();
> >   }
> 
> That's not quite so.  In that test program, most loops do two d/u writes and
> then a slew of d/u reads with virtually no delay between them.  One of the
> write-locked periods possibly lasts a relatively long time (it frees a bunch
> of pages), and the read-locked periods last a potentially long time (have to
> allocate a page).

Whatever.  I think it is still the case that the queueing behaviour of
rwsems causes us to get into this abababababab scenario.  And a single,
sole, solitary cond_resched() is sufficient to trigger the whole process
happening, and once it has started, it is sustained.

> If they weren't, you'd have to expect writer starvation in this situation.  As
> it is, you're guaranteed progress on all threads.
> 
> > CONFIG_PREEMPT_VOLUNTARY=y
> 
> Which means the periods of lock-holding can be extended by preemption of the
> lock holder(s), making the whole situation that much worse.  You have to
> remember, you *can* be preempted whilst you hold a semaphore, rwsem or mutex.

Of course - the same thing happens with CONFIG_PREEMPT=y.

> > I run it all on a single CPU under `taskset -c 0' on the 8-way and it still
> > causes 160,000 context switches per second and takes 9.5 seconds (after
> > s/10/1000).
> 
> How about if you have a UP kernel?  (ie: spinlocks -> nops)

dunno.

> > the context switch rate falls to zilch and total runtime falls to 6.4
> > seconds.
> 
> I presume you don't mean literally zero.

I do.  At least, I was unable to discern any increase in the context-switch
column in the `vmstat 1' output.

> > If that cond_resched() was not there, none of this would ever happen - each
> > thread merrily chugs away doing its ups and downs until it expires its
> > timeslice.  Interesting, in a sad sort of way.
> 
> The trouble is, I think, that you spend so much more time holding (or
> attempting to hold) locks than not, and preemption just exacerbates things.

No.  Preemption *triggers* things.  We're talking about an increase in
context switch rate by a factor of at least 10,000.  Something changed in a
fundamental way.

> I suspect that the reason the problem doesn't seem so obvious when you've got
> 8 CPUs crunching their way through at once is probably because you can make
> progress on several read loops simultaneously fast enough that the preemption
> is lost in the things having to stop to give everyone writelocks.

The context switch rate is enormous on SMP on all kernel configs.  Perhaps
a better way of looking at it is to observe that the special case of a
single processor running a non-preemptible kernel simply got lucky.

> But short of recording the lock sequence, I don't think there's anyway to find
> out for sure.  printk probably won't cut it as a recording mechanism because
> its overheads are too great.

I think any code sequence which does

for ( ; ; ) {
down_write()
up_write()
down_read()
up_read()
}

is vulnerable to the artifact which I described.


I don't think we can (or should) do anything about it at the lock
implementation level.  It's more a matter of being aware of the possible
failure modes of rwsems, and being more careful to avoid that situation in
the code which uses rwsems.  And, of course, being careful about when and
where we use rwsems as opposed to other types of locks.



Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-05 Thread Andrew Morton
On Thu, 5 Apr 2007 21:11:29 +0200
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> 
> * David Howells <[EMAIL PROTECTED]> wrote:
> 
> > But short of recording the lock sequence, I don't think there's anyway 
> > to find out for sure.  printk probably won't cut it as a recording 
> > mechanism because its overheads are too great.
> 
> getting a good trace of it is easy: pick up the latest -rt kernel from:
> 
>   http://redhat.com/~mingo/realtime-preempt/
> 
> enable EVENT_TRACING in that kernel, run the workload 
> and do:
> 
>   scripts/trace-it > to-ingo.txt
> 
> and send me the output.

Did that - no output was generated.  config at
http://userweb.kernel.org/~akpm/config-akpm2.txt

> It will be large but interesting. That should 
> get us a whole lot closer to what happens. A (much!) more finegrained 
> result would be to also enable FUNCTION_TRACING and to do:
> 
>   echo 1 > /proc/sys/kernel/mcount_enabled
> 
> before running trace-it.

Did that - still no output.

I did get an interesting dmesg spew:
http://userweb.kernel.org/~akpm/dmesg-akpm2.txt


Re: missing madvise functionality

2007-04-05 Thread Andrew Morton
On Thu, 05 Apr 2007 14:38:30 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

> Nick Piggin wrote:
> 
> > Oh, also: something like this patch would help out MADV_DONTNEED, as it
> > means it can run concurrently with page faults. I think the locking will
> > work (but needs forward porting).
> 
> Ironically, your patch decreases throughput on my quad core
> test system, with Jakub's test case.
> 
> MADV_DONTNEED, my patch, 1 loops  (14k context switches/second)
> 
> real0m34.890s
> user0m17.256s
> sys 0m29.797s
> 
> 
> MADV_DONTNEED, my patch & your patch, 1 loops  (50 context switches/second)
> 
> real1m8.321s
> user0m20.840s
> sys 1m55.677s
> 
> I suspect it's moving the contention onto the page table lock,
> in zap_pte_range().  I guess that the thread private memory
> areas must be living right next to each other, in the same
> page table lock regions :)

Remember that we have two different ways of doing that locking:


#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
/*
 * We tuck a spinlock to guard each pagetable page into its struct page,
 * at page->private, with BUILD_BUG_ON to make sure that this will not
 * overflow into the next struct page (as it might with DEBUG_SPINLOCK).
 * When freeing, reset page->mapping so free_pages_check won't complain.
 */
#define __pte_lockptr(page)	&((page)->ptl)
#define pte_lock_init(_page)	do {					\
	spin_lock_init(__pte_lockptr(_page));				\
} while (0)
#define pte_lock_deinit(page)	((page)->mapping = NULL)
#define pte_lockptr(mm, pmd)	({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
#else
/*
 * We use mm->page_table_lock to guard all pagetable pages of the mm.
 */
#define pte_lock_init(page)	do {} while (0)
#define pte_lock_deinit(page)	do {} while (0)
#define pte_lockptr(mm, pmd)	({(void)(pmd); &(mm)->page_table_lock;})
#endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */


I wonder which way you're using, and whether using the other way changes
things.


 For more real world workloads, like the MySQL sysbench one,
 I still suspect that your patch would improve things.
 
 Time to move back to debugging other stuff, though.
 
 Andrew, it would be nice if our patches could cook in -mm
 for a while.  Want me to change anything before submitting?

umm.  I took a quick squint at a patch from you this morning and it looked
OK to me.  Please send the finalish thing when it is fully baked and
performance-tested in the various regions of operation, thanks.



Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Andrew Morton wrote:


#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS



I wonder which way you're using, and whether using the other way changes
things.


I'm using the default Fedora config file, which has
NR_CPUS defined to 64 and CONFIG_SPLIT_PTLOCK_CPUS
to 4, so I am using the split locks.

However, I suspect that each 512kB malloced area
will share one page table lock with 4 others, so
some contention is to be expected.


For more real world workloads, like the MySQL sysbench one,
I still suspect that your patch would improve things.

Time to move back to debugging other stuff, though.

Andrew, it would be nice if our patches could cook in -mm
for a while.  Want me to change anything before submitting?


umm.  I took a quick squint at a patch from you this morning and it looked
OK to me.  Please send the finalish thing when it is fully baked and
performance-tested in the various regions of operation, thanks.


Will do.

Ulrich has a test version of glibc available that
uses MADV_DONTNEED for free(3), that should test
this thing nicely.

I'll run some tests with that when I get the
time, hopefully next week.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: missing madvise functionality

2007-04-05 Thread Nick Piggin

Rik van Riel wrote:

Nick Piggin wrote:


Oh, also: something like this patch would help out MADV_DONTNEED, as it
means it can run concurrently with page faults. I think the locking will
work (but needs forward porting).



Ironically, your patch decreases throughput on my quad core
test system, with Jakub's test case.

MADV_DONTNEED, my patch, 1 loops  (14k context switches/second)

real0m34.890s
user0m17.256s
sys 0m29.797s


MADV_DONTNEED, my patch & your patch, 1 loops  (50 context 
switches/second)


real1m8.321s
user0m20.840s
sys 1m55.677s

I suspect it's moving the contention onto the page table lock,
in zap_pte_range().  I guess that the thread private memory
areas must be living right next to each other, in the same
page table lock regions :)

For more real world workloads, like the MySQL sysbench one,
I still suspect that your patch would improve things.


I think it definitely would, because the app will be wanting to
do other things with mmap_sem as well (like futexes *grumble*).

Also, the test case is allocating and freeing 512K chunks, which
I think would be on the high side of typical.

You have 32 threads for 4 CPUs, so then it would actually make
sense to context switch on mmap_sem write lock rather than spin
on ptl. But the kernel doesn't know that.

Testing with a small chunk size or thread == CPUs I think would
show a swing toward my patch.

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-05 Thread Nick Piggin

Ulrich Drepper wrote:

In case somebody wants to play around with Rik's patch or another
madvise-based patch, I have x86-64 glibc binaries which can use it:

  http://people.redhat.com/drepper/rpms

These are based on the latest Fedora rawhide version.  They should work
on older systems, too, but you screw up your updates.  Use them only if
you know what you do.

By default madvise(MADV_DONTNEED) is used.  With the environment variable


Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
mmap/mprotect, which have more fundamental locking requirements, more
overhead and no benefits (except debugging, I suppose).

MADV_DONTNEED is twice as fast in single threaded performance, and an
order of magnitude faster for multiple threads, when MADV_DONTNEED only
takes mmap_sem for read.

Do you plan to include this change in general glibc releases? Maybe it
will make google malloc obsolete? ;) (I don't suppose you'd be able to
get any tests done, Andrew?)

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-05 Thread Ulrich Drepper
Nick Piggin wrote:
 Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
 kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
 mmap/mprotect, which have more fundamental locking requirements, more
 overhead and no benefits (except debugging, I suppose).

It's a tiny bit faster, see

  http://people.redhat.com/drepper/dontneed.png

I just ran it once so the graph is not smooth.  This is on a UP dual
core machine.  Maybe tomorrow I'll turn on the big 4p machine.

I would have to see dramatically different results on the big machine to
make me change the libc code.  The reason is that there is a big drawback.

So far, when we allocate a new arena, we allocate address space with
PROT_NONE and only when we need memory the protection is changed to
PROT_READ|PROT_WRITE.  This is the advantage of catching wild pointer
accesses.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: missing madvise functionality

2007-04-05 Thread Nick Piggin

Ulrich Drepper wrote:

Nick Piggin wrote:


Cool. According to my thinking, madvise(MADV_DONTNEED) even in today's
kernels using down_write(mmap_sem) for MADV_DONTNEED is better than
mmap/mprotect, which have more fundamental locking requirements, more
overhead and no benefits (except debugging, I suppose).



It's a tiny bit faster, see

  http://people.redhat.com/drepper/dontneed.png

I just ran it once so the graph is not smooth.  This is on a UP dual
core machine.  Maybe tomorrow I'll turn on the big 4p machine.


Hmm, I saw an improvement, but that was just on a raw syscall test
with a single page chunk. Real-world use I guess will get progressively
less dramatic as other overheads start being introduced.

Multi-thread performance probably won't get a whole lot better (it does
eliminate 1 down_write(mmap_sem), but one remains) until you use my
madvise patch.



I would have to see dramatically different results on the big machine to
make me change the libc code.  The reason is that there is a big drawback.

So far, when we allocate a new arena, we allocate address space with
PROT_NONE and only when we need memory the protection is changed to
PROT_READ|PROT_WRITE.  This is the advantage of catching wild pointer
accesses.


Sure, yes. And I guess you'd always want to keep that options around as
a debugging aid.

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-05 Thread Ulrich Drepper
Eric Dumazet wrote:
 Database workload, where the user multi threaded app is constantly
 accessing GBytes of data, so L2 cache hit is very small. If you want to
 oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is
 in the top 5.

We did have a workload with lots of Java and databases at some point
when many VMAs were the issue.  I brought this up here one, maybe two
years ago and I think Blaisorblade went on and looked into avoiding VMA
splits by having mprotect() not split VMAs and instead store the flags
in the page table somewhere.  I don't remember the details.

Nothing came out of this but if this is possible it would be yet another
way to avoid mmap_sem locking, right?

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: missing madvise functionality

2007-04-05 Thread Eric Dumazet

Nick Piggin wrote:

Eric Dumazet wrote:
 This was not a working patch, just to throw the idea, since the

answers I got showed I was not understood.

In this case, find_extend_vma() should of course have one struct 
vm_area_cache * argument, like find_vma()


One single cache on one mm is not scalable. oprofile badly hits it on 
a dual cpu config.


Oh, what sort of workload are you using to show this? The only reason 
that I

didn't submit my thread cache patches was that I didn't show a big enough
improvement.



Database workload, where the user multi threaded app is constantly accessing 
GBytes of data, so L2 cache hit is very small. If you want to oprofile it, 
with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is in the top 5.


Each time oprofile has an NMI, it calls find_vma(EIP/RIP) and blows out the 
target process cache (usually plugged on the data vma containing user land 
futexes). Even with private futexes, it will probably be plugged on the brk() 
vma.




Re: missing madvise functionality

2007-04-05 Thread Jakub Jelinek
On Thu, Apr 05, 2007 at 03:31:24AM -0400, Rik van Riel wrote:
 My guess is that all the page zeroing is pretty expensive as well and
 takes significant time, but I haven't profiled it.
 
 With the attached patch (Andrew, I'll change the details around
 if you want - I just wanted something to test now), your test
 case run time went down considerably.

Thanks.

--- linux-2.6.20.noarch/mm/madvise.c.madvise2007-04-03 21:53:47.0 
-0400
+++ linux-2.6.20.noarch/mm/madvise.c2007-04-04 23:48:34.0 -0400
@@ -142,8 +142,12 @@ static long madvise_dontneed(struct vm_a
.last_index = ULONG_MAX,
};
		zap_page_range(vma, start, end - start, &details);
-   } else
-   zap_page_range(vma, start, end - start, NULL);
+   } else {
+   struct zap_details details = {
+   .madv_free = 1,
+   };
+   zap_page_range(vma, start, end - start, &details);
+   }
return 0;
 }
 
@@ -209,7 +213,9 @@ madvise_vma(struct vm_area_struct *vma, 
error = madvise_willneed(vma, prev, start, end);
break;
 
+   /* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
case MADV_DONTNEED:
+   case MADV_FREE:
error = madvise_dontneed(vma, prev, start, end);
break;
 
I think you should only use the new behavior for madvise MADV_FREE, not for
MADV_DONTNEED.  The current MADV_DONTNEED behavior (which conflicts with
POSIX's POSIX_MADV_DONTNEED, but that doesn't matter, since whatever glibc
maps posix_madvise's POSIX_MADV_DONTNEED to in its madvise call - if anything -
doesn't have to be MADV_DONTNEED, but can be anything else) is apparently
documented in Linux man pages:
   MADV_DONTNEED
          Do not expect access in the near future.  (For the time being,
          the application is finished with the given range, so the kernel
          can free resources associated with it.)  Subsequent accesses of
          pages in this range will succeed, but will result either in
          re-loading of the memory contents from the underlying mapped
          file (see mmap()) or zero-fill-on-demand pages for mappings
          without an underlying file.
so it wouldn't surprise me if something relied on zero filling.
So IMHO madv_free in details should be only set if MADV_FREE.

Also, I think MADV_FREE shouldn't do anything at all (i.e. don't call
zap_page_range, but don't fail either) for shared or file backed vmas,
only for private anon memory it should do something.  After all, it
is just an optimization and it makes sense only for private anon mappings.

Jakub


Re: missing madvise functionality

2007-04-05 Thread Eric Dumazet
On Thu, 05 Apr 2007 04:31:55 -0400
Rik van Riel [EMAIL PROTECTED] wrote:

 Eric Dumazet wrote:
 
  Could you please add this patch and see if it helps on your machine ?
  
  [PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem
  
  Avoids cache line dirtying
 
 I could, but I already know it's not going to help much.
 
 How do I know this?  I already have 66% idle time when running
 with my patch (and without Nick Piggin's patch to take the
 mmap_sem for reading only).  Interestingly, despite the idle
 time increasing from 10% to 66%, throughput triples...
 
 Saving some CPU time will probably only increase the idle time,
 I see no reason your patch would reduce contention and increase
 throughput.
 
 I'm not saying your patch doesn't make sense - it probably does.
 I just suspect it would have zero impact on this particular
 scenario, because of the already huge idle time.

I know your cpus have idle time, that's not the question.

But *when* your cpus are not idle, they might be slowed down because of cache 
line transfers between them. This patch doesn't reduce contention, just 
latencies (and overall performance)

I don't currently have an SMP test machine, so I couldn't test it myself.

On x86_64, I am pretty sure the patch would help, because offsetof(mmap_sem) = 0x60
On i386, offsetof(mmap_sem) = 0x34, so this patch won't help.

As you said, throughput can rise and idle time rise too.



Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Jakub Jelinek wrote:


My guess is that all the page zeroing is pretty expensive as well and
takes significant time, but I haven't profiled it.


With the attached patch (Andrew, I'll change the details around
if you want - I just wanted something to test now), your test
case run time went down considerably.

I modified the test case to only run 1000 loops, so it would run
a bit faster on my system.  I also modified it to use MADV_DONTNEED
to zap the pages, instead of the mmap(PROT_NONE) thing you use.


MADV_DONTNEED, unpatched, 1000 loops

real0m13.672s
user0m1.217s
sys 0m45.712s


MADV_DONTNEED, with patch, 1000 loops

real0m4.169s
user0m2.033s
sys 0m3.224s


--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.noarch/include/asm-alpha/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-alpha/mman.h	2007-04-04 16:56:24.0 -0400
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_FREE	7		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-generic/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-generic/mman.h	2007-04-04 16:56:53.0 -0400
@@ -29,6 +29,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-mips/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-mips/mman.h	2007-04-04 16:58:02.0 -0400
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-parisc/mman.h.madvise	2007-04-04 16:44:50.0 -0400
+++ linux-2.6.20.noarch/include/asm-parisc/mman.h	2007-04-04 16:58:40.0 -0400
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5   /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6   /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7  /* Inherit parents page size */
+#define MADV_FREE	8		/* don't need the pages or the data */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/asm-xtensa/mman.h.madvise	2007-04-04 16:44:51.0 -0400
+++ linux-2.6.20.noarch/include/asm-xtensa/mman.h	2007-04-04 16:59:14.0 -0400
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* don't need the pages or the data */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
--- linux-2.6.20.noarch/include/linux/mm_inline.h.madvise	2007-04-03 22:53:25.0 -0400
+++ linux-2.6.20.noarch/include/linux/mm_inline.h	2007-04-04 22:19:24.0 -0400
@@ -13,6 +13,13 @@ add_page_to_inactive_list(struct zone *z
 }
 
 static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+	list_add_tail(&page->lru, &zone->inactive_list);
+	__inc_zone_state(zone, NR_INACTIVE);
+}
+
+static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
--- linux-2.6.20.noarch/include/linux/mm.h.madvise	2007-04-03 22:53:25.0 -0400
+++ linux-2.6.20.noarch/include/linux/mm.h	2007-04-04 22:06:45.0 -0400
@@ -716,6 +716,7 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page-index to unmap */
 	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
+	short madv_free;			/* MADV_FREE anonymous memory */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.20.noarch/include/linux/page-flags.h.madvise	2007-04-03 22:54:58.0 -0400
+++ linux-2.6.20.noarch/include/linux/page-flags.h	2007-04-05 01:27:38.0 -0400
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* 

Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Eric Dumazet wrote:


Could you please add this patch and see if it helps on your machine ?

[PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem

Avoids cache line dirtying


I could, but I already know it's not going to help much.

How do I know this?  I already have 66% idle time when running
with my patch (and without Nick Piggin's patch to take the
mmap_sem for reading only).  Interestingly, despite the idle
time increasing from 10% to 66%, throughput triples...

Saving some CPU time will probably only increase the idle time,
I see no reason your patch would reduce contention and increase
throughput.

I'm not saying your patch doesn't make sense - it probably does.
I just suspect it would have zero impact on this particular
scenario, because of the already huge idle time.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: missing madvise functionality

2007-04-05 Thread Eric Dumazet

Ulrich Drepper wrote:

Eric Dumazet wrote:

Database workload, where the user multi threaded app is constantly
accessing GBytes of data, so L2 cache hit is very small. If you want to
oprofile it, with say a CPU_CLK_UNHALTED:5000 event, then find_vma() is
in the top 5.


We did have a workload with lots of Java and databases at some point
when many VMAs were the issue.  I brought this up here one, maybe two
years ago and I think Blaisorblade went on and looked into avoiding VMA
splits by having mprotect() not split VMAs and instead store the flags
in the page table somewhere.  I don't remember the details.

Nothing came out of this but if this is possible it would be yet another
way to avoid mmap_sem locking, right?



I was speaking about oprofile needs, that may interfere with target process 
needs, since oprofile calls find_vma() on the target process mm and thus zap 
its mmap_cache.


oprofile is yet another mmap_sem user, but also a mmap_cache destroyer.

We could at least have a separate cache, only for oprofile.

If done correctly we might avoid taking mmap_sem when the same vm_area_struct 
contains EIP/RIP snapshots.





Re: missing madvise functionality

2007-04-05 Thread Andrew Morton
On Thu, 05 Apr 2007 03:39:29 -0400 Rik van Riel [EMAIL PROTECTED] wrote:

 Rik van Riel wrote:
 
  MADV_DONTNEED, unpatched, 1000 loops
  
  real0m13.672s
  user0m1.217s
  sys 0m45.712s
  
  
  MADV_DONTNEED, with patch, 1000 loops
  
  real0m4.169s
  user0m2.033s
  sys 0m3.224s
 
 I just noticed something fun with these numbers.
 
 Without the patch, the system (a quad core CPU) is 10% idle.
 
 With the patch, it is 66% idle - presumably I need Nick's
 mmap_sem patch.
 
 However, despite being 66% idle, the test still runs over
 3 times as fast!

Please quote the context switch rate when testing this stuff (I use vmstat 1).
I've seen it vary by a factor of 10,000 depending upon what's happening.



Re: missing madvise functionality

2007-04-05 Thread Eric Dumazet
On Thu, 05 Apr 2007 03:31:24 -0400
Rik van Riel [EMAIL PROTECTED] wrote:

 Jakub Jelinek wrote:
 
  My guess is that all the page zeroing is pretty expensive as well and
  takes significant time, but I haven't profiled it.
 
 With the attached patch (Andrew, I'll change the details around
 if you want - I just wanted something to test now), your test
 case run time went down considerably.
 
 I modified the test case to only run 1000 loops, so it would run
 a bit faster on my system.  I also modified it to use MADV_DONTNEED
 to zap the pages, instead of the mmap(PROT_NONE) thing you use.
 

Interesting...

Could you please add this patch and see if it helps on your machine ?

[PATCH] VM : mm_struct's mmap_cache should be close to mmap_sem

Avoids cache line dirtying: The first cache line of mm_struct is/should_be 
mostly read.

In case find_vma() hits the cache, we don't need to access the beginning of 
mm_struct.
Since we just dirtied mmap_sem, access to its cache line is free.

In case find_vma() misses the cache, we don't need to dirty the beginning of 
mm_struct.


Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

--- linux-2.6.21-rc5/include/linux/sched.h
+++ linux-2.6.21-rc5-ed/include/linux/sched.h
@@ -310,7 +310,6 @@ typedef unsigned long mm_counter_t;
 struct mm_struct {
struct vm_area_struct * mmap;   /* list of VMAs */
struct rb_root mm_rb;
-   struct vm_area_struct * mmap_cache; /* last find_vma result */
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
@@ -324,6 +323,7 @@ struct mm_struct {
atomic_t mm_count;  /* How many references to 
struct mm_struct (users count as 1) */
int map_count;  /* number of VMAs */
struct rw_semaphore mmap_sem;
+   struct vm_area_struct * mmap_cache; /* last find_vma result */
spinlock_t page_table_lock; /* Protects page tables and 
some counters */
 
struct list_head mmlist;/* List of maybe swapped mm's.  
These are globally strung





Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Rik van Riel wrote:


MADV_DONTNEED, unpatched, 1000 loops

real0m13.672s
user0m1.217s
sys 0m45.712s


MADV_DONTNEED, with patch, 1000 loops

real0m4.169s
user0m2.033s
sys 0m3.224s


I just noticed something fun with these numbers.

Without the patch, the system (a quad core CPU) is 10% idle.

With the patch, it is 66% idle - presumably I need Nick's
mmap_sem patch.

However, despite being 66% idle, the test still runs over
3 times as fast!

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: preemption and rwsems (was: Re: missing madvise functionality)

2007-04-05 Thread David Howells
Andrew Morton [EMAIL PROTECTED] wrote:

 
 What we effectively have is 32 threads on a single CPU all doing
 
   for (ever) {
   down_write()
   up_write()
   down_read()
   up_read();
   }

That's not quite so.  In that test program, most loops do two d/u writes and
then a slew of d/u reads with virtually no delay between them.  One of the
write-locked periods possibly lasts a relatively long time (it frees a bunch
of pages), and the read-locked periods last a potentially long time (have to
allocate a page).

Though, to be fair, as long as you've got way more than 16MB of RAM, the
memory stuff shouldn't take too long, but the locks will be being held for a
long time compared to the periods when you're not holding a lock of any sort.

 and rwsems are fair.

If they weren't, you'd have to expect writer starvation in this situation.  As
it is, you're guaranteed progress on all threads.

 CONFIG_PREEMPT_VOLUNTARY=y

Which means the periods of lock-holding can be extended by preemption of the
lock holder(s), making the whole situation that much worse.  You have to
remember, you *can* be preempted whilst you hold a semaphore, rwsem or mutex.

 I run it all on a single CPU under `taskset -c 0' on the 8-way and it still
 causes 160,000 context switches per second and takes 9.5 seconds (after
 s/10/1000).

How about if you have a UP kernel?  (ie: spinlocks -> nops)

 the context switch rate falls to zilch and total runtime falls to 6.4
 seconds.

I presume you don't mean literally zero.

 If that cond_resched() was not there, none of this would ever happen - each
 thread merrily chugs away doing its ups and downs until it expires its
 timeslice.  Interesting, in a sad sort of way.

The trouble is, I think, that you spend so much more time holding (or
attempting to hold) locks than not, and preemption just exacerbates things.

I suspect that the reason the problem doesn't seem so obvious when you've got
8 CPUs crunching their way through at once is probably because you can make
progress on several read loops simultaneously fast enough that the preemption
is lost in the things having to stop to give everyone writelocks.

But short of recording the lock sequence, I don't think there's anyway to find
out for sure.  printk probably won't cut it as a recording mechanism because
its overheads are too great.

David


Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Andrew Morton wrote:

On Thu, 05 Apr 2007 03:39:29 -0400 Rik van Riel [EMAIL PROTECTED] wrote:


Rik van Riel wrote:


MADV_DONTNEED, unpatched, 1000 loops

real0m13.672s
user0m1.217s
sys 0m45.712s


MADV_DONTNEED, with patch, 1000 loops

real0m4.169s
user0m2.033s
sys 0m3.224s

I just noticed something fun with these numbers.

Without the patch, the system (a quad core CPU) is 10% idle.

With the patch, it is 66% idle - presumably I need Nick's
mmap_sem patch.

However, despite being 66% idle, the test still runs over
3 times as fast!


Please quote the context switch rate when testing this stuff (I use vmstat 1).
I've seen it vary by a factor of 10,000 depending upon what's happening.


About 14000 context switches per second.

I'll go compile in Nick's patch to see if that makes
things go faster.  I expect it will.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 1  0      0 965232 250024 370848    0    0     0     0 1026 13914 13 21 67  0  0
 1  0      0 965232 250024 370848    0    0     0     0 1018 14654 12 20 68  0  0
 1  0      0 965232 250024 370848    0    0     0     0 1023 14006 12 21 67  0  0



--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: missing madvise functionality

2007-04-05 Thread Ulrich Drepper
In case somebody wants to play around with Rik's patch or another
madvise-based patch, I have x86-64 glibc binaries which can use it:

  http://people.redhat.com/drepper/rpms

These are based on the latest Fedora rawhide version.  They should work
on older systems, too, but you screw up your updates.  Use them only if
you know what you do.

By default madvise(MADV_DONTNEED) is used.  With the environment variable

  MALLOC_MADVISE

one can select a different hint.  The value of the envvar must be the
number of that other hint.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: missing madvise functionality

2007-04-05 Thread Rik van Riel

Jakub Jelinek wrote:


+   /* FIXME: POSIX says that MADV_DONTNEED cannot throw away data. */
case MADV_DONTNEED:
+   case MADV_FREE:
error = madvise_dontneed(vma, prev, start, end);
break;
 
I think you should only use the new behavior for madvise MADV_FREE, not for
MADV_DONTNEED. 


I will.  However, we need to double-use MADV_DONTNEED in this
patch for now, so Ulrich's test glibc can be used easily :)



Re: missing madvise functionality

2007-04-04 Thread William Lee Irwin III
On Wed, 4 Apr 2007 06:09:18 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> Oh dear.

On Wed, Apr 04, 2007 at 11:51:05AM -0700, Andrew Morton wrote:
> what's all this about?

I rewrote Jakub's testcase and included it as a MIME attachment.
Current working version inline below. Also at

http://holomorphy.com/~wli/jakub.c

The basic idea was that I wanted a few more niceties, such as specifying
the number of iterations and other things of that nature on the cmdline.
I threw in a little code reorganization and error checking, too.


-- wli


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

enum thread_return {
tr_success  =  0,
tr_mmap_init= -1,
tr_mmap_free= -2,
tr_mprotect = -3,
tr_madvise  = -4,
tr_unknown  = -5,
tr_munmap   = -6,
};

enum release_method {
release_by_mmap = 0,
release_by_madvise  = 1,
release_by_max  = 2,
};

struct thread_argument {
size_t page_size;
int iterations, pages_per_thread, nr_threads;
enum release_method method;
};

static enum thread_return mmap_release(void *p, size_t n)
{
void *q;

q = mmap(p, n, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
if (p != q) {
perror("thread_function: mmap release failed");
return tr_mmap_free;
}
if (mprotect(p, n, PROT_READ | PROT_WRITE)) {
perror("thread_function: mprotect failed");
return tr_mprotect;
}
return tr_success;
}

static enum thread_return madvise_release(void *p, size_t n)
{
if (madvise(p, n, MADV_DONTNEED)) {
perror("thread_function: madvise failed");
return tr_madvise;
}
return tr_success;
}

static enum thread_return (*release_methods[])(void *, size_t) = {
mmap_release,
madvise_release,
};

static void *thread_function(void *__arg)
{
char *p;
int i;
struct thread_argument *arg = __arg;
size_t arena_size = arg->pages_per_thread * arg->page_size;

p = (char *)mmap(NULL, arena_size,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) {
perror("thread_function: arena allocation failed");
return (void *)tr_mmap_init;
}
for (i = 0; i < arg->iterations; i++) {
size_t s;
char *q, *r;
enum thread_return ret;

/* Pretend to use the buffer.  */
r = p + arena_size;
for (q = p; q < r; q += arg->page_size)
*q = 1;
for (s = 0, q = p; q < r; q += arg->page_size)
s += *q;
if (arg->method >= release_by_max) {
perror("thread_function: "
"unknown freeing method specified");
return (void *)tr_unknown;
}
ret = (*release_methods[arg->method])(p, arena_size);
if (ret != tr_success)
return (void *)ret;
}
if (munmap(p, arena_size)) {
perror("thread_function: munmap() failed");
return (void *)tr_munmap;
}
return (void *)tr_success;
}

static int configure(struct thread_argument *arg, int argc, char *argv[])
{
char optstring[] = "t:m:i:p:";
int c, tmp, ret = 0;
long n;

n = sysconf(_SC_PAGE_SIZE);
if (n < 0) {
perror("configure: sysconf(_SC_PAGE_SIZE) failed");
ret = -1;
}
arg->nr_threads = 32, 
arg->page_size = (size_t)n;
arg->method = release_by_mmap;
arg->iterations = 10;
arg->pages_per_thread = 128;

while ((c = getopt(argc, argv, optstring)) != -1) {
switch (c) {
case 't':
if (sscanf(optarg, "%d", &tmp) == 1)
arg->nr_threads = tmp;
else {
perror("configure: non-numeric thread count");
ret = -1;
}
break;
case 'm':
if (!strcmp(optarg, "mmap"))
arg->method = release_by_mmap;
else if (!strcmp(optarg, "madvise"))
arg->method = release_by_madvise;
else {
perror("configure: unrecognised release method");

Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Nick Piggin wrote:

Jakub Jelinek wrote:


On Wed, Apr 04, 2007 at 05:46:12PM +1000, Nick Piggin wrote:


Does mmap(PROT_NONE) actually free the memory?




Yes.
/* Clear old maps */
error = -ENOMEM;
munmap_back:
vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
if (vma && vma->vm_start < addr + len) {
if (do_munmap(mm, addr, len))
return -ENOMEM;
goto munmap_back;
}



Thanks, I overlooked the mmap vs mprotect detail. So how are the subsequent
access faults avoided?


AFAIKS, the faults are not avoided. Not for single page allocations, not
for multi-page allocations.

So what glibc currently does to allocate, use, then deallocate a page is
this:
  mprotect(PROT_READ|PROT_WRITE) -> down_write(mmap_sem)
  touch page -> page fault -> down_read(mmap_sem)
  mmap(PROT_NONE) -> down_write(mmap_sem)

What it could be doing is:
  touch page -> page fault -> down_read(mmap_sem)
  madvise(MADV_DONTNEED) -> down_read(mmap_sem)

So after my previously posted patch (attached again) to only take down_read
in madvise where possible...

With 2 threads, the attached test.c ends up doing about 140,000 context
switches per second with just 2 threads/2CPUs, takes a little over 2
million faults, and about 80 seconds to complete, when running the
old_test() function (ie. mprotect,touch,mmap).

When running new_test() (ie. touch,madvise), context switches stay well
under 100, it takes slightly fewer faults, and it completes in about 8
seconds.

With 1 thread, new_test() actually completes in under half the time as
well (4.55 vs 9.88 seconds). This result won't have been altered by my
madvise patch, because the down_write fastpath is no slower than down_read.

Any comments?

--
SUSE Labs, Novell Inc.
Index: linux-2.6/mm/madvise.c
===
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -12,6 +12,25 @@
 #include 
 
 /*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_sem for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static int madvise_need_mmap_write(int behavior)
+{
+   switch (behavior) {
+   case MADV_DOFORK:
+   case MADV_DONTFORK:
+   case MADV_NORMAL:
+   case MADV_SEQUENTIAL:
+   case MADV_RANDOM:
+   return 1;
+   default:
+   return 0;
+   }
+}
+
+/*
  * We can potentially split a vm area into separate
  * areas, each area with its own behavior.
  */
@@ -264,7 +283,10 @@ asmlinkage long sys_madvise(unsigned lon
int error = -EINVAL;
size_t len;
 
-	down_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		down_write(&current->mm->mmap_sem);
+	else
+		down_read(&current->mm->mmap_sem);
 
if (start & ~PAGE_MASK)
goto out;
@@ -323,6 +345,10 @@ asmlinkage long sys_madvise(unsigned lon
vma = prev->vm_next;
}
 out:
-	up_write(&current->mm->mmap_sem);
+	if (madvise_need_mmap_write(behavior))
+		up_write(&current->mm->mmap_sem);
+	else
+		up_read(&current->mm->mmap_sem);
+
return error;
 }
#include <pthread.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_THREADS	1
#define ITERS	100
#define HEAPSIZE	(4*1024)

static void *old_thread(void *heap)
{
	int i;

	for (i = 0; i < ITERS; i++) {
		char *mem = heap;
		if (mprotect(heap, HEAPSIZE, PROT_READ|PROT_WRITE) == -1)
			perror("mprotect"), exit(1);
		*mem = i;
		if (mmap(heap, HEAPSIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0) == MAP_FAILED)
			perror("mmap"), exit(1);
	}

	return NULL;
}

static void old_test(void)
{
	void *heap;
	pthread_t pt[NR_THREADS];
	int i;

	heap = mmap(NULL, NR_THREADS*HEAPSIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		perror("mmap"), exit(1);

	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_create(&pt[i], NULL, old_thread, heap + i*HEAPSIZE) != 0)
			perror("pthread_create"), exit(1);
	}
	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_join(pt[i], NULL) != 0)
			perror("pthread_join"), exit(1);
	}

	if (munmap(heap, NR_THREADS*HEAPSIZE) == -1)
		perror("munmap"), exit(1);
}

static void *new_thread(void *heap)
{
	int i;

	for (i = 0; i < ITERS; i++) {
		char *mem = heap;
		*mem = i;
		if (madvise(heap, HEAPSIZE, MADV_DONTNEED) == -1)
			perror("madvise"), exit(1);
	}

	return NULL;
}

static void new_test(void)
{
	void *heap;
	pthread_t pt[NR_THREADS];
	int i;

	heap = mmap(NULL, HEAPSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		perror("mmap"), exit(1);

	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_create(&pt[i], NULL, new_thread, heap + i*HEAPSIZE) != 0)
			perror("pthread_create"), exit(1);
	}
	for (i = 0; i < NR_THREADS; i++) {
		if (pthread_join(pt[i], NULL) != 0)
			perror("pthread_join"), exit(1);
	}

	if (munmap(heap, HEAPSIZE) == 

Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Eric Dumazet wrote:

On Wed, 04 Apr 2007 20:05:54 +1000
Nick Piggin <[EMAIL PROTECTED]> wrote:


@@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
unsigned long start;

addr &= PAGE_MASK;
-   vma = find_vma(mm,addr);
+   vma = find_vma(mm, addr, &current->vmacache);
if (!vma)
return NULL;
if (vma->vm_start <= addr)


So now you can have current calling find_extend_vma on someone else's mm
but using their cache. So you're going to return current's vma, or current
is going to get one of mm's vmas in its cache :P



This was not a working patch, just a sketch to get the idea across, since the
answers I got showed I was not understood.

In this case, find_extend_vma() should of course have one struct vm_area_cache 
* argument, like find_vma()

One single cache on one mm is not scalable. oprofile badly hits it on a dual 
cpu config.


Oh, what sort of workload are you using to show this? The only reason that I
didn't submit my thread cache patches was that I didn't show a big enough
improvement.

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Hugh Dickins wrote:

On Wed, 4 Apr 2007, Rik van Riel wrote:


Hugh Dickins wrote:



(I didn't understand how Rik would achieve his point 5, _no_ lock
contention while repeatedly re-marking these pages, but never mind.)


The CPU marks them accessed when they are reused.

The VM only moves the reused pages back to the active list
on memory pressure.  This means that when the system is
not under memory pressure, the same page can simply stay
PG_lazyfree for multiple malloc/free rounds.



Sure, there's no need for repetitious locking at the LRU end of it;
but you said "if the system has lots of free memory, pages can go
through multiple free/malloc cycles while sitting on the dontneed
list, very lazily with no lock contention".  I took that to mean,
with userspace repeatedly madvising on the ranges they fall in,
which will involve mmap_sem and ptl each time - just in order
to check that no LRU movement is required each time.

(Of course, there's also the problem that we don't leave our
systems with lots of free memory: some LRU balancing decisions.)


I don't agree this approach is the best one anyway. I'd rather
just the simple MADV_DONTNEED/MADV_DONEED.

Once you go through the trouble of protecting the memory and
flushing TLBs, unprotecting them afterwards and taking a trap
(even if it is a pure hardware trap), I doubt you've saved much.

You may have saved the cost of zeroing out the page, but that
has to be weighed against the fact that you have left a possibly
cache hot page sitting there to get cold, and your accesses to
initialise the malloced memory might have more cache misses.

If you just free the page, it goes onto a nice LIFO cache hot
list, and when you want to allocate another one, you'll probably
get a cache hot one.

The problem is down_write(mmap_sem) isn't it? We can and should
easily fix that problem now. If we subsequently want to look at
micro optimisations to avoid zeroing using MMU tricks, then we
have a good base to compare with.

--
SUSE Labs, Novell Inc.


preemption and rwsems (was: Re: missing madvise functionality)

2007-04-04 Thread Andrew Morton
On Tue, 3 Apr 2007 16:29:37 -0400
Jakub Jelinek <[EMAIL PROTECTED]> wrote:

> #include <pthread.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
> 
> void *
> tf (void *arg)
> {
>   (void) arg;
>   size_t ps = sysconf (_SC_PAGE_SIZE);
>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>   if (p == MAP_FAILED)
> exit (1);
>   int i;
>   for (i = 0; i < 10; i++)
> {
>   /* Pretend to use the buffer.  */
>   char *q, *r = (char *) p + 128 * ps;
>   size_t s;
>   for (q = (char *) p; q < r; q += ps)
> *q = 1;
>   for (s = 0, q = (char *) p; q < r; q += ps)
> s += *q;
>   /* Free it.  Replace this mmap with
>  madvise (p, 128 * ps, MADV_THROWAWAY) when implemented.  */
>   if (mmap (p, 128 * ps, PROT_NONE,
> MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != p)
> exit (2);
>   /* And immediately malloc again.  This would then be deleted.  */
>   if (mprotect (p, 128 * ps, PROT_READ | PROT_WRITE))
> exit (3);
> }
>   return NULL;
> }
> 
> int
> main (void)
> {
>   pthread_t th[32];
>   int i;
>   for (i = 0; i < 32; i++)
> if (pthread_create (&th[i], NULL, tf, NULL))
>   exit (4);
>   for (i = 0; i < 32; i++)
> pthread_join (th[i], NULL);
>   return 0;
> }

This little test app is fun.

I run it all on a single CPU under `taskset -c 0' on the 8-way and it still
causes 160,000 context switches per second and takes 9.5 seconds (after
s/10/1000).

The kernel has

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

and when I switch that to

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

the context switch rate falls to zilch and total runtime falls to 6.4
seconds.

Presumably the same problem will occur with CONFIG_PREEMPT_VOLUNTARY on
uniprocessor kernels.



What we effectively have is 32 threads on a single CPU all doing

for (ever) {
down_write()
up_write()
down_read()
up_read();
}

and rwsems are "fair".  So

  thread A thread B

  down_write();

  cond_resched()
  ->schedule()

   down_read() -> blocks

  up_write()

  down_read()

  up_read()

  down_write() -> there's a reader: block

   down_read() -> succeeds

   up_read()

   down_write() -> there's another down_writer: block

  down_write() -> succeeds

  up_write()

  down_read() -> there's a down_writer: block

   down_write() succeeds

   up_write()

   down_read() -> succeeds

   up_read()

   down_write() -> there's a down_reader: block

  down_read() succeeds


ad nauseum.


If that cond_resched() was not there, none of this would ever happen - each
thread merrily chugs away doing its ups and downs until it expires its
timeslice.  Interesting, in a sad sort of way.



Setting CONFIG_PREEMPT_NONE doesn't appear to make any difference to
context switch rate or runtime when all eight CPUs are used, so this
phenomenon is unlikely to be involved in the mysql problem.

I wonder why a similar thing doesn't happen when more than one CPU is used.



Re: missing madvise functionality

2007-04-04 Thread Andrew Morton
On Wed, 04 Apr 2007 14:08:47 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> 
> > There are other ways of doing it - I guess we could use a new page flag to
> > indicate that this is one-of-those-pages, and add new code to handle it in
> > all the right places.
> 
> That's what I did.  I'm currently working on the
> zap_page_range() side of things.

Let's try to avoid consuming another page flag if poss, please.  Perhaps
use PAGE_MAPPING_ANON's neighbouring bit?

> > One thing which we haven't sorted out with all this stuff: once the
> > application has marked an address range (and some pages) as
> > whatever-were-going-call-this-feature, how does the application undo that
> > change? 
> 
> It doesn't have to do anything.  Just access the page and the
> MMU will mark it dirty/accessed and the VM will not reclaim
> it.

um, OK.  I suspect it would be good to clear the page's
PageWhateverWereGoingToCallThisThing() state when this happens.  Otherwise
when the page gets clean again (ie: added to swapcache then written out)
then it will look awfully similar to one of these new types of pages and
things might get confusing.  We'll see.



Re: missing madvise functionality

2007-04-04 Thread Andrew Morton
On Wed, 4 Apr 2007 06:09:18 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:

> 
> On Tue, Apr 03, 2007 at 04:29:37PM -0400, Jakub Jelinek wrote:
> > void *
> > tf (void *arg)
> > {
> >   (void) arg;
> >   size_t ps = sysconf (_SC_PAGE_SIZE);
> >   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
> >   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >   if (p == MAP_FAILED)
> > exit (1);
> >   int i;
> 
> Oh dear.

what's all this about?


Re: missing madvise functionality

2007-04-04 Thread Anton Blanchard

Hi,

> Oh.  I was assuming that we'd want to unmap these pages from pagetables and
> mark then super-easily-reclaimable.  So a later touch would incur a minor
> fault.
> 
> But you think that we should leave them mapped into pagetables so no such
> fault occurs.

That would be very nice. The issues are not limited to threaded apps,
we have seen performance problems with single threaded HPC applications
that do a lot of large malloc/frees. It turns out the continual set up
and tear down of pagetables when malloc uses mmap/free is a problem. At
the moment the workaround is:

export MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1

which forces glibc malloc to use brk instead of mmap/free. Of course brk
is good for keeping pagetables around but bad for keeping memory usage
down.

Anton


Re: missing madvise functionality

2007-04-04 Thread Hugh Dickins
On Wed, 4 Apr 2007, Andrew Morton wrote:
> 
> The treatment is identical to clean swapcache pages, with the sole
> exception that they don't actually consume any swap space - hence the fake
> swapcache entry thing.

I see, sneaking through try_to_unmap's anon PageSwapCache assumptions
as simply as possible - thanks.

(Coincidentally, Andrea pointed to precisely the same issue in the
no PAGE_ZERO thread, when we were toying with writable but clean.)

> One thing which we haven't sorted out with all this stuff: once the
> application has marked an address range (and some pages) as
> whatever-were-going-call-this-feature, how does the application undo
> that change?

By re-referencing the pages.  (Hmm, so an incorrect app which accesses
"free"d areas, will undo it: well, okay, nothing terrible about that.)

> What effect will things like mremap, madvise and mlock have upon
> these pages?

mlock will undo the state in its make_pages_present: I guess that
should happen in or near follow_page's mark_page_accessed.

mremap?  Other madvises?  Nothing much at all: mremap can move
them around, and the madvises do whatever they do - I don't notice
any problem in that direction, but it'll be easier when we have an
implementation to poke at.

Hugh


Re: missing madvise functionality

2007-04-04 Thread Rik van Riel

Andrew Morton wrote:


There are other ways of doing it - I guess we could use a new page flag to
indicate that this is one-of-those-pages, and add new code to handle it in
all the right places.


That's what I did.  I'm currently working on the
zap_page_range() side of things.


One thing which we haven't sorted out with all this stuff: once the
application has marked an address range (and some pages) as
whatever-were-going-call-this-feature, how does the application undo that
change? 


It doesn't have to do anything.  Just access the page and the
MMU will mark it dirty/accessed and the VM will not reclaim
it.


What effect will things like mremap, madvise and mlock have upon
these pages?


Good point.  I had not thought about these.

Would you mind if I sent an initial proof of concept
patch that does not take these into account, before
we decide on what should happen in these cases? :)



Re: missing madvise functionality

2007-04-04 Thread Andrew Morton
On Wed, 4 Apr 2007 10:15:41 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote:

> On Tue, 3 Apr 2007, Andrew Morton wrote:
> > 
> > All of which indicates that if we can remove the down_write(mmap_sem) from
> > this glibc operation, things should get a lot better - there will be no
> > additional context switches at all.
> > 
> > And we can surely do that if all we're doing is looking up pageframes,
> > putting pages into fake-swapcache and moving them around on the page LRUs.
> > 
> > Hugh?  Sanity check?
> 
> Setting aside the fake-swapcache part, yes, Rik should be able to do what
> Ulrich wants (operating on ptes and pages) without down_write(mmap_sem):
> just needing down_read(mmap_sem) to keep the whole vma/pagetable structure
> stable, and page table lock (literal or per-page-table) for each contents.
> 
> (I didn't understand how Rik would achieve his point 5, _no_ lock
> contention while repeatedly re-marking these pages, but never mind.)
> 
> (Some mails in this thread overlook that we also use down_write(mmap_sem)
> to guard simple things like vma->vm_flags: of course that in itself could
> be manipulated with atomics, or spinlock; but like many of the vma fields,
> changing it goes hand in hand with the chance that we have to split vma,
> which does require the heavy-handed down_write(mmap_sem).  I expect that
> splitting those uses apart would be harder than first appears, and better
> to go for a more radical redesign - I don't know what.)
> 
> But you lose me with the fake-swapcache part of it: that came, I think,
> from your initial idea that it would be okay to refault on these ptes.
> Don't we all agree now that we'd prefer not to refault on those ptes,
> unless some memory pressure has actually decided to pull them out?
> (Hmm, yet more list balancing...)

The way in which we want to treat these pages is (I believe) to keep them
if there's not a lot of memory pressure, but to reclaim them "easily" if
there is some memory pressure.

A simple way to do that is to move them onto the inactive list.  But how do
we handle these pages when the vm scanner encounters them?

The treatment is identical to clean swapcache pages, with the sole
exception that they don't actually consume any swap space - hence the fake
swapcache entry thing.

There are other ways of doing it - I guess we could use a new page flag to
indicate that this is one-of-those-pages, and add new code to handle it in
all the right places.



One thing which we haven't sorted out with all this stuff: once the
application has marked an address range (and some pages) as
whatever-were-going-call-this-feature, how does the application undo that
change?  What effect will things like mremap, madvise and mlock have upon
these pages?


Re: missing madvise functionality

2007-04-04 Thread Hugh Dickins
On Wed, 4 Apr 2007, Rik van Riel wrote:
> Hugh Dickins wrote:
> 
> > (I didn't understand how Rik would achieve his point 5, _no_ lock
> > contention while repeatedly re-marking these pages, but never mind.)
> 
> The CPU marks them accessed when they are reused.
> 
> The VM only moves the reused pages back to the active list
> on memory pressure.  This means that when the system is
> not under memory pressure, the same page can simply stay
> PG_lazyfree for multiple malloc/free rounds.

Sure, there's no need for repetitious locking at the LRU end of it;
but you said "if the system has lots of free memory, pages can go
through multiple free/malloc cycles while sitting on the dontneed
list, very lazily with no lock contention".  I took that to mean,
with userspace repeatedly madvising on the ranges they fall in,
which will involve mmap_sem and ptl each time - just in order
to check that no LRU movement is required each time.

(Of course, there's also the problem that we don't leave our
systems with lots of free memory: some LRU balancing decisions.)

Hugh


Re: missing madvise functionality

2007-04-04 Thread Rik van Riel

Hugh Dickins wrote:


(I didn't understand how Rik would achieve his point 5, _no_ lock
contention while repeatedly re-marking these pages, but never mind.)


The CPU marks them accessed when they are reused.

The VM only moves the reused pages back to the active list
on memory pressure.  This means that when the system is
not under memory pressure, the same page can simply stay
PG_lazyfree for multiple malloc/free rounds.



Re: missing madvise functionality

2007-04-04 Thread Hugh Dickins
On Wed, 4 Apr 2007, Marko Macek wrote:
> Ulrich Drepper wrote:
> > A solution for this problem is a madvise() operation with the following
> > property:
> > 
> >   - the content of the address range can be discarded
> > 
> >   - if an access to a page in the range happens in the future it must
> > succeed.  The old page content can be provided or a new, empty page
> > can be provided
> 
> Doesn't this conflict with disabling overcommit?
> 
> If the page is guaranteed to be available, obviously it must count as
> being commited, so this is not equivalent to real freeing.

No, there's no conflict with disabled overcommit here: Committed_AS
accounting is done on the whole vma size (at mmap or brk time), no
matter how many pages may or may not be faulted in later.  Rather
like RLIMIT_AS.  The proposed madvise operation won't affect it.

(But I take Ulrich's "must succeed" with one pinch of salt:
Out-Of-Memory killing remains a possibility, of course.)

Hugh


Re: missing madvise functionality

2007-04-04 Thread William Lee Irwin III
On Wed, Apr 04, 2007 at 06:09:18AM -0700, William Lee Irwin III wrote:
>   for (--i; i >= 0; --i) {
>   if (pthread_join(th[i], NULL)) {
>   perror("main: pthread_join failed");
>   ret = EXIT_FAILURE;
>   }
>   }

Obligatory brown paper bag patch:


--- ./jakub.c.orig  2007-04-04 05:57:23.409493248 -0700
+++ ./jakub.c   2007-04-04 06:35:34.296043432 -0700
@@ -232,10 +232,14 @@ int main(int argc, char *argv[])
}
}
for (--i; i >= 0; --i) {
-   if (pthread_join(th[i], NULL)) {
+   void *status;
+
+   if (pthread_join(th[i], &status)) {
perror("main: pthread_join failed");
ret = EXIT_FAILURE;
}
+   if (status != (void *)tr_success)
+   ret = EXIT_FAILURE;
}
free(th);
getrusage(RUSAGE_SELF, );


-- wli


Re: missing madvise functionality

2007-04-04 Thread William Lee Irwin III
On Tue, Apr 03, 2007 at 04:29:37PM -0400, Jakub Jelinek wrote:
> void *
> tf (void *arg)
> {
>   (void) arg;
>   size_t ps = sysconf (_SC_PAGE_SIZE);
>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>   if (p == MAP_FAILED)
> exit (1);
>   int i;

Oh dear.


-- wli
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

enum thread_return {
	tr_success	=  0,
	tr_mmap_init	= -1,
	tr_mmap_free	= -2,
	tr_mprotect	= -3,
	tr_madvise	= -4,
	tr_unknown	= -5,
	tr_munmap	= -6,
};

enum release_method {
	release_by_mmap		= 0,
	release_by_madvise	= 1,
	release_by_max		= 2,
};

struct thread_argument {
	size_t page_size;
	int iterations, pages_per_thread, nr_threads;
	enum release_method method;
};

static enum thread_return mmap_release(void *p, size_t n)
{
	void *q;

	q = mmap(p, n, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
	if (p != q) {
		perror("thread_function: mmap release failed");
		return tr_mmap_free;
	}
	if (mprotect(p, n, PROT_READ | PROT_WRITE)) {
		perror("thread_function: mprotect failed");
		return tr_mprotect;
	}
	return tr_success;
}

static enum thread_return madvise_release(void *p, size_t n)
{
	if (madvise(p, n, MADV_DONTNEED)) {
		perror("thread_function: madvise failed");
		return tr_madvise;
	}
	return tr_success;
}

static enum thread_return (*release_methods[])(void *, size_t) = {
	mmap_release,
	madvise_release,
};

static void *thread_function(void *__arg)
{
	char *p;
	int i;
	struct thread_argument *arg = __arg;
	size_t arena_size = arg->pages_per_thread * arg->page_size;

	p = (char *)mmap(NULL, arena_size,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("thread_function: arena allocation failed");
		return (void *)tr_mmap_init;
	}
	for (i = 0; i < arg->iterations; i++) {
		size_t s;
		char *q, *r;
		enum thread_return ret;

		/* Pretend to use the buffer.  */
		r = p + arena_size;
		for (q = p; q < r; q += arg->page_size)
			*q = 1;
		for (s = 0, q = p; q < r; q += arg->page_size)
			s += *q;
		if (arg->method >= release_by_max) {
			perror("thread_function: "
			       "unknown freeing method specified");
			return (void *)tr_unknown;
		}
		ret = (*release_methods[arg->method])(p, arena_size);
		if (ret != tr_success)
			return (void *)ret;
	}
	if (munmap(p, arena_size)) {
		perror("thread_function: munmap() failed");
		return (void *)tr_munmap;
	}
	return (void *)tr_success;
}

static int configure(struct thread_argument *arg, int argc, char *argv[])
{
	char optstring[] = "t:m:i:p:";
	int c, tmp, ret = 0;
	long n;

	n = sysconf(_SC_PAGE_SIZE);
	if (n < 0) {
		perror("configure: sysconf(_SC_PAGE_SIZE) failed");
		ret = -1;
	}
	arg->nr_threads = 32;
	arg->page_size = (size_t)n;
	arg->method = release_by_mmap;
	arg->iterations = 10;
	arg->pages_per_thread = 128;

	while ((c = getopt(argc, argv, optstring)) != -1) {
		switch (c) {
			case 't':
				if (sscanf(optarg, "%d", &tmp) == 1)
					arg->nr_threads = tmp;
				else {
					perror("configure: non-numeric thread count");
					ret = -1;
				}
				break;
			case 'm':
				if (!strcmp(optarg, "mmap"))
					arg->method = release_by_mmap;
				else if (!strcmp(optarg, "madvise"))
					arg->method = release_by_madvise;
				else {
					perror("configure: unrecognised release method");
					ret = -1;
				}
				break;
			case 'i':
				if (sscanf(optarg, "%d", &tmp) == 1)
					arg->iterations = tmp;
				else {
					perror("configure: non-numeric iteration count");
					ret = -1;
				}
				break;
			case 'p':
				if (sscanf(optarg, "%d", &tmp) == 1)
					arg->pages_per_thread = tmp;
				else {
					perror("configure: non-numeric pages per thread count");
					ret = -1;
				}
				break;
			default:
				perror("unrecognized argument");
				ret = -1;
		}
	}
	if (arg->nr_threads <= 0) {
		perror("configure: zero or negative thread count");
		ret = -1;
	}
	if (arg->iterations < 0) {
		perror("configure: negative iteration count");
		ret = -1;
	}
	if (arg->pages_per_thread <= 0) {
		perror("configure: zero or negative arena size");
		ret = -1;
	}
	if (SIZE_MAX/arg->page_size < (size_t)arg->pages_per_thread) {
		perror("configure: arena size overflow");
		ret = -1;
	}
	return ret;
}

static unsigned long long timeval_to_usec(struct timeval *tv)
{
	return 1000000ULL*tv->tv_sec + tv->tv_usec;
}

static unsigned long long elapsed_usec(struct timeval *tv1, struct timeval *tv2)
{
	return timeval_to_usec(tv2) - timeval_to_usec(tv1);
}

#define user_usec(ru)	timeval_to_usec(&(ru)->ru_utime)
#define sys_usec(ru)	timeval_to_usec(&(ru)->ru_stime)
#define user_sec(ru)	((user_usec(ru) % 60000000ULL)/1000000.0)
#define sys_sec(ru)	((sys_usec(ru) % 60000000ULL)/1000000.0)
#define elapsed_sec(tv1, tv2)		\
		((elapsed_usec(tv1, tv2) % 60000000ULL)/1000000.0)

#define user_min(ru)	((unsigned long)((user_usec(ru)/60000000ULL) % 60))
#define sys_min(ru)	((unsigned long)((sys_usec(ru)/60000000ULL) % 

Re: missing madvise functionality

2007-04-04 Thread Eric Dumazet
On Wed, 04 Apr 2007 20:05:54 +1000
Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > @@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
> > unsigned long start;
> >  
> > addr &= PAGE_MASK;
> > -   vma = find_vma(mm,addr);
> > +   vma = find_vma(mm,addr,&current->vmacache);
> > if (!vma)
> > return NULL;
> > if (vma->vm_start <= addr)
> 
> So now you can have current calling find_extend_vma on someone else's mm
> but using their cache. So you're going to return current's vma, or current
> is going to get one of mm's vmas in its cache :P

This was not a working patch, just meant to throw out the idea, since the 
answers I got showed I was not understood.

In this case, find_extend_vma() should of course take a struct 
vm_area_cache * argument, like find_vma().

One single cache on one mm is not scalable. oprofile badly hits it on a dual 
cpu config.



Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Eric Dumazet wrote:


Well, I believe this one is too expensive. I was thinking of a light one :


This one seems worse. Passing your vm_area_cache around everywhere is
intrusive and dangerous, because it becomes decoupled from the mm
struct you are passing around. Watch this:



@@ -1638,7 +1652,7 @@ find_extend_vma(struct mm_struct * mm, u
unsigned long start;
 
 	addr &= PAGE_MASK;

-   vma = find_vma(mm,addr);
+   vma = find_vma(mm,addr,&current->vmacache);
if (!vma)
return NULL;
if (vma->vm_start <= addr)


So now you can have current calling find_extend_vma on someone else's mm
but using their cache. So you're going to return current's vma, or current
is going to get one of mm's vmas in its cache :P

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Eric Dumazet wrote:

On Wed, 04 Apr 2007 18:55:18 +1000
Nick Piggin <[EMAIL PROTECTED]> wrote:



Peter Zijlstra wrote:


On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote:



Eric Dumazet wrote:



I do think such workloads might benefit from a vma_cache not shared by 
all threads but private to each thread. A sequence could invalidate the 
cache(s).


ie instead of a mm->mmap_cache, having a mm->sequence, and each thread 
having a current->mmap_cache and current->mm_sequence


I have a patchset to do exactly this, btw.



/me too

However, I decided against pushing it because when it does happen that a
task is not involved with a vma lookup for longer than it takes the seq
count to wrap we have a stale pointer...

We could go and walk the tasks once in a while to reset the pointer, but
it all got a tad involved.


Well here is my core patch (against I think 2.6.16 + a set of vma cache
cleanups and abstractions). I didn't think the wrapping aspect was
terribly involved.



Well, I believe this one is too expensive. I was thinking of a light one :

I am not deleting mmap_sem, but adding a sequence number to mm_struct, that is 
incremented each time a vma is added/deleted, not each time mmap_sem is taken 
(read or write)


That's exactly what mine does (except IIRC it doesn't invalidate when
you add a vma).



Each thread has its own copy of the sequence, taken at the time find_vma() had 
to do a full lookup.

I believe some optimized paths could call check_vma_cache() without mmap_sem 
read lock taken, and if it fails, take the mmap_sem lock and do the slow path.


The mmap_sem for read does not only protect the mm_rb rbtree structure, but
the vmas themselves as well as their page tables, so you can't do that.

You could do it if you had a lock-per-vma to synchronise against write
operations, and rcu-freed vmas or some such... but I don't think we should
go down a road like that until we first remove mmap_sem from low hanging
things (like private futexes!) and then see who's complaining.

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-04 Thread Eric Dumazet
On Wed, 04 Apr 2007 18:55:18 +1000
Nick Piggin <[EMAIL PROTECTED]> wrote:

> Peter Zijlstra wrote:
> > On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote:
> > 
> >>Eric Dumazet wrote:
> > 
> > 
> >>>I do think such workloads might benefit from a vma_cache not shared by 
> >>>all threads but private to each thread. A sequence could invalidate the 
> >>>cache(s).
> >>>
> >>>ie instead of a mm->mmap_cache, having a mm->sequence, and each thread 
> >>>having a current->mmap_cache and current->mm_sequence
> >>
> >>I have a patchset to do exactly this, btw.
> > 
> > 
> > /me too
> > 
> > However, I decided against pushing it because when it does happen that a
> > task is not involved with a vma lookup for longer than it takes the seq
> > count to wrap we have a stale pointer...
> > 
> > We could go and walk the tasks once in a while to reset the pointer, but
> > it all got a tad involved.
> 
> Well here is my core patch (against I think 2.6.16 + a set of vma cache
> cleanups and abstractions). I didn't think the wrapping aspect was
> terribly involved.

Well, I believe this one is too expensive. I was thinking of a light one :

I am not deleting mmap_sem, but adding a sequence number to mm_struct, that is 
incremented each time a vma is added/deleted, not each time mmap_sem is taken 
(read or write)

Each thread has its own copy of the sequence, taken at the time find_vma() had 
to do a full lookup.

I believe some optimized paths could call check_vma_cache() without mmap_sem 
read lock taken, and if it fails, take the mmap_sem lock and do the slow path.


--- linux-2.6.21-rc5/include/linux/sched.h
+++ linux-2.6.21-rc5-ed/include/linux/sched.h
@@ -319,10 +319,14 @@ typedef unsigned long mm_counter_t;
(mm)->hiwater_vm = (mm)->total_vm;  \
 } while (0)
 
+struct vm_area_cache {
+   struct vm_area_struct * mmap_cache; /* last find_vma result */
+   unsigned int sequence;
+   };
+
 struct mm_struct {
struct vm_area_struct * mmap;   /* list of VMAs */
struct rb_root mm_rb;
-   struct vm_area_struct * mmap_cache; /* last find_vma result */
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
@@ -336,6 +340,7 @@ struct mm_struct {
atomic_t mm_count;  /* How many references to 
"struct mm_struct" (users count as 1) */
int map_count;  /* number of VMAs */
struct rw_semaphore mmap_sem;
+   unsigned int mm_sequence;
spinlock_t page_table_lock; /* Protects page tables and 
some counters */
 
struct list_head mmlist;/* List of maybe swapped mm's.  
These are globally strung
@@ -875,7 +880,7 @@ struct task_struct {
struct list_head tasks;
 
struct mm_struct *mm, *active_mm;
-
+   struct vm_area_cache vmacache;
 /* task state */
struct linux_binfmt *binfmt;
int exit_state;
--- linux-2.6.21-rc5/include/linux/mm.h
+++ linux-2.6.21-rc5-ed/include/linux/mm.h
@@ -1176,15 +1176,18 @@ extern int expand_upwards(struct vm_area
 #endif
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long 
addr);
+extern struct vm_area_struct * find_vma(struct mm_struct * mm,
+   unsigned long addr,
+   struct vm_area_cache *cache);
 extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned 
long addr,
 struct vm_area_struct **pprev);
 
 /* Look up the first VMA which intersects the interval start_addr..end_addr-1,
NULL if none.  Assume start_addr < end_addr. */
-static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * 
mm, unsigned long start_addr, unsigned long end_addr)
+static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * 
mm, 
+   unsigned long start_addr, unsigned long end_addr, struct vm_area_cache 
*cache)
 {
-   struct vm_area_struct * vma = find_vma(mm,start_addr);
+   struct vm_area_struct * vma = find_vma(mm,start_addr,cache);
 
if (vma && end_addr <= vma->vm_start)
vma = NULL;
--- linux-2.6.21-rc5/mm/mmap.c
+++ linux-2.6.21-rc5-ed/mm/mmap.c
@@ -267,7 +267,7 @@ asmlinkage unsigned long sys_brk(unsigne
}
 
/* Check against existing mmap mappings. */
-   if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
-   if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
+   if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE, 
&current->vmacache))
goto out;
 
/* Ok, looks good - let it rip. */
@@ -447,6 +447,7 @@ static void vma_link(struct mm_struct *m
 	spin_unlock(&mapping->i_mmap_lock);
 
mm->map_count++;
+   mm->mm_sequence++;
validate_mm(mm);
 }
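
Eric's lightweight scheme -- a per-mm sequence number bumped on every vma
add/delete, with each thread holding a private (vma, sequence) pair -- can
be sketched in miniature user-space C. This is not code from the thread;
all names are hypothetical, it is single-threaded for clarity, and the
real lookup would of course run under mmap_sem:

```c
#include <stddef.h>

/* Hypothetical miniature of the proposal. */
struct vma { unsigned long start, end; };

struct mm {
	unsigned int sequence;	/* bumped on every vma add/delete */
	/* ... rb-tree of vmas, mmap_sem, ... */
};

struct thread_cache {
	struct vma *vma;	/* last find_vma() result for this thread */
	unsigned int sequence;	/* mm->sequence when it was cached */
};

/* Fast path: the cached vma is trusted only while no vma has been
 * added or deleted since it was stored. */
static struct vma *cache_lookup(struct mm *mm, struct thread_cache *tc,
				unsigned long addr)
{
	if (tc->vma && tc->sequence == mm->sequence &&
	    addr >= tc->vma->start && addr < tc->vma->end)
		return tc->vma;
	return NULL;	/* miss: caller does the full rb-tree walk */
}

/* Slow path refills the private cache after a full lookup. */
static void cache_fill(struct mm *mm, struct thread_cache *tc,
		       struct vma *vma)
{
	tc->vma = vma;
	tc->sequence = mm->sequence;
}
```

Bumping mm->sequence on a vma change invalidates every thread's private
cache at once without touching the other threads (modulo the sequence
wrap problem raised earlier in the thread).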

Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

William Lee Irwin III wrote:

On Wed, Apr 04, 2007 at 06:55:18PM +1000, Nick Piggin wrote:


+   rcu_read_lock();
+   do {
+   t->vma_cache_sequence = -1;
+   t = next_thread(t);
+   } while (t != curr);
+   rcu_read_unlock();



LD_ASSUME_KERNEL=2.4.18 anyone?


Meaning?

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-04 Thread Hugh Dickins
On Tue, 3 Apr 2007, Andrew Morton wrote:
> 
> All of which indicates that if we can remove the down_write(mmap_sem) from
> this glibc operation, things should get a lot better - there will be no
> additional context switches at all.
> 
> And we can surely do that if all we're doing is looking up pageframes,
> putting pages into fake-swapcache and moving them around on the page LRUs.
> 
> Hugh?  Sanity check?

Setting aside the fake-swapcache part, yes, Rik should be able to do what
Ulrich wants (operating on ptes and pages) without down_write(mmap_sem):
just needing down_read(mmap_sem) to keep the whole vma/pagetable structure
stable, and page table lock (literal or per-page-table) for each contents.

(I didn't understand how Rik would achieve his point 5, _no_ lock
contention while repeatedly re-marking these pages, but never mind.)

(Some mails in this thread overlook that we also use down_write(mmap_sem)
to guard simple things like vma->vm_flags: of course that in itself could
be manipulated with atomics, or spinlock; but like many of the vma fields,
changing it goes hand in hand with the chance that we have to split vma,
which does require the heavy-handed down_write(mmap_sem).  I expect that
splitting those uses apart would be harder than first appears, and better
to go for a more radical redesign - I don't know what.)

But you lose me with the fake-swapcache part of it: that came, I think,
from your initial idea that it would be okay to refault on these ptes.
Don't we all agree now that we'd prefer not to refault on those ptes,
unless some memory pressure has actually decided to pull them out?
(Hmm, yet more list balancing...)

Hugh


Re: missing madvise functionality

2007-04-04 Thread William Lee Irwin III
On Wed, Apr 04, 2007 at 06:55:18PM +1000, Nick Piggin wrote:
> + rcu_read_lock();
> + do {
> + t->vma_cache_sequence = -1;
> + t = next_thread(t);
> + } while (t != curr);
> + rcu_read_unlock();

LD_ASSUME_KERNEL=2.4.18 anyone?


-- wli


Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Peter Zijlstra wrote:

On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote:


Eric Dumazet wrote:



I do think such workloads might benefit from a vma_cache not shared by 
all threads but private to each thread. A sequence could invalidate the 
cache(s).


ie instead of a mm->mmap_cache, having a mm->sequence, and each thread 
having a current->mmap_cache and current->mm_sequence


I have a patchset to do exactly this, btw.



/me too

However, I decided against pushing it because when it does happen that a
task is not involved with a vma lookup for longer than it takes the seq
count to wrap we have a stale pointer...

We could go and walk the tasks once in a while to reset the pointer, but
it all got a tad involved.


Well here is my core patch (against I think 2.6.16 + a set of vma cache
cleanups and abstractions). I didn't think the wrapping aspect was
terribly involved.

--
SUSE Labs, Novell Inc.
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -296,6 +296,8 @@ struct mm_struct {
struct vm_area_struct *mmap;/* list of VMAs */
struct rb_root mm_rb;
struct vm_area_struct *vma_cache;   /* find_vma cache */
+   unsigned long vma_sequence;
+
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
@@ -693,6 +695,8 @@ enum sleep_type {
SLEEP_INTERRUPTED,
 };
 
+#define VMA_CACHE_SIZE 4
+
 struct task_struct {
volatile long state;/* -1 unrunnable, 0 runnable, >0 stopped */
struct thread_info *thread_info;
@@ -734,6 +738,8 @@ struct task_struct {
struct list_head ptrace_list;
 
struct mm_struct *mm, *active_mm;
+   struct vm_area_struct *vma_cache[VMA_CACHE_SIZE];
+   unsigned long vma_cache_sequence;
 
 /* task state */
struct linux_binfmt *binfmt;
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -32,6 +32,40 @@
 
 static void vma_cache_touch(struct mm_struct *mm, struct vm_area_struct *vma)
 {
+   struct task_struct *curr = current;
+   if (mm == curr->mm) {
+   int i;
+   if (curr->vma_cache_sequence != mm->vma_sequence) {
+   curr->vma_cache_sequence = mm->vma_sequence;
+   curr->vma_cache[0] = vma;
+   for (i = 1; i < VMA_CACHE_SIZE; i++)
+   curr->vma_cache[i] = NULL;
+   } else {
+   int update_mm;
+
+   if (curr->vma_cache[0] == vma)
+   return;
+
+   for (i = 1; i < VMA_CACHE_SIZE; i++) {
+   if (curr->vma_cache[i] == vma)
+   break;
+   }
+   update_mm = 0;
+   if (i == VMA_CACHE_SIZE) {
+   update_mm = 1;
+   i = VMA_CACHE_SIZE-1;
+   }
+   while (i != 0) {
+   curr->vma_cache[i] = curr->vma_cache[i-1];
+   i--;
+   }
+   curr->vma_cache[0] = vma;
+
+   if (!update_mm)
+   return;
+   }
+   }
+
if (mm->vma_cache != vma) /* prevent cacheline bouncing */
mm->vma_cache = vma;
 }
@@ -39,27 +73,56 @@ static void vma_cache_touch(struct mm_st
 static void vma_cache_replace(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *repl)
 {
+   mm->vma_sequence++;
+   if (unlikely(mm->vma_sequence == 0)) {
+   struct task_struct *curr = current, *t;
+   t = curr;
+   rcu_read_lock();
+   do {
+   t->vma_cache_sequence = -1;
+   t = next_thread(t);
+   } while (t != curr);
+   rcu_read_unlock();
+   }
+
if (mm->vma_cache == vma)
mm->vma_cache = repl;
 }
 
 static void vma_cache_invalidate(struct mm_struct *mm, struct vm_area_struct 
*vma)
 {
-   if (mm->vma_cache == vma)
-   mm->vma_cache = NULL;
+   vma_cache_replace(mm, vma, NULL);
 }
 
 static struct vm_area_struct *vma_cache_find(struct mm_struct *mm,
unsigned long addr)
 {
-   struct vm_area_struct *vma = mm->vma_cache;
+   struct task_struct *curr;
+   struct vm_area_struct *vma;
 
preempt_disable();
__inc_page_state(vma_cache_query);
-   if (vma 

Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Jakub Jelinek wrote:

On Wed, Apr 04, 2007 at 05:46:12PM +1000, Nick Piggin wrote:


Does mmap(PROT_NONE) actually free the memory?



Yes.
/* Clear old maps */
error = -ENOMEM;
munmap_back:
vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
if (vma && vma->vm_start < addr + len) {
if (do_munmap(mm, addr, len))
return -ENOMEM;
goto munmap_back;
}


Thanks, I overlooked the mmap vs mprotect detail. So how are the subsequent
access faults avoided?



In the case of pages being unused then almost immediately reused, why is
it a bad solution to avoid freeing? Is it that you want to avoid
heuristics because in some cases they could fail and end up using memory?



free(3) doesn't know if the memory will be reused soon, late or never.
So avoiding trimming could substantially increase memory consumption with
certain malloc/free patterns, especially in threaded programs that use
multiple arenas.  Implementing some sort of deferred memory trimming
in malloc is "solving" the problem in a wrong place, each app really has no
idea (and should not have) what the current system memory pressure is.


Thanks for the clarification.



Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
than a syscall? (including the cost of the TLB fill for the memory access
after the syscall, of course).



That's page fault per page rather than a syscall for the whole chunk,
furthermore zeroing is expensive.


Ah, for big allocations. OK, we could make a MADV_POPULATE to prefault
pages (like mmap's MAP_POPULATE, but without the down_write(mmap_sem)).

If you're just about to use the pages anyway, how much of a win would
it be to avoid zeroing? We allocate cache hot pages for these guys...

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-04 Thread Peter Zijlstra
On Wed, 2007-04-04 at 12:22 +1000, Nick Piggin wrote:
> Eric Dumazet wrote:

> > I do think such workloads might benefit from a vma_cache not shared by 
> > all threads but private to each thread. A sequence could invalidate the 
> > cache(s).
> > 
> > ie instead of a mm->mmap_cache, having a mm->sequence, and each thread 
> > having a current->mmap_cache and current->mm_sequence
> 
> I have a patchset to do exactly this, btw.

/me too

However, I decided against pushing it because when it does happen that a
task is not involved with a vma lookup for longer than it takes the seq
count to wrap we have a stale pointer...

We could go and walk the tasks once in a while to reset the pointer, but
it all got a tad involved.



Re: missing madvise functionality

2007-04-04 Thread Jakub Jelinek
On Wed, Apr 04, 2007 at 05:46:12PM +1000, Nick Piggin wrote:
> Does mmap(PROT_NONE) actually free the memory?

Yes.
/* Clear old maps */
error = -ENOMEM;
munmap_back:
vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
if (vma && vma->vm_start < addr + len) {
if (do_munmap(mm, addr, len))
return -ENOMEM;
goto munmap_back;
}

> In the case of pages being unused then almost immediately reused, why is
> it a bad solution to avoid freeing? Is it that you want to avoid
> heuristics because in some cases they could fail and end up using memory?

free(3) doesn't know if the memory will be reused soon, late or never.
So avoiding trimming could substantially increase memory consumption with
certain malloc/free patterns, especially in threaded programs that use
multiple arenas.  Implementing some sort of deferred memory trimming
in malloc is "solving" the problem in a wrong place, each app really has no
idea (and should not have) what the current system memory pressure is.

> Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
> than a syscall? (including the cost of the TLB fill for the memory access
> after the syscall, of course).

That's page fault per page rather than a syscall for the whole chunk,
furthermore zeroing is expensive.

We really want something like FreeBSD MADV_FREE in Linux, see e.g.
http://mail.nl.linux.org/linux-mm/2000-03/msg00059.html
for some details.  Apparently FreeBSD malloc is using MADV_FREE for years
(according to their CVS for 10 years already).

Jakub


Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Nick Piggin wrote:

Ulrich Drepper wrote:


People might remember the thread about mysql not scaling and pointing
the finger quite happily at glibc.  Well, the situation is not like that.

The problem is glibc has to work around kernel limitations.  If the
malloc implementation detects that a large chunk of previously allocated
memory is now free and unused it wants to return the memory to the
system.  What we currently have to do is this:

  to free:  mmap(PROT_NONE) over the area
  to reuse: mprotect(PROT_READ|PROT_WRITE)

Yep, that's expensive, both operations need to get locks preventing
other threads from doing the same.

Some people were quick to suggest that we simply avoid the freeing in
many situations (that's what the patch submitted by Yanmin Zhang
basically does).  That's no solution.  One of the very good properties
of the current allocator is that it does not use much memory.



Does mmap(PROT_NONE) actually free the memory?



A solution for this problem is a madvise() operation with the following
property:

  - the content of the address range can be discarded

  - if an access to a page in the range happens in the future it must
succeed.  The old page content can be provided or a new, empty page
can be provided

That's it.  The current MADV_DONTNEED doesn't cut it because it zaps the
pages, causing *all* future reuses to create page faults.  This is what
I guess happens in the mysql test case where the pages were unused and
freed but then almost immediately reused.  The page faults erased all
the benefits of using one mprotect() call vs a pair of mmap()/mprotect()
calls.



Two questions.

In the case of pages being unused then almost immediately reused, why is
it a bad solution to avoid freeing? Is it that you want to avoid
heuristics because in some cases they could fail and end up using memory?

Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
than a syscall? (including the cost of the TLB fill for the memory access
after the syscall, of course).

zapping the pages puts them on a nice LIFO cache hot list of pages that
can be quickly used when the next fault comes in, or used for any other
allocation in the kernel. Putting them on some sort of reclaim list seems
a bit pointless.

Oh, also: something like this patch would help out MADV_DONTNEED, as it
means it can run concurrently with page faults. I think the locking will
work (but needs forward porting).


BTW. and this way it becomes much more attractive than using mmap/mprotect
can ever be, because they must take mmap_sem for writing always.

You don't actually need to protect the ranges unless running with use after
free debugging turned on, do you?

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Ulrich Drepper wrote:

People might remember the thread about mysql not scaling and pointing
the finger quite happily at glibc.  Well, the situation is not like that.

The problem is glibc has to work around kernel limitations.  If the
malloc implementation detects that a large chunk of previously allocated
memory is now free and unused it wants to return the memory to the
system.  What we currently have to do is this:

  to free:  mmap(PROT_NONE) over the area
  to reuse: mprotect(PROT_READ|PROT_WRITE)

Yep, that's expensive, both operations need to get locks preventing
other threads from doing the same.

Some people were quick to suggest that we simply avoid the freeing in
many situations (that's what the patch submitted by Yanmin Zhang
basically does).  That's no solution.  One of the very good properties
of the current allocator is that it does not use much memory.


Does mmap(PROT_NONE) actually free the memory?



A solution for this problem is a madvise() operation with the following
property:

  - the content of the address range can be discarded

  - if an access to a page in the range happens in the future it must
succeed.  The old page content can be provided or a new, empty page
can be provided

That's it.  The current MADV_DONTNEED doesn't cut it because it zaps the
pages, causing *all* future reuses to create page faults.  This is what
I guess happens in the mysql test case where the pages were unused and
freed but then almost immediately reused.  The page faults erased all
the benefits of using one mprotect() call vs a pair of mmap()/mprotect()
calls.


Two questions.

In the case of pages being unused then almost immediately reused, why is
it a bad solution to avoid freeing? Is it that you want to avoid
heuristics because in some cases they could fail and end up using memory?

Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
than a syscall? (including the cost of the TLB fill for the memory access
after the syscall, of course).

zapping the pages puts them on a nice LIFO cache hot list of pages that
can be quickly used when the next fault comes in, or used for any other
allocation in the kernel. Putting them on some sort of reclaim list seems
a bit pointless.

Oh, also: something like this patch would help out MADV_DONTNEED, as it
means it can run concurrently with page faults. I think the locking will
work (but needs forward porting).

--
SUSE Labs, Novell Inc.
Index: linux-2.6/mm/madvise.c
===
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -12,6 +12,25 @@
 #include 
 
 /*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_sem for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static int madvise_need_mmap_write(int behavior)
+{
+   switch (behavior) {
+   case MADV_DOFORK:
+   case MADV_DONTFORK:
+   case MADV_NORMAL:
+   case MADV_SEQUENTIAL:
+   case MADV_RANDOM:
+   return 1;
+   default:
+   return 0;
+   }
+}
+
+/*
  * We can potentially split a vm area into separate
  * areas, each area with its own behavior.
  */
@@ -264,7 +283,10 @@ asmlinkage long sys_madvise(unsigned lon
int error = -EINVAL;
size_t len;
 
-   down_write(&current->mm->mmap_sem);
+   if (madvise_need_mmap_write(behavior))
+   down_write(&current->mm->mmap_sem);
+   else
+   down_read(&current->mm->mmap_sem);
 
if (start & ~PAGE_MASK)
goto out;
@@ -323,6 +345,10 @@ asmlinkage long sys_madvise(unsigned lon
vma = prev->vm_next;
}
 out:
-   up_write(&current->mm->mmap_sem);
+   if (madvise_need_mmap_write(behavior))
+   up_write(&current->mm->mmap_sem);
+   else
+   up_read(&current->mm->mmap_sem);
+
return error;
 }


Re: [patches] threaded vma patches (was Re: missing madvise functionality)

2007-04-04 Thread Eric Dumazet
On Tue, 03 Apr 2007 23:54:42 -0700
Ulrich Drepper <[EMAIL PROTECTED]> wrote:

> Eric Dumazet wrote:
> > You were CC on this one, you can find an archive here :
> 
> You cc:ed my gmail account.  I don't pick out mails sent to me there.
> If you want me to look at something you have to send it to my
> @redhat.com address.

What I meant is : You got the mails and even replied to one of them :)

http://lkml.org/lkml/2007/3/15/303

I will try to remember your email address, thanks.


Re: [patches] threaded vma patches (was Re: missing madvise functionality)

2007-04-04 Thread Ulrich Drepper
Eric Dumazet wrote:
> You were CC on this one, you can find an archive here :

You cc:ed my gmail account.  I don't pick out mails sent to me there.
If you want me to look at something you have to send it to my
@redhat.com address.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [patches] threaded vma patches (was Re: missing madvise functionality)

2007-04-04 Thread Eric Dumazet

Ulrich Drepper a écrit :

Nick Piggin wrote:

Sad. Although Ulrich did seem interested at one point I think? Ulrich,
do you agree at least with the interface that Eric is proposing?


I have no idea what you're talking about.



You were CC on this one, you can find an archive here :

http://lkml.org/lkml/2007/3/15/230

This avoids mmap_sem for private futexes (PTHREAD_PROCESS_PRIVATE semantics)



Re: [patches] threaded vma patches (was Re: missing madvise functionality)

2007-04-04 Thread Nick Piggin

Ulrich Drepper wrote:

Nick Piggin wrote:


Sad. Although Ulrich did seem interested at one point I think? Ulrich,
do you agree at least with the interface that Eric is proposing?



I have no idea what you're talking about.



Private futexes.

--
SUSE Labs, Novell Inc.


Re: [patches] threaded vma patches (was Re: missing madvise functionality)

2007-04-04 Thread Ulrich Drepper
Nick Piggin wrote:
> Sad. Although Ulrich did seem interested at one point I think? Ulrich,
> do you agree at least with the interface that Eric is proposing?

I have no idea what you're talking about.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [patches] threaded vma patches (was Re: missing madvise functionality)

2007-04-04 Thread Nick Piggin

(sorry to change the subject, I was initially going to send the
threaded vma cache patches on list, but then decided they didn't
have enough changelog!)

Andrew Morton wrote:

On Wed, 04 Apr 2007 16:09:40 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:



Andrew, do you have any objections to putting Eric's fairly
important patch at least into -mm?



you know what to do ;)



Well I did review them when he last posted, but simply didn't have
much to say (that happened in a much older discussion about the
private futex problem, and I ended up agreeing with this approach).
Anyway I'll have another look when they get posted again.

--
SUSE Labs, Novell Inc.


Re: [patches] threaded vma patches (was Re: missing madvise functionality)

2007-04-04 Thread Andrew Morton
On Wed, 04 Apr 2007 16:09:40 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> Andrew, do you have any objections to putting Eric's fairly
> important patch at least into -mm?

you know what to do ;)


[patches] threaded vma patches (was Re: missing madvise functionality)

2007-04-04 Thread Nick Piggin

Eric Dumazet wrote:

Nick Piggin a écrit :


Eric Dumazet wrote:



I do think such workloads might benefit from a vma_cache not shared 
by all threads but private to each thread. A sequence could 
invalidate the cache(s).


ie instead of a mm->mmap_cache, having a mm->sequence, and each 
thread having a current->mmap_cache and current->mm_sequence



I have a patchset to do exactly this, btw.



Could you repost it please ?


Sure. I'll send you them privately because they're against an older
kernel.


Anyway what is the status of the private futex work. I don't think that
is very intrusive or complicated, so it should get merged ASAP (so then
at least we have the interface there).



It seems nobody but you and me cared.


Sad. Although Ulrich did seem interested at one point I think? Ulrich,
do you agree at least with the interface that Eric is proposing? If
yes, then Andrew, do you have any objections to putting Eric's fairly
important patch at least into -mm?

--
SUSE Labs, Novell Inc.


Re: missing madvise functionality

2007-04-04 Thread Nick Piggin

Ulrich Drepper wrote:

People might remember the thread about mysql not scaling and pointing
the finger quite happily at glibc.  Well, the situation is not like that.

The problem is glibc has to work around kernel limitations.  If the
malloc implementation detects that a large chunk of previously allocated
memory is now free and unused it wants to return the memory to the
system.  What we currently have to do is this:

  to free:  mmap(PROT_NONE) over the area
  to reuse: mprotect(PROT_READ|PROT_WRITE)

Yep, that's expensive, both operations need to get locks preventing
other threads from doing the same.

Some people were quick to suggest that we simply avoid the freeing in
many situations (that's what the patch submitted by Yanmin Zhang
basically does).  That's no solution.  One of the very good properties
of the current allocator is that it does not use much memory.


Does mmap(PROT_NONE) actually free the memory?



A solution for this problem is a madvise() operation with the following
property:

  - the content of the address range can be discarded

  - if an access to a page in the range happens in the future it must
succeed.  The old page content can be provided or a new, empty page
can be provided

That's it.  The current MADV_DONTNEED doesn't cut it because it zaps the
pages, causing *all* future reuses to create page faults.  This is what
I guess happens in the mysql test case where the pages where unused and
freed but then almost immediately reused.  The page faults erased all
the benefits of using one mprotect() call vs a pair of mmap()/mprotect()
calls.


Two questions.

In the case of pages being unused then almost immediately reused, why is
it a bad solution to avoid freeing? Is it that you want to avoid
heuristics because in some cases they could fail and end up using memory?

Secondly, why is MADV_DONTNEED bad? How much more expensive is a pagefault
than a syscall? (including the cost of the TLB fill for the memory access
after the syscall, of course).

zapping the pages puts them on a nice LIFO cache hot list of pages that
can be quickly used when the next fault comes in, or used for any other
allocation in the kernel. Putting them on some sort of reclaim list seems
a bit pointless.

Oh, also: something like this patch would help out MADV_DONTNEED, as it
means it can run concurrently with page faults. I think the locking will
work (but needs forward porting).

--
SUSE Labs, Novell Inc.
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -12,6 +12,25 @@
 #include <linux/hugetlb.h>
 
 /*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_sem for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static int madvise_need_mmap_write(int behavior)
+{
+   switch (behavior) {
+   case MADV_DOFORK:
+   case MADV_DONTFORK:
+   case MADV_NORMAL:
+   case MADV_SEQUENTIAL:
+   case MADV_RANDOM:
+   return 1;
+   default:
+   return 0;
+   }
+}
+
+/*
  * We can potentially split a vm area into separate
  * areas, each area with its own behavior.
  */
@@ -264,7 +283,10 @@ asmlinkage long sys_madvise(unsigned lon
int error = -EINVAL;
size_t len;
 
-   down_write(&current->mm->mmap_sem);
+   if (madvise_need_mmap_write(behavior))
+   down_write(&current->mm->mmap_sem);
+   else
+   down_read(&current->mm->mmap_sem);
 
if (start & ~PAGE_MASK)
goto out;
@@ -323,6 +345,10 @@ asmlinkage long sys_madvise(unsigned lon
vma = prev->vm_next;
}
 out:
-   up_write(&current->mm->mmap_sem);
+   if (madvise_need_mmap_write(behavior))
+   up_write(&current->mm->mmap_sem);
+   else
+   up_read(&current->mm->mmap_sem);
+
return error;
 }

