date:20071003

[PATCH] JBD2/Ext4: Convert kmalloc to kzalloc in jbd2/ext4

2007-10-03 Thread Theodore Ts'o

From: Mingming Cao <[EMAIL PROTECTED]>

Convert kmalloc to kzalloc() and get rid of the memset().

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext4/xattr.c   |3 +--
 fs/jbd2/journal.c |3 +--
 fs/jbd2/transaction.c |3 +--
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index b10d68f..12c7d65 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -750,12 +750,11 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
}
} else {
/* Allocate a buffer where we construct the new block. */
-   s->base = kmalloc(sb->s_blocksize, GFP_KERNEL);
+   s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
/* assert(header == s->base) */
error = -ENOMEM;
if (s->base == NULL)
goto cleanup;
-   memset(s->base, 0, sb->s_blocksize);
header(s->base)->h_magic = cpu_to_le32(EXT4_XATTR_MAGIC);
header(s->base)->h_blocks = cpu_to_le32(1);
header(s->base)->h_refcount = cpu_to_le32(1);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 0e329a3..f12c65b 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -654,10 +654,9 @@ static journal_t * journal_init_common (void)
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+   journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
-   memset(journal, 0, sizeof(*journal));
 
init_waitqueue_head(&journal->j_wait_transaction_locked);
init_waitqueue_head(&journal->j_wait_logspace);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a5fb70f..b1fcf2b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -96,13 +96,12 @@ static int start_this_handle(journal_t *journal, handle_t *handle)
 
 alloc_transaction:
if (!journal->j_running_transaction) {
-   new_transaction = kmalloc(sizeof(*new_transaction),
+   new_transaction = kzalloc(sizeof(*new_transaction),
GFP_NOFS|__GFP_NOFAIL);
if (!new_transaction) {
ret = -ENOMEM;
goto out;
}
-   memset(new_transaction, 0, sizeof(*new_transaction));
}
 
jbd_debug(3, "New handle %p going live.\n", handle);
-- 
1.5.3.2.81.g17ed

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
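
For readers outside the kernel, the conversion in the patch above can be mirrored in userspace. This is only a sketch under assumptions: calloc() stands in for kzalloc(), malloc() for kmalloc(), and struct journal_like and both function names are hypothetical, not taken from jbd2.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for journal_t; it only exists to have fields to zero. */
struct journal_like {
    int   flags;
    void *running_transaction;
};

/* Before: allocate, check for failure, then clear the memory by hand. */
static struct journal_like *journal_alloc_old(void)
{
    struct journal_like *j = malloc(sizeof(*j));

    if (!j)
        return NULL;
    memset(j, 0, sizeof(*j));
    return j;
}

/* After: one call that already returns zeroed memory
 * (calloc here, kzalloc in the kernel). */
static struct journal_like *journal_alloc_new(void)
{
    return calloc(1, sizeof(struct journal_like));
}
```

Both helpers hand back an identically zeroed object; the second simply cannot forget the memset() or get its size argument wrong.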

[PATCH] jbd/jbd2: Journal initialization doesn't need __GFP_NOFAIL

2007-10-03 Thread Theodore Ts'o

From: Aneesh Kumar K.V <[EMAIL PROTECTED]>

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
Signed-off-by: "Theodore Ts'o" <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c  |2 +-
 fs/jbd2/journal.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index ae2c25d..8d6d475 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (void)
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 4281244..0e329a3 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (void)
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
-- 
1.5.3.2.81.g17ed


Re: Network slowdown due to CFS

2007-10-03 Thread Casey Dahlin


Ingo Molnar wrote:

* Jarek Poplawski <[EMAIL PROTECTED]> wrote:
  
[...] (Btw, in -rc8-mm2 I see new sched_slice() function which seems 
to return... time.)



wrong again. That is a function, not a variable to be cleared.


It still gives us a target time, so could we not simply have sched_yield 
put the thread completely to sleep for the given amount of time? It 
wholly redefines the operation, and it's far more expensive (now there's 
a whole new timer involved) but it might emulate the expected behavior. 
It's hideous, but so is sched_yield in the first place, so why not?


--CJD
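
A rough userspace sketch of that idea, assuming the caller already knows the slice length (the kernel's sched_slice() is not reachable from userspace, so slice_ns is simply a parameter here). yield_by_sleeping() is a hypothetical name; nanosleep() is the standard POSIX call.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical helper: instead of yielding, sleep for slice_ns nanoseconds.
 * Returns 0 on success, -1 if the sleep was interrupted (see nanosleep(2)). */
static int yield_by_sleeping(long slice_ns)
{
    struct timespec req;

    req.tv_sec  = slice_ns / 1000000000L;
    req.tv_nsec = slice_ns % 1000000000L;
    return nanosleep(&req, NULL);
}
```

This shows the cost mentioned above: every call arms a timer and takes the thread off the runqueue, whereas a yield only reorders runnable threads.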

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-03 Thread Torsten Kaiser

On 10/3/07, Matt Mackall <[EMAIL PROTECTED]> wrote:
> Well I can see no reason why the vma we just got to by the mm->mmap
> would have a vm_mm != mm, but I've certainly been wrong before.
>
> Try changing it to:
>
> for (vma = mm->mmap; vma; vma = vma->vm_next)
> if (!is_vm_hugetlb_page(vma)) {
> if (vma->vm_mm != mm)
> printk("WTF: vma->vm_mm %p mm %p\n",
> vma->vm_mm, mm);
> walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
> &clear_refs_walk, vma);
> }

You were right.
I was able to trigger the error with the above printk added, but nothing
was written to the syslog.

So now I'm rather out of ideas what to test... :(

Torsten

[rfc][patch 2/3] x86: fix IO write barriers

2007-10-03 Thread Nick Piggin


wmb() on x86 must always include a barrier, because stores can go out of
order in many cases when dealing with devices (eg. WC memory).

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/include/asm-i386/system.h
===
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -216,6 +216,7 @@ static inline unsigned long get_limit(un
 
 #define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
 #define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
+#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 
 /**
  * read_barrier_depends - Flush all pending reads that subsequents reads
@@ -271,18 +272,14 @@ static inline unsigned long get_limit(un
 
 #define read_barrier_depends() do { } while(0)
 
-#ifdef CONFIG_X86_OOSTORE
-/* Actually there are no OOO store capable CPUs for now that do SSE, 
-   but make it already an possibility. */
-#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
-#else
-#define wmb()  __asm__ __volatile__ ("": : :"memory")
-#endif
-
 #ifdef CONFIG_SMP
 #define smp_mb()   mb()
 #define smp_rmb()  rmb()
-#define smp_wmb()  wmb()
+#ifdef CONFIG_X86_OOSTORE
+# define smp_wmb() wmb()
+#else
+# define smp_wmb() barrier()
+#endif
 #define smp_read_barrier_depends() read_barrier_depends()
 #define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
 #else
Index: linux-2.6/include/asm-x86_64/system.h
===
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ linux-2.6/include/asm-x86_64/system.h
@@ -159,12 +159,8 @@ static inline void write_cr8(unsigned lo
  */
 #define mb()   asm volatile("mfence":::"memory")
 #define rmb()  asm volatile("lfence":::"memory")
-
-#ifdef CONFIG_UNORDERED_IO
 #define wmb()  asm volatile("sfence" ::: "memory")
-#else
-#define wmb()  asm volatile("" ::: "memory")
-#endif
+
 #define read_barrier_depends() do {} while(0)
 #define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
 

[rfc][patch 3/3] x86: optimise barriers

2007-10-03 Thread Nick Piggin


According to latest memory ordering specification documents from Intel and
AMD, both manufacturers are committed to in-order loads from cacheable memory
for the x86 architecture. Hence, smp_rmb() may be a simple barrier.

Also according to those documents, and according to existing practice in Linux
(eg. spin_unlock doesn't enforce ordering), stores to cacheable memory are
visible in program order too. Special string stores are safe -- their
constituent stores may be out of order, but they must complete in order WRT
surrounding stores. Nontemporal stores to WB memory can go out of order, and so
they should be fenced explicitly to make them appear in-order WRT other stores.
Hence, smp_wmb() may be a simple barrier.

http://developer.intel.com/products/processor/manuals/318147.pdf
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

In userspace microbenchmarks on a core2 system, fence instructions range
anywhere from around 15 cycles to 50, which may not be totally insignificant
in performance critical paths (code size will go down too).

However the primary motivation for this is to have the canonical barrier
implementation for x86 architecture.

smp_rmb on buggy pentium pros remains a locked op, which is apparently
required.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
Index: linux-2.6/include/asm-i386/system.h
===
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -274,7 +274,11 @@ static inline unsigned long get_limit(un
 
 #ifdef CONFIG_SMP
 #define smp_mb()   mb()
-#define smp_rmb()  rmb()
+#ifdef CONFIG_X86_PPRO_FENCE
+# define smp_rmb() rmb()
+#else
+# define smp_rmb() barrier()
+#endif
 #ifdef CONFIG_X86_OOSTORE
 # define smp_wmb() wmb()
 #else
Index: linux-2.6/include/asm-x86_64/system.h
===
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ linux-2.6/include/asm-x86_64/system.h
@@ -141,8 +141,8 @@ static inline void write_cr8(unsigned lo
 
 #ifdef CONFIG_SMP
 #define smp_mb()   mb()
-#define smp_rmb()  rmb()
-#define smp_wmb()  wmb()
+#define smp_rmb()  barrier()
+#define smp_wmb()  barrier()
 #define smp_read_barrier_depends() do {} while(0)
 #else
 #define smp_mb()   barrier()
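
The ordering guarantee this patch relies on can be illustrated with a portable message-passing sketch. It uses C11 atomics rather than the kernel macros, so treat it as an analogy under assumptions: struct mailbox and both function names are hypothetical, the release store plays the role of a store followed by smp_wmb(), and the acquire load plays the role of a load followed by smp_rmb().

```c
#include <stdatomic.h>

struct mailbox {
    int        payload;  /* plain data, written before the flag */
    atomic_int ready;    /* 0 = empty, 1 = payload is valid */
};

/* Writer: payload first, then the flag. The release ordering keeps the
 * payload store from being reordered after the flag store. */
static void mailbox_publish(struct mailbox *m, int value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Reader: flag first, then the payload. The acquire ordering keeps the
 * payload load from being reordered before the flag load. */
static int mailbox_try_read(struct mailbox *m, int *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return 0;
    *out = m->payload;
    return 1;
}
```

On x86, compilers typically emit plain mov instructions for both the release store and the acquire load, with no fence instruction. That matches the claim above that a compiler barrier is enough for cacheable memory.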

[rfc][patch 1/3] x86_64: fence nontemporal stores

2007-10-03 Thread Nick Piggin

Hi,

Here's a couple of patches to improve the memory barrier situation on x86.
They probably aren't going upstream until after the x86 merge, however I'm
posting them here for RFC, and in case anybody wants to backport into stable
trees.

---
movnt* instructions are not strongly ordered with respect to other stores,
so if we are to assume stores are strongly ordered in the rest of the x86_64
kernel, we must fence these off (see similar examples in i386 kernel).

[ The AMD memory ordering document seems to say that nontemporal stores can
  also pass earlier regular stores, so maybe we need sfences _before_ movnt*
  everywhere too? ]

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/arch/x86_64/lib/copy_user_nocache.S
===
--- linux-2.6.orig/arch/x86_64/lib/copy_user_nocache.S
+++ linux-2.6/arch/x86_64/lib/copy_user_nocache.S
@@ -117,6 +117,7 @@ ENTRY(__copy_user_nocache)
popq %rbx
CFI_ADJUST_CFA_OFFSET -8
CFI_RESTORE rbx
+   sfence
ret
CFI_RESTORE_STATE
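
The same pattern can be sketched in userspace C with compiler intrinsics. This is not the kernel routine: copy_words_nocache() is a hypothetical helper, and it assumes an x86_64 build where _mm_stream_si32() (movnti) and _mm_sfence() from <emmintrin.h> are available. On other targets it falls back to ordinary stores, which need no sfence.

```c
#include <stddef.h>

#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_stream_si32 (movnti), _mm_sfence */
#define USE_MOVNT 1
#else
#define USE_MOVNT 0
#endif

/* Copy n ints with nontemporal stores, then fence once at the end so the
 * copy is ordered before any store the caller issues afterwards. */
static void copy_words_nocache(int *dst, const int *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
#if USE_MOVNT
        _mm_stream_si32(&dst[i], src[i]);
#else
        dst[i] = src[i];
#endif
    }
#if USE_MOVNT
    _mm_sfence();
#endif
}
```

A single fence after the loop is the design choice the patch makes as well: the individual movnt stores may complete in any order, and only the boundary with later stores has to be enforced.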
 

Re: + add-documentation-w1w1-masters-00-index.patch added to -mm tree

2007-10-03 Thread Randy Dunlap


Rob Landley wrote:

+   - The Maixm/Dallas Semiconductor DS2490 builds USB <-> W1 bridges.

  Maxim (2 times)


That typo was cut and paste from the "Description" section of both files.  
(Lines 18 and 13, respectively.)  :(


Attached is an updated version that spells it "maxim" and also fixes the typos 
in the source files, if that helps...



Was this patch posted to a mailing list?  if so, which one?
I didn't see it.


LKML on Saturday.
http://lkml.org/lkml/2007/9/29/168

My pending patches are all at http://landley.net/kdocs/make/patches although 
I'm waiting for the current batch to work through before posting more.



Thanks.  Looks good.


From: Rob Landley <[EMAIL PROTECTED]>

Two 00-INDEX files under Documentation/w1 plus typo fixes.

Signed-off-by: Rob Landley <[EMAIL PROTECTED]>
---

 Documentation/w1/masters/ds2482 |2 +-
 Documentation/w1/masters/ds2490 |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff -r de183081194a Documentation/w1/masters/ds2482
--- a/Documentation/w1/masters/ds2482   Tue Oct 02 18:00:56 2007 +
+++ b/Documentation/w1/masters/ds2482   Wed Oct 03 20:28:05 2007 -0500
@@ -15,7 +15,7 @@ Description
 Description
 ---
 
-The Maixm/Dallas Semiconductor DS2482 is a I2C device that provides
+The Maxim/Dallas Semiconductor DS2482 is a I2C device that provides
 one (DS2482-100) or eight (DS2482-800) 1-wire busses.
 
 
diff -r de183081194a Documentation/w1/masters/ds2490
--- a/Documentation/w1/masters/ds2490   Tue Oct 02 18:00:56 2007 +
+++ b/Documentation/w1/masters/ds2490   Wed Oct 03 20:28:05 2007 -0500
@@ -10,7 +10,7 @@ Description
 Description
 ---
 
-The Maixm/Dallas Semiconductor DS2490 is a chip
+The Maxim/Dallas Semiconductor DS2490 is a chip
 which allows to build USB <-> W1 bridges.
 
 DS9490(R) is a USB <-> W1 bus master device

--- /dev/null   2007-04-23 10:59:00.0 -0500
+++ hg/Documentation/w1/00-INDEX2007-10-03 20:26:38.0 -0500
@@ -0,0 +1,8 @@
+00-INDEX
+   - This file
+masters/
+   - Individual chips providing 1-wire busses.
+w1.generic
+   - The 1-wire (w1) bus
+w1.netlink
+   - Userspace communication protocol over connector [1].
--- /dev/null   2007-04-23 10:59:00.0 -0500
+++ hg/Documentation/w1/masters/00-INDEX   2007-10-03 20:26:55.0 -0500
@@ -0,0 +1,6 @@
+00-INDEX
+   - This file
+ds2482
+   - The Maxim/Dallas Semiconductor DS2482 provides 1-wire busses.
+ds2490
+   - The Maxim/Dallas Semiconductor DS2490 builds USB <-> W1 bridges.



--
~Randy

Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-03 Thread David Chinner

On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote:
> On Wed, Oct 03, 2007 at 12:41:19PM +1000, David Chinner wrote:
> > On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote:
> > > On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> > > > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > > > >   wbc.pages_skipped = 0;
> > > > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > > > >   min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > > > >   if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > > > >   /* Wrote less than expected */
> > > > > - congestion_wait(WRITE, HZ/10);
> > > > > - if (!wbc.encountered_congestion)
> > > > > + if (wbc.encountered_congestion || wbc.more_io)
> > > > > + congestion_wait(WRITE, HZ/10);
> > > > > + else
> > > > >   break;
> > > > >   }
> > > > 
> > > > Why do you call congestion_wait() if there is more I/O to issue?  If
> > > > we have a fast filesystem, this might cause the device queues to
> > > > fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> > > > will have trouble keeping the queues full, right?
> > > 
> > > You mean slow writers and fast RAID? That would be exactly the case
> > > these patches try to improve.
> > 
> > I mean any writers and a fast block device (raid or otherwise).
> > 
> > > This patchset makes kupdate/background writeback more responsible,
> > > so that if (avg-write-speed < device-capabilities), the dirty data are
> > > synced timely, and we don't have to go for balance_dirty_pages().
> > 
> > Sure, but I'm asking about the effect of the patches on the
> > (avg-write-speed == device-capabilities) case. I agree that
> > they are necessary for timely syncing of data but I'm trying
> > to understand what effect they have on the normal write case
> 
> > (i.e. keeping the disk at full write throughput).
> 
> OK, I guess it is the focus of all your questions: Why should we sleep
> in congestion_wait() and possibly hurt the write throughput? I'll try
> to summary it:
> 
> - congestion_wait() is necessary
> Besides device congestions, there may be other blockades we have to
> wait on, e.g. temporary page locks, NFS/journal issues(I guess).

We skip locked pages in writeback, and if some filesystems have
blocking issues that require non-blocking writeback waits for some
I/O to complete before re-entering writeback, then perhaps they should be
setting wbc->encountered_congestion to tell writeback to back off.

The question I'm asking is that if more_io tells us we have more
work to do, why do we have to sleep first if the block dev is
able to take more I/O?
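
To make the question concrete, here is a small model of the two loop policies. It is a sketch under assumptions, not kernel code: struct wbc_state mirrors only the writeback_control fields named in this thread, and the enum and both function names are hypothetical. The second function is just one reading of the alternative being asked about.

```c
#include <stdbool.h>

/* Hypothetical mirror of the writeback_control fields discussed here. */
struct wbc_state {
    long nr_to_write;
    long pages_skipped;
    bool encountered_congestion;
    bool more_io;
};

enum wb_action { WB_CONTINUE, WB_WAIT, WB_STOP };

/* Policy in the patch under review: sleep on congestion or on more_io. */
static enum wb_action wb_next_patched(const struct wbc_state *w)
{
    if (w->nr_to_write > 0 || w->pages_skipped > 0) {
        if (w->encountered_congestion || w->more_io)
            return WB_WAIT;
        return WB_STOP;
    }
    return WB_CONTINUE;
}

/* Alternative: sleep only on congestion; with more_io alone, go straight
 * back around the loop and issue more I/O. */
static enum wb_action wb_next_no_sleep_on_more_io(const struct wbc_state *w)
{
    if (w->nr_to_write > 0 || w->pages_skipped > 0) {
        if (w->encountered_congestion)
            return WB_WAIT;
        if (w->more_io)
            return WB_CONTINUE;
        return WB_STOP;
    }
    return WB_CONTINUE;
}
```

The two policies differ in exactly one case: more_io set without congestion, where the first sleeps in congestion_wait() and the second keeps writing.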

> 
> - congestion_wait() is called only when necessary
> congestion_wait() will only be called we saw blockades:
> if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> congestion_wait(WRITE, HZ/10);
> }
> So in normal case, it may well write 128MB data without any waiting.

Sure, but wbc.more_io doesn't indicate a blockade - just that there
is more work to do, right?

> - congestion_wait() won't hurt write throughput
> When not congested, congestion_wait() will be wake up on each write
> completion.

What happens if the I/O we issued has already completed before we
got back up to the congestion_wait() call? We'll spend 100ms
sleeping when we shouldn't have and throughput goes down by 10% on
every occurrence.

If we've got more work to do, then we should do it without an
arbitrary, non-deterministic delay being inserted. If the delay is
needed to prevent the system from "going mad" (whatever that means),
then what's the explanation for the system "going mad"?

> Note that MAX_WRITEBACK_PAGES=1024 and
> /sys/block/sda/queue/max_sectors_kb=512(for me),
> which means we are gave the chance to sync 4MB on every 512KB written,
> which means we are able to submit write IOs 8 times faster than the
> device capability. congestion_wait() is a magical timer :-)

So, with Jens Axboe's sglist chaining, that single I/O could now
be up to 32MB on some hardware. IOWs, we push 1024 pages, and that
could end up as a single I/O being issued to disk.

Your magic just broke. :/
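
The arithmetic behind both figures, as a sketch. PAGE_KB = 4 is an assumption (4 KiB pages, which is what makes 1024 pages equal 4MB); the function names are hypothetical.

```c
/* All sizes in KiB. PAGE_KB = 4 is an assumption (4 KiB pages). */
enum { PAGE_KB = 4, MAX_WRITEBACK_PAGES = 1024 };

/* KiB handed to the block layer by one writeback pass. */
static long batch_kb(void)
{
    return (long)MAX_WRITEBACK_PAGES * PAGE_KB;
}

/* Number of I/Os of at most max_io_kb that one batch splits into,
 * rounded up. */
static long ios_per_batch(long max_io_kb)
{
    return (batch_kb() + max_io_kb - 1) / max_io_kb;
}
```

With max_sectors_kb=512 one batch is ios_per_batch(512) = 8 I/Os, which is the 8x headroom quoted above. With a 32MB limit, ios_per_batch(32768) = 1: the whole batch fits in a single request, so there is no longer a stream of completions to wake congestion_wait() early.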

> > > So for your question of queue depth, the answer is: the queue length
> > > will not build up in the first place. 
> > 
> > Which queue are you talking about here? The block device queue?
> 
> Yes, the elevator's queues.

I think this is the wrong thing to be doing and is detrimental
to I/O performance because it will reduce elevator efficiency.

The elevator can only work efficiently if we allow the queues to
build up. The deeper the queue, the better the elevator can sort the
I/O requests and keep the device at maximum efficiency.  If we don't
push enough I/O into the queues then we miss opportunities to combine
Re: [14/18] Configure stack size

2007-10-03 Thread David Miller

From: Arjan van de Ven <[EMAIL PROTECTED]>
Date: Wed, 3 Oct 2007 21:36:31 -0700

> there is still code that does DMA from and to the stack
> how would this work with virtual allocated stack?

That's a bug and must be fixed.

There honestly shouldn't be that many examples around.

FWIW, there are platforms using a virtually allocated kernel stack
already.

Re: [14/18] Configure stack size

2007-10-03 Thread Arjan van de Ven

On Wed, 03 Oct 2007 20:59:49 -0700
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> Make the stack size configurable now that we can fallback to vmalloc
> if necessary. SGI NUMA configurations may need more stack because
> cpumasks and nodemasks are at times kept on the stack. With the
> coming 16k cpu support this is going to be 2k just for the mask. This
> patch allows to run with 16k or 32k kernel stacks on x86_64.

there is still code that does DMA from and to the stack
how would this work with virtual allocated stack?
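
For scale, the "2k just for the mask" figure in the quoted description follows from one bit per CPU. A sketch of that arithmetic, with cpumask_bytes() as a hypothetical helper rather than a kernel function:

```c
#include <stddef.h>
#include <limits.h>

/* Bytes needed for a bitmap with one bit per CPU, rounded up to a whole
 * number of unsigned longs (the usual bitmap layout). */
static size_t cpumask_bytes(size_t nr_cpus)
{
    size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;
    size_t longs = (nr_cpus + bits_per_long - 1) / bits_per_long;

    return longs * sizeof(unsigned long);
}
```

cpumask_bytes(16384) is 2048, so a single on-stack mask would take a quarter of an 8k stack.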

[PATCH 2/3] Sysace: sparse fixes

2007-10-03 Thread Grant Likely

From: Grant Likely <[EMAIL PROTECTED]>

Signed-off-by: Grant Likely <[EMAIL PROTECTED]>
---

 drivers/block/xsysace.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c
index 3847464..5b73471 100644
--- a/drivers/block/xsysace.c
+++ b/drivers/block/xsysace.c
@@ -195,7 +195,7 @@ struct ace_device {
 
/* Details of hardware device */
unsigned long physaddr;
-   void *baseaddr;
+   void __iomem *baseaddr;
int irq;
int bus_width;  /* 0 := 8 bit; 1 := 16 bit */
struct ace_reg_ops *reg_ops;
@@ -227,20 +227,20 @@ struct ace_reg_ops {
 /* 8 Bit bus width */
 static u16 ace_in_8(struct ace_device *ace, int reg)
 {
-   void *r = ace->baseaddr + reg;
+   void __iomem *r = ace->baseaddr + reg;
return in_8(r) | (in_8(r + 1) << 8);
 }
 
 static void ace_out_8(struct ace_device *ace, int reg, u16 val)
 {
-   void *r = ace->baseaddr + reg;
+   void __iomem *r = ace->baseaddr + reg;
out_8(r, val);
out_8(r + 1, val >> 8);
 }
 
 static void ace_datain_8(struct ace_device *ace)
 {
-   void *r = ace->baseaddr + 0x40;
+   void __iomem *r = ace->baseaddr + 0x40;
u8 *dst = ace->data_ptr;
int i = ACE_FIFO_SIZE;
while (i--)
@@ -250,7 +250,7 @@ static void ace_datain_8(struct ace_device *ace)
 
 static void ace_dataout_8(struct ace_device *ace)
 {
-   void *r = ace->baseaddr + 0x40;
+   void __iomem *r = ace->baseaddr + 0x40;
u8 *src = ace->data_ptr;
int i = ACE_FIFO_SIZE;
while (i--)
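
For context on what the annotation does, a minimal sketch that builds outside the kernel. The __iomem definition mirrors how the kernel headers define it when sparse runs (__CHECKER__); for an ordinary compiler it expands to nothing. fake_in_8() and read16_le() are hypothetical stand-ins for in_8() and ace_in_8(), and the "registers" are just an array in normal memory.

```c
#include <stdint.h>

/* Only sparse sees the attribute; a normal compiler sees an empty macro. */
#ifdef __CHECKER__
# define __iomem __attribute__((noderef, address_space(2)))
#else
# define __iomem
#endif

/* Hypothetical stand-in for in_8(): a plain volatile byte read.
 * (Real accessors hide the dereference from sparse; this one does not.) */
static uint8_t fake_in_8(const volatile uint8_t __iomem *addr)
{
    return *addr;
}

/* Same shape as ace_in_8(): low byte at reg, high byte at reg + 1. */
static uint16_t read16_le(const volatile uint8_t __iomem *base, int reg)
{
    const volatile uint8_t __iomem *r = base + reg;

    return (uint16_t)(fake_in_8(r) | (fake_in_8(r + 1) << 8));
}
```

Because the annotation generates no code, the patch changes nothing at runtime; it only lets sparse flag places where an I/O pointer is mixed with an ordinary one.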


Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-03 Thread Balbir Singh

Hugh Dickins wrote:
> On Wed, 3 Oct 2007, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> Sorry, Balbir, I've failed to get back to you, still attending to
>>> priorities.  Let me briefly summarize my issue with the mem controller:
>>> you've not yet given enough attention to swap.
>> I am open to suggestions and ways and means of making swap control
>> complete and more usable.
> 
> Well, swap control is another subject.  I guess for that you'll need
> to track which cgroup each swap page belongs to (rather more expensive
> than the current swap_map of unsigned shorts).  And I doubt it'll be
> swap control as such that's required, but control of rss+swap.
> 

I see what you mean now; other people have recommended a per-cgroup
swap file/device.

> But here I'm just worrying about how the existence of swap makes
> something of a nonsense of your rss control.
> 

Ideally, pages would not reside for too long in swap cache (unless
I've misunderstood swap cache or there are special cases for tmpfs/
ramfs). Once pages have been swapped back in, they get assigned
back to their respective cgroups in do_swap_page() (where we charge
them back to the cgroup).

The swap cache pages will be the first ones to go, once the cgroup
exceeds its limit.

There might be gaps in my understanding or I might be missing a use
case scenario, where things work differently.

>>> I accept that full swap control is something you're intending to add
>>> incrementally later; but the current state doesn't make sense to me.
>>>
>>> The problems are swapoff and swapin readahead.  These pull pages into
>>> the swap cache, which are assigned to the cgroup (or the whatever-we-
>>> call-the-remainder-outside-all-the-cgroups) which is running swapoff
>  ^
> I'd appreciate it if you'd teach me the right name for that!
> 

In the past people have used names like default cgroup, we could use
the root cgroup as the default cgroup.

>>> or faulting in its own page; yet they very clearly don't (in general)
>>> belong to that cgroup, but to other cgroups which will be discovered
>>> later.
>> I understand what your trying to say, but with several approaches that
>> we tried in the past, we found caches the hardest to most accurately
>> account. IIRC, with readahead, we don't even know if all the pages
>> readahead will be used, that's why we charge everything to the cgroup
>> that added the page to the cache.
> 
> Yes, readahead is anyway problematic.  My guess is that in the file
> cache case, you'll tend not to go too far wrong by charging to the
> one that added - though we're all aware that's fairly unsatisfactory.
> 
> My point is that in the swap cache case, it's badly wrong: there's
> no page more obviously owned by a cgroup than its anonymous pages
> (forgetting for a moment that minority shared between cgroups
> until copy-on-write), so it's very wrong for swapin readahead
> or swapoff to go charging those to another or to no cgroup.
> 
> Imagine a cgroup at its rss limit, with more out on swap.  Then
> another cgroup does some swap readahead, bringing pages private
> to the first into cache.  Or runs swapoff which actually plugs
> them into the rss of the first cgroup, so it goes over limit.
> 
> Those are pages we'd want to swap out when the first cgroup
> faults to go further over its limit; but they're now not even
> identified as belonging to the right cgroup, so won't be found.
> 

Won't the right cgroup assignment happen as discussed above?

>>> I did try removing the cgroup mods to mm/swap_state.c, so swap pages
>>> get assigned to a cgroup only once it's really known; but that's not
>>> enough by itself, because cgroup RSS reclaim doesn't touch those
>>> pages, so the cgroup can easily OOM much too soon.  I was thinking
>>> that you need a "limbo" cgroup for these pages, which can be attacked
>>> for reclaim along with any cgroup being reclaimed, but from which
>>> pages are readily migrated to their real cgroup once that's known.
>>>
>> Is migrating the charge to the real cgroup really required?
> 
> My answer is definitely yes.  I'm not suggesting that you need
> general migration between cgroups at this stage (something for
> later quite likely); but I am suggesting you need one pseudo-cgroup
> to hold these cases temporarily, and that you cannot properly track
> rss without it (if there is any swap).
> 

If what I understood and discussed earlier is correct, then we don't need
to go this route. But I think the idea of having a pseudo cgroup
is interesting (needs more thought).

>>> So in the current memory controller, that unuse_pte mem charge I was
>>> originally worried about failing (I hadn't at that point delved in
>>> to see how it tries to reclaim) actually never fails (and never
>>> does anything): the page is already assigned to some cgroup-or-
>>> whatever and is never charged to vma->vm_mm at that point.
>>>
>> Excellent!
> 
> Umm, please explain what's excellent about that.
>

[PATCH 3/3] Sysace: Don't enable IRQ until after interrupt handler is registered

2007-10-03 Thread Grant Likely

From: Grant Likely <[EMAIL PROTECTED]>

The previous patch to move the interrupt handler registration moved it
below enabling interrupts which could be a problem if the device is on
a shared interrupt line.  This patch fixes the order.

Signed-off-by: Grant Likely <[EMAIL PROTECTED]>
---

 drivers/block/xsysace.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c
index 5b73471..9e7652d 100644
--- a/drivers/block/xsysace.c
+++ b/drivers/block/xsysace.c
@@ -1005,11 +1005,6 @@ static int __devinit ace_setup(struct ace_device *ace)
ace_out(ace, ACE_CTRL, ACE_CTRL_FORCECFGMODE |
ACE_CTRL_DATABUFRDYIRQ | ACE_CTRL_ERRORIRQ);
 
-   /* Enable interrupts */
-   val = ace_in(ace, ACE_CTRL);
-   val |= ACE_CTRL_DATABUFRDYIRQ | ACE_CTRL_ERRORIRQ;
-   ace_out(ace, ACE_CTRL, val);
-
/* Now we can hook up the irq handler */
if (ace->irq != NO_IRQ) {
rc = request_irq(ace->irq, ace_interrupt, 0, "systemace", ace);
@@ -1020,6 +1015,11 @@ static int __devinit ace_setup(struct ace_device *ace)
}
}
 
+   /* Enable interrupts */
+   val = ace_in(ace, ACE_CTRL);
+   val |= ACE_CTRL_DATABUFRDYIRQ | ACE_CTRL_ERRORIRQ;
+   ace_out(ace, ACE_CTRL, val);
+
/* Print the identification */
dev_info(ace->dev, "Xilinx SystemACE revision %i.%i.%i\n",
 (version >> 12) & 0xf, (version >> 8) & 0x0f, version & 0xff);
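
Why the order matters can be shown with a toy model. Everything here is hypothetical (struct fake_dev and the fake_* helpers are not part of the driver); it only models "device may raise interrupts" and "a handler is registered" as two independent switches.

```c
#include <stddef.h>

/* Hypothetical model of a device on an interrupt line. */
struct fake_dev {
    void (*handler)(struct fake_dev *);  /* NULL until "request_irq" */
    int irq_enabled;                     /* device may raise interrupts */
    int handled;                         /* interrupts seen by the handler */
    int lost;                            /* raised with nobody listening */
};

static void fake_handler(struct fake_dev *d) { d->handled++; }

static void fake_request_irq(struct fake_dev *d) { d->handler = fake_handler; }
static void fake_enable_irq(struct fake_dev *d)  { d->irq_enabled = 1; }

/* The device raising its interrupt. */
static void fake_raise(struct fake_dev *d)
{
    if (!d->irq_enabled)
        return;
    if (d->handler)
        d->handler(d);
    else
        d->lost++;
}
```

If fake_enable_irq() runs first and the device fires before fake_request_irq(), the event lands in lost. With the order this patch restores (register, then enable) there is no such window.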


[PATCH 1/3] Sysace: Minor coding convention fixup

2007-10-03 Thread Grant Likely

From: Grant Likely <[EMAIL PROTECTED]>

Put function call and return code test on separate lines.

Signed-off-by: Grant Likely <[EMAIL PROTECTED]>
---

 drivers/block/xsysace.c |9 ++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c
index 3ea172b..3847464 100644
--- a/drivers/block/xsysace.c
+++ b/drivers/block/xsysace.c
@@ -1091,7 +1091,8 @@ ace_alloc(struct device *dev, int id, unsigned long physaddr,
ace->bus_width = bus_width;
 
/* Call the setup code */
-   if ((rc = ace_setup(ace)) != 0)
+   rc = ace_setup(ace);
+   if (rc)
goto err_setup;
 
dev_set_drvdata(dev, ace);
@@ -1253,11 +1254,13 @@ static int __init ace_init(void)
goto err_blk;
}
 
-   if ((rc = ace_of_register()) != 0)
+   rc = ace_of_register();
+   if (rc)
goto err_of;
 
pr_debug("xsysace: registering platform binding\n");
-   if ((rc = platform_driver_register(&ace_platform_driver)) != 0)
+   rc = platform_driver_register(&ace_platform_driver);
+   if (rc)
goto err_plat;
 
pr_info("Xilinx SystemACE device driver, major=%i\n", ace_major);


[15/18] Fallback for temporary order 2 allocation

2007-10-03 Thread Christoph Lameter

The crypto subsystem needs an order 2 allocation. This is a temporary buffer
for xoring data so we can safely allow fallback.

Cc: Dan Williams <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 crypto/xor.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/crypto/xor.c
===
--- linux-2.6.orig/crypto/xor.c 2007-10-03 18:11:20.0 -0700
+++ linux-2.6/crypto/xor.c  2007-10-03 18:12:14.0 -0700
@@ -101,7 +101,7 @@ calibrate_xor_blocks(void)
void *b1, *b2;
struct xor_block_template *f, *fastest;
 
-   b1 = (void *) __get_free_pages(GFP_KERNEL, 2);
+   b1 = (void *) __get_free_pages(GFP_VFALLBACK, 2);
if (!b1) {
printk(KERN_WARNING "xor: Yikes!  No memory available.\n");
return -ENOMEM;

-- 

[PATCH 0/3] Fixups to SystemACE driver

2007-10-03 Thread Grant Likely

Jens,

Here are some more Sysace patches based on comments received on the
first series and a run through sparse.  Can you please queue them up
for 2.6.24?

Thanks,
g.

--
Grant Likely, B.Sc. P.Eng.
Secret Lab Technologies Ltd.

[11/18] Page allocator: Use a higher order allocation for the zone wait table.

2007-10-03 Thread Christoph Lameter

Currently vmalloc is used for the zone wait table. Therefore the vmalloc
page tables have to be consulted by the MMU to access the wait table.
We can now use GFP_VFALLBACK to attempt the use of a physically contiguous
page that can then use the large kernel TLBs.

Drawback: The zone wait table is rounded up to the next power of two which
may cost some memory.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/page_alloc.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-10-03 18:07:16.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-10-03 18:07:20.0 -0700
@@ -2585,7 +2585,9 @@ int zone_wait_table_init(struct zone *zo
 * To use this new node's memory, further consideration will be
 * necessary.
 */
-   zone->wait_table = (wait_queue_head_t *)vmalloc(alloc_size);
+   zone->wait_table = (wait_queue_head_t *)
+   __get_free_pages(GFP_VFALLBACK,
+   get_order(alloc_size));
}
if (!zone->wait_table)
return -ENOMEM;
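
The memory cost named under "Drawback" can be worked out with a small userspace sketch. This is not part of the patch: toy_get_order() only imitates what the kernel's get_order() computes, and the 4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed page size, not taken from the patch */

/* Smallest order such that (TOY_PAGE_SIZE << order) >= size. */
static unsigned int toy_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = TOY_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Bytes allocated but unused once size is rounded up to 2^order pages. */
static size_t toy_rounding_waste(size_t size)
{
	return (TOY_PAGE_SIZE << toy_get_order(size)) - size;
}
```

Under these assumptions a table that needs five pages is served by an order 3 allocation (eight pages), so three pages go unused. A table whose size is already a power-of-two number of pages wastes nothing.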

-- 

[18/18] SLUB: Use fallback for table of callers/freers of a slab cache

2007-10-03 Thread Christoph Lameter

The caller table can get quite large if there are many call sites for a
particular slab. Adding __GFP_VFALLBACK allows falling back to vmalloc in
case the caller table gets too big and memory is fragmented. Currently we
would fail the operation.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/slub.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-10-03 20:00:23.0 -0700
+++ linux-2.6/mm/slub.c 2007-10-03 20:01:12.0 -0700
@@ -3003,7 +3003,8 @@ static int alloc_loc_track(struct loc_tr
 
order = get_order(sizeof(struct location) * max);
 
-   l = (void *)__get_free_pages(flags, order);
+   l = (void *)__get_free_pages(flags | __GFP_COMP | __GFP_VFALLBACK,
+   order);
if (!l)
return 0;
 

-- 

[12/18] Wait: Allow bit_waitqueue to wait on a bit in a virtual compound page

2007-10-03 Thread Christoph Lameter

If bit_waitqueue() is passed a virtual address then it must use
virt_to_head_page() instead of virt_to_page().

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 kernel/wait.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/kernel/wait.c
===
--- linux-2.6.orig/kernel/wait.c2007-10-03 17:44:21.0 -0700
+++ linux-2.6/kernel/wait.c 2007-10-03 17:53:07.0 -0700
@@ -245,7 +245,7 @@ EXPORT_SYMBOL(wake_up_bit);
 fastcall wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
const int shift = BITS_PER_LONG == 32 ? 5 : 6;
-   const struct zone *zone = page_zone(virt_to_page(word));
+   const struct zone *zone = page_zone(virt_to_head_page(word));
unsigned long val = (unsigned long)word << shift | bit;
 
return &zone->wait_table[hash_long(val, zone->wait_table_bits)];

-- 

[16/18] Virtual Compound page allocation from interrupt context.

2007-10-03 Thread Christoph Lameter

In an interrupt context we cannot wait for the vmlist_lock in
__get_vm_area_node(). So use a trylock instead. If the trylock fails
then the atomic allocation will fail and subsequently be retried.

This only works because flush_cache_vunmap(), which is used on the
allocation path, never performs any IPIs, in contrast to the flush_tlb_...
functions used for freeing. flush_cache_vunmap() is only used on
architectures with a virtually mapped cache (xtensa, PA-RISC).

[Note: Nick Piggin is working on a scheme to make this simpler by
no longer requiring flushes]

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/vmalloc.c |   10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-10-03 16:21:10.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-10-03 16:25:17.0 -0700
@@ -177,7 +177,6 @@ static struct vm_struct *__get_vm_area_n
unsigned long align = 1;
unsigned long addr;
 
-   BUG_ON(in_interrupt());
if (flags & VM_IOREMAP) {
int bit = fls(size);
 
@@ -202,7 +201,14 @@ static struct vm_struct *__get_vm_area_n
 */
size += PAGE_SIZE;
 
-   write_lock(&vmlist_lock);
+   if (gfp_mask & __GFP_WAIT)
+   write_lock(&vmlist_lock);
+   else {
+   if (!write_trylock(&vmlist_lock)) {
+   kfree(area);
+   return NULL;
+   }
+   }
for (p = &vmlist; (tmp = *p) != NULL ;p = &tmp->next) {
if ((unsigned long)tmp->addr < addr) {
if((unsigned long)tmp->addr + tmp->size >= addr)
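
The lock-or-trylock decision described in this patch can be modelled in userspace. The sketch below is not the kernel code: struct toy_lock, toy_reserve_area() and the may_wait flag are hypothetical stand-ins for vmlist_lock, __get_vm_area_node() and the __GFP_WAIT test, and a C11 atomic_flag stands in for the rwlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock; stands in for vmlist_lock (assumption). */
struct toy_lock {
	atomic_flag held;
};

static void toy_lock_init(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/* Returns true if the lock was free and is now held by the caller. */
static bool toy_trylock(struct toy_lock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_lock_wait(struct toy_lock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin until the holder releases the lock */
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Callers that may wait take the lock unconditionally. Callers that must
 * not wait only try once and report failure so the caller can retry later.
 * Returns 0 on success, -1 if the lock was busy and waiting is forbidden.
 */
static int toy_reserve_area(struct toy_lock *l, bool may_wait, int *nr_areas)
{
	if (may_wait)
		toy_lock_wait(l);
	else if (!toy_trylock(l))
		return -1;

	(*nr_areas)++;		/* the protected list update */
	toy_unlock(l);
	return 0;
}
```

The design point is that the non-waiting path never blocks: a busy lock turns into an ordinary allocation failure, which the caller already has to handle.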

-- 

[17/18] Virtual compound page freeing in interrupt context

2007-10-03 Thread Christoph Lameter

If we are in an interrupt context then simply defer the free via a workqueue.

Removing a virtual mapping *must* be done with interrupts enabled
since tlb_xx functions are called that rely on interrupts for
processor to processor communications.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/page_alloc.c |   12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-10-03 20:00:37.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-10-03 20:01:09.0 -0700
@@ -294,10 +294,20 @@ static void __free_vcompound(void *addr)
kfree(pages);
 }
 
+static void vcompound_free_work(struct work_struct *w)
+{
+   __free_vcompound((void *)w);
+}
 
 static void free_vcompound(void *addr)
 {
-   __free_vcompound(addr);
+   struct work_struct *w = addr;
+
+   if (irqs_disabled() || in_interrupt()) {
+   INIT_WORK(w, vcompound_free_work);
+   schedule_work(w);
+   } else
+   __free_vcompound(w);
 }
 
 static void free_compound_page(struct page *page)
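
The deferral idea can be sketched in userspace as follows. This is only a model under stated assumptions: a singly linked pending list replaces the kernel workqueue, the atomic_context flag replaces the irqs_disabled() || in_interrupt() test, and all toy_* names are hypothetical. As in the patch, the memory being freed is reused to hold the bookkeeping node, so no allocation is needed on the deferred path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Bookkeeping node stored inside the block that is about to be freed. */
struct toy_work {
	struct toy_work *next;
};

static struct toy_work *toy_pending;	/* frees waiting for a safe context */
static int toy_freed;			/* number of blocks really released */

static void toy_free_now(void *addr)
{
	free(addr);
	toy_freed++;
}

/* Free immediately when that is safe, otherwise queue the block. */
static void toy_free(void *addr, bool atomic_context)
{
	if (atomic_context) {
		struct toy_work *w = addr;	/* block must hold a toy_work */

		w->next = toy_pending;
		toy_pending = w;
	} else {
		toy_free_now(addr);
	}
}

/* Called later from a context where freeing is allowed. */
static int toy_run_deferred(void)
{
	int n = 0;

	while (toy_pending) {
		struct toy_work *w = toy_pending;

		toy_pending = w->next;
		toy_free_now(w);
		n++;
	}
	return n;
}
```

Reusing the freed block only works if every block is at least sizeof(struct toy_work) bytes and suitably aligned, which holds for the page-sized areas the patch deals with.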

-- 

[09/18] Vcompound: GFP_VFALLBACK debugging aid

2007-10-03 Thread Christoph Lameter

Virtual fallbacks are rare and thus subtle bugs may creep in if we do not
test the fallbacks. CONFIG_VFALLBACK_ALWAYS makes all GFP_VFALLBACK
allocations fall back to virtual mapping.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 lib/Kconfig.debug |   11 +++
 mm/page_alloc.c   |6 ++
 2 files changed, 17 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-10-03 18:04:33.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-10-03 18:07:16.0 -0700
@@ -1257,6 +1257,12 @@ zonelist_scan:
}
}
 
+#ifdef CONFIG_VFALLBACK_ALWAYS
+   if ((gfp_mask & __GFP_VFALLBACK) &&
+   system_state == SYSTEM_RUNNING)
+   return alloc_vcompound(gfp_mask, order,
+   zonelist, alloc_flags);
+#endif
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
if (page)
break;
Index: linux-2.6/lib/Kconfig.debug
===
--- linux-2.6.orig/lib/Kconfig.debug2007-10-03 18:04:29.0 -0700
+++ linux-2.6/lib/Kconfig.debug 2007-10-03 18:07:16.0 -0700
@@ -105,6 +105,17 @@ config DETECT_SOFTLOCKUP
   can be detected via the NMI-watchdog, on platforms that
   support it.)
 
+config VFALLBACK_ALWAYS
+   bool "Always fall back to Virtual Compound pages"
+   default y
+   help
+ Virtual compound pages are only allocated if there is no linear
+ memory available. They are a fallback and errors created by the
+ use of virtual mappings instead of linear ones may not surface
+ because of their infrequent use. This option makes every
+ allocation that allows a fallback to a virtual mapping use
+ the virtual mapping. May have a significant performance impact.
+
 config SCHED_DEBUG
bool "Collect scheduler debugging info"
depends on DEBUG_KERNEL && PROC_FS

-- 

[10/18] Sparsemem: Use fallback for the memmap.

2007-10-03 Thread Christoph Lameter

Sparsemem currently attempts first to do a physically contiguous mapping
and then falls back to vmalloc. The same thing can now be accomplished
using GFP_VFALLBACK.

Cc: [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/sparse.c |   33 +++--
 1 file changed, 3 insertions(+), 30 deletions(-)

Index: linux-2.6/mm/sparse.c
===
--- linux-2.6.orig/mm/sparse.c  2007-10-02 22:02:58.0 -0700
+++ linux-2.6/mm/sparse.c   2007-10-02 22:19:58.0 -0700
@@ -269,40 +269,13 @@ void __init sparse_init(void)
 #ifdef CONFIG_MEMORY_HOTPLUG
 static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
 {
-   struct page *page, *ret;
-   unsigned long memmap_size = sizeof(struct page) * nr_pages;
-
-   page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
-   if (page)
-   goto got_map_page;
-
-   ret = vmalloc(memmap_size);
-   if (ret)
-   goto got_map_ptr;
-
-   return NULL;
-got_map_page:
-   ret = (struct page *)pfn_to_kaddr(page_to_pfn(page));
-got_map_ptr:
-   memset(ret, 0, memmap_size);
-
-   return ret;
-}
-
-static int vaddr_in_vmalloc_area(void *addr)
-{
-   if (addr >= (void *)VMALLOC_START &&
-   addr < (void *)VMALLOC_END)
-   return 1;
-   return 0;
+   return (struct page *)__get_free_pages(GFP_VFALLBACK,
+   get_order(sizeof(struct page) * nr_pages));
 }
 
 static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 {
-   if (vaddr_in_vmalloc_area(memmap))
-   vfree(memmap);
-   else
-   free_pages((unsigned long)memmap,
+   free_pages((unsigned long)memmap,
   get_order(sizeof(struct page) * nr_pages));
 }
 

-- 

[13/18] x86_64: Allow fallback for the stack

2007-10-03 Thread Christoph Lameter

Peter Zijlstra has recently demonstrated that we can have order 1 allocation
failures under memory pressure with small memory configurations. The
x86_64 stack has a size of 8k and thus requires an order 1 allocation.

This patch adds a virtual fallback capability for the stack. The system may
continue even in extreme situations and we may be able to increase the stack
size if necessary (see next patch).

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/asm-x86_64/thread_info.h |   16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

Index: linux-2.6/include/asm-x86_64/thread_info.h
===
--- linux-2.6.orig/include/asm-x86_64/thread_info.h 2007-10-03 14:49:48.0 -0700
+++ linux-2.6/include/asm-x86_64/thread_info.h  2007-10-03 14:51:00.0 -0700
@@ -74,20 +74,14 @@ static inline struct thread_info *stack_
 
 /* thread information allocation */
 #ifdef CONFIG_DEBUG_STACK_USAGE
-#define alloc_thread_info(tsk) \
-({ \
-   struct thread_info *ret;\
-   \
-   ret = ((struct thread_info *) __get_free_pages(GFP_KERNEL,THREAD_ORDER)); \
-   if (ret)\
-   memset(ret, 0, THREAD_SIZE);\
-   ret;\
-})
+#define THREAD_FLAGS (GFP_VFALLBACK | __GFP_ZERO)
 #else
-#define alloc_thread_info(tsk) \
-   ((struct thread_info *) __get_free_pages(GFP_KERNEL,THREAD_ORDER))
+#define THREAD_FLAGS GFP_VFALLBACK
 #endif
 
+#define alloc_thread_info(tsk) \
+   ((struct thread_info *) __get_free_pages(THREAD_FLAGS, THREAD_ORDER))
+
 #define free_thread_info(ti) free_pages((unsigned long) (ti), THREAD_ORDER)
 
 #else /* !__ASSEMBLY__ */

-- 

[14/18] Configure stack size

2007-10-03 Thread Christoph Lameter

Make the stack size configurable now that we can fall back to vmalloc if
necessary. SGI NUMA configurations may need more stack because cpumasks
and nodemasks are at times kept on the stack. With the coming 16k cpu
support this is going to be 2k just for the mask. This patch allows
running with 16k or 32k kernel stacks on x86_64.

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 arch/x86_64/Kconfig  |6 ++
 include/asm-x86_64/page.h|3 +--
 include/asm-x86_64/thread_info.h |4 ++--
 3 files changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86_64/Kconfig
===
--- linux-2.6.orig/arch/x86_64/Kconfig  2007-10-03 18:11:20.0 -0700
+++ linux-2.6/arch/x86_64/Kconfig   2007-10-03 18:12:13.0 -0700
@@ -363,6 +363,12 @@ config NODES_SHIFT
default "6"
depends on NEED_MULTIPLE_NODES
 
+config THREAD_ORDER
+   int "Kernel stack size (in page order)"
+   default "1"
+   help
+ Page order for the thread stack.
+
 # Dummy CONFIG option to select ACPI_NUMA from drivers/acpi/Kconfig.
 
 config X86_64_ACPI_NUMA
Index: linux-2.6/include/asm-x86_64/page.h
===
--- linux-2.6.orig/include/asm-x86_64/page.h2007-10-03 18:11:20.0 -0700
+++ linux-2.6/include/asm-x86_64/page.h 2007-10-03 18:12:13.0 -0700
@@ -9,8 +9,7 @@
 #define PAGE_MASK  (~(PAGE_SIZE-1))
 #define PHYSICAL_PAGE_MASK (~(PAGE_SIZE-1) & __PHYSICAL_MASK)
 
-#define THREAD_ORDER 1 
-#define THREAD_SIZE  (PAGE_SIZE << THREAD_ORDER)
+#define THREAD_SIZE  (PAGE_SIZE << CONFIG_THREAD_ORDER)
 #define CURRENT_MASK (~(THREAD_SIZE-1))
 
 #define EXCEPTION_STACK_ORDER 0
Index: linux-2.6/include/asm-x86_64/thread_info.h
===
--- linux-2.6.orig/include/asm-x86_64/thread_info.h 2007-10-03 18:12:13.0 -0700
+++ linux-2.6/include/asm-x86_64/thread_info.h  2007-10-03 18:12:13.0 -0700
@@ -80,9 +80,9 @@ static inline struct thread_info *stack_
 #endif
 
 #define alloc_thread_info(tsk) \
-   ((struct thread_info *) __get_free_pages(THREAD_FLAGS, THREAD_ORDER))
+   ((struct thread_info *) __get_free_pages(THREAD_FLAGS, CONFIG_THREAD_ORDER))
 
-#define free_thread_info(ti) free_pages((unsigned long) (ti), THREAD_ORDER)
+#define free_thread_info(ti) free_pages((unsigned long) (ti), CONFIG_THREAD_ORDER)
 
 #else /* !__ASSEMBLY__ */
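
The sizes quoted in the description follow from simple arithmetic, sketched here in userspace. The 4096-byte page size and the toy_* helper names are assumptions made for illustration; the mask size is computed as one bit per CPU.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed x86_64 page size */

/* Mirrors THREAD_SIZE = PAGE_SIZE << CONFIG_THREAD_ORDER. */
static size_t toy_thread_size(unsigned int order)
{
	return TOY_PAGE_SIZE << order;
}

/* One bit per CPU, rounded up to whole bytes. */
static size_t toy_cpumask_bytes(size_t nr_cpus)
{
	return (nr_cpus + 7) / 8;
}
```

With these assumptions the default order 1 gives an 8k stack, order 2 gives 16k and order 3 gives 32k, while a mask for 16384 CPUs occupies 2048 bytes, a quarter of the default stack.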
 

-- 

[06/18] Vcompound: Update page address determination

2007-10-03 Thread Christoph Lameter

Make page_address() correctly determine the address of a potentially
virtually mapped compound page.

There are 3 cases to consider:

1. !HASHED_PAGE_VIRTUAL && !WANT_PAGE_VIRTUAL

Call vmalloc_address() directly from the page_address function
defined in mm.h.

2. HASHED_PAGE_VIRTUAL

Modify page_address() in highmem.c to call vmalloc_address().

3. WANT_PAGE_VIRTUAL

set_page_address() is used to set up the virtual addresses of
all pages that are part of the virtual compound.

Cc: [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/mm.h |9 -
 mm/highmem.c   |   10 --
 2 files changed, 16 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2007-10-03 19:39:52.0 -0700
+++ linux-2.6/include/linux/mm.h2007-10-03 19:40:29.0 -0700
@@ -605,7 +605,14 @@ void page_address_init(void);
 #endif
 
 #if !defined(HASHED_PAGE_VIRTUAL) && !defined(WANT_PAGE_VIRTUAL)
-#define page_address(page) lowmem_page_address(page)
+
+static inline void *page_address(struct page *page)
+{
+   if (unlikely(PageVcompound(page)))
+   return vmalloc_address(page);
+   return lowmem_page_address(page);
+}
+
 #define set_page_address(page, address)  do { } while(0)
 #define page_address_init()  do { } while(0)
 #endif
Index: linux-2.6/mm/highmem.c
===
--- linux-2.6.orig/mm/highmem.c 2007-10-03 19:39:25.0 -0700
+++ linux-2.6/mm/highmem.c  2007-10-03 19:40:29.0 -0700
@@ -265,8 +265,11 @@ void *page_address(struct page *page)
void *ret;
struct page_address_slot *pas;
 
-   if (!PageHighMem(page))
+   if (!PageHighMem(page)) {
+   if (PageVcompound(page))
+   return vmalloc_address(page);
return lowmem_page_address(page);
+   }
 
pas = page_slot(page);
ret = NULL;
@@ -294,7 +297,10 @@ void set_page_address(struct page *page,
struct page_address_slot *pas;
struct page_address_map *pam;
 
-   BUG_ON(!PageHighMem(page));
+   if (!PageHighMem(page)) {
+   BUG_ON(!PageVcompound(page));
+   return;
+   }
 
pas = page_slot(page);
if (virtual) {  /* Add */

-- 

[08/18] GFP_VFALLBACK: Allow fallback of compound pages to virtual mappings

2007-10-03 Thread Christoph Lameter

Add a new gfp flag

__GFP_VFALLBACK

If specified during a higher order allocation then the system will fall
back to vmap if no physically contiguous pages can be found. This will
create a virtually contiguous area instead of a physically contiguous area.
In many cases the virtually contiguous area can stand in for the physically
contiguous area (with some loss of performance).

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/gfp.h |5 +
 mm/page_alloc.c |  139 ++--
 2 files changed, 139 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-10-03 19:44:07.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-10-03 19:44:08.0 -0700
@@ -60,6 +60,9 @@ long nr_swap_pages;
 int percpu_pagelist_fraction;
 
 static void __free_pages_ok(struct page *page, unsigned int order);
+static struct page *alloc_vcompound(gfp_t, int,
+   struct zonelist *, unsigned long);
+static void destroy_compound_page(struct page *page, unsigned long order);
 
 /*
  * results with 256, 32 in the lowmem_reserve sysctl:
@@ -260,9 +263,51 @@ static void bad_page(struct page *page)
  * This usage means that zero-order pages may not be compound.
  */
 
+static void __free_vcompound(void *addr)
+{
+   struct page **pages;
+   int i;
+   struct page *page = vmalloc_to_page(addr);
+   int order = compound_order(page);
+   int nr_pages = 1 << order;
+
+   if (!PageVcompound(page) || !PageHead(page)) {
+   bad_page(page);
+   return;
+   }
+   destroy_compound_page(page, order);
+   pages = vunmap(addr);
+   /*
+* First page will have zero refcount since it maintains state
+* for the compound and was decremented before we got here.
+*/
+   set_page_address(page, NULL);
+   __ClearPageVcompound(page);
+   free_hot_page(page);
+
+   for (i = 1; i < nr_pages; i++) {
+   page = pages[i];
+   set_page_address(page, NULL);
+   __ClearPageVcompound(page);
+   __free_page(page);
+   }
+   kfree(pages);
+}
+
+
+static void free_vcompound(void *addr)
+{
+   __free_vcompound(addr);
+}
+
 static void free_compound_page(struct page *page)
 {
-   __free_pages_ok(page, compound_order(page));
+   if (PageVcompound(page))
+   free_vcompound(page_address(page));
+   else {
+   destroy_compound_page(page, compound_order(page));
+   __free_pages_ok(page, compound_order(page));
+   }
 }
 
 static void prep_compound_page(struct page *page, unsigned long order)
@@ -1259,6 +1304,67 @@ try_next_zone:
 }
 
 /*
+ * Virtual Compound Page support.
+ *
+ * Virtual Compound Pages are used to fall back to order 0 allocations if large
+ * linear mappings are not available and __GFP_VFALLBACK is set. They are
+ * formatted according to compound page conventions. I.e. following
+ * page->first_page if PageTail(page) is set can be used to determine the
+ * head page.
+ */
+static noinline struct page *alloc_vcompound(gfp_t gfp_mask, int order,
+   struct zonelist *zonelist, unsigned long alloc_flags)
+{
+   struct page *page;
+   int i;
+   struct vm_struct *vm;
+   int nr_pages = 1 << order;
+   struct page **pages = kmalloc(nr_pages * sizeof(struct page *),
+   gfp_mask & GFP_LEVEL_MASK);
+   struct page **pages2;
+
+   if (!pages)
+   return NULL;
+
+   gfp_mask &= ~(__GFP_COMP | __GFP_VFALLBACK);
+   for (i = 0; i < nr_pages; i++) {
+   page = get_page_from_freelist(gfp_mask, 0, zonelist,
+   alloc_flags);
+   if (!page)
+   goto abort;
+
+   /* Sets PageCompound which makes PageHead(page) true */
+   __SetPageVcompound(page);
+   pages[i] = page;
+   }
+
+   vm = get_vm_area_node(nr_pages << PAGE_SHIFT, VM_MAP,
+   zone_to_nid(zonelist->zones[0]), gfp_mask);
+   pages2 = pages;
+   if (map_vm_area(vm, PAGE_KERNEL, &pages2))
+   goto abort;
+
+   prep_compound_page(pages[0], order);
+
+   for (i = 0; i < nr_pages; i++)
+   set_page_address(pages[i], vm->addr + (i << PAGE_SHIFT));
+
+   return pages[0];
+
+abort:
+   while (i-- > 0) {
+   page = pages[i];
+   if (!page)
+   continue;
+   set_page_address(page, NULL);
+   __ClearPageVcompound(page);
+   __free_page(page);
+   }
+   kfree(pages);
+   return NULL;
+}
+
+/*
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page * fastcall
@@ -1353,12 +1459,12 @@
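
The fallback strategy this patch introduces can be illustrated with a userspace model. It is a sketch under assumptions, not the kernel allocator: a 16-entry bitmap of page frames replaces the buddy lists, frame indices replace struct page pointers, and every toy_* name is hypothetical. The model first looks for 2^order naturally aligned adjacent frames and, if none exist, gathers that many single frames, which the kernel would then map contiguously with vmap.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_FRAMES 16

static bool toy_frame_used[TOY_NR_FRAMES];

/* Claim 2^order aligned adjacent frames; return the first index or -1. */
static int toy_alloc_contig(unsigned int order)
{
	int nr = 1 << order;

	for (int base = 0; base + nr <= TOY_NR_FRAMES; base += nr) {
		int i;

		for (i = 0; i < nr && !toy_frame_used[base + i]; i++)
			;
		if (i == nr) {
			for (i = 0; i < nr; i++)
				toy_frame_used[base + i] = true;
			return base;
		}
	}
	return -1;
}

/* Gather 2^order frames from anywhere; undo everything on failure. */
static int toy_alloc_scattered(unsigned int order, int *out)
{
	int nr = 1 << order;
	int got = 0;

	for (int f = 0; f < TOY_NR_FRAMES && got < nr; f++) {
		if (!toy_frame_used[f]) {
			toy_frame_used[f] = true;
			out[got++] = f;
		}
	}
	if (got == nr)
		return 0;
	while (got-- > 0)
		toy_frame_used[out[got]] = false;
	return -1;
}

/*
 * Try the contiguous path first, then the virtual fallback.
 * Fills out[] with 2^order frame indices; returns 0 or -1.
 */
static int toy_alloc(unsigned int order, int *out, bool *virtual_fallback)
{
	int nr = 1 << order;
	int base = toy_alloc_contig(order);

	if (base >= 0) {
		for (int i = 0; i < nr; i++)
			out[i] = base + i;
		*virtual_fallback = false;
		return 0;
	}
	if (toy_alloc_scattered(order, out) == 0) {
		*virtual_fallback = true;
		return 0;
	}
	return -1;
}
```

When every second frame is in use, an order 2 request cannot be satisfied contiguously but still succeeds through the scattered path. A request larger than the total free memory fails either way and leaves the bitmap unchanged.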

[04/18] Vcompound: Smart up virt_to_head_page()

2007-10-03 Thread Christoph Lameter

The determination of a page struct for an address in a compound page
will need some more smarts in order to deal with virtual addresses.

We need to use the evil constants VMALLOC_START and VMALLOC_END for this
and they are notorious for referencing various arch header files or may
even be variables. Uninline the function to avoid trouble.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/mm.h |6 +-
 mm/page_alloc.c|   23 +++
 2 files changed, 24 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2007-10-03 19:21:50.0 -0700
+++ linux-2.6/include/linux/mm.h2007-10-03 19:23:08.0 -0700
@@ -315,11 +315,7 @@ static inline void get_page(struct page 
atomic_inc(&page->_count);
 }
 
-static inline struct page *virt_to_head_page(const void *x)
-{
-   struct page *page = virt_to_page(x);
-   return compound_head(page);
-}
+struct page *virt_to_head_page(const void *x);
 
 /*
  * Setup the page count before being freed into the page allocator for
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-10-03 19:21:50.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-10-03 19:23:08.0 -0700
@@ -150,6 +150,29 @@ int nr_node_ids __read_mostly = MAX_NUMN
 EXPORT_SYMBOL(nr_node_ids);
 #endif
 
+/*
+ * Determine the appropriate page struct given a virtual address
+ * (including vmalloced areas).
+ *
+ * Return the head page if this is a compound page.
+ *
+ * Cannot be inlined since VMALLOC_START and VMALLOC_END may contain
+ * complex calculations that depend on multiple arch includes or
+ * even variables.
+ */
+struct page *virt_to_head_page(const void *x)
+{
+   unsigned long addr = (unsigned long)x;
+   struct page *page;
+
+   if (unlikely(addr >= VMALLOC_START && addr < VMALLOC_END))
+   page = vmalloc_to_page((void *)addr);
+   else
+   page = virt_to_page(addr);
+
+   return compound_head(page);
+}
+
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
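
The address-range dispatch in virt_to_head_page() can be modelled in userspace. The sketch below is illustrative only: the TOY_VMALLOC_* bounds, the eight-entry toy_mem_map and the four-slot toy_vmap_table are made-up stand-ins for the real constants, the mem_map and a page-table walk.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT		12
#define TOY_VMALLOC_START	0x100000UL	/* hypothetical bounds */
#define TOY_VMALLOC_END		0x104000UL	/* four pages of vmalloc space */
#define TOY_NR_LINEAR		8

struct toy_page {
	struct toy_page *first_page;	/* set on tail pages, NULL otherwise */
};

/* Linear mapping: the address directly indexes the page array. */
static struct toy_page toy_mem_map[TOY_NR_LINEAR];

/* Vmalloc range: each slot needs a lookup, like a page-table walk. */
static struct toy_page *toy_vmap_table[4];

/* The caller must pass an address that is actually mapped. */
static struct toy_page *toy_virt_to_head_page(unsigned long addr)
{
	struct toy_page *page;

	if (addr >= TOY_VMALLOC_START && addr < TOY_VMALLOC_END)
		page = toy_vmap_table[(addr - TOY_VMALLOC_START) >> TOY_PAGE_SHIFT];
	else
		page = &toy_mem_map[addr >> TOY_PAGE_SHIFT];

	/* Tail pages point back at the head of their compound page. */
	return page->first_page ? page->first_page : page;
}
```

The range test is half-open, so an address equal to the end bound is treated as a linear address, matching the comparison in the patch.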

-- 

[07/18] Vcompound: Add compound_nth_page() to determine nth base page

2007-10-03 Thread Christoph Lameter

Add a new function

compound_nth_page(page, n)

and
vmalloc_nth_page(page, n)

to find the nth page of a compound page. For real compound pages
this simply reduces to page + n. For virtual compound pages we need to consult
the page tables to figure out the nth page from the one specified.

Update all the references to page[1] to use compound_nth_page() instead.

---
 include/linux/mm.h |   17 +
 mm/page_alloc.c|   16 +++-
 mm/vmalloc.c   |   10 ++
 3 files changed, 34 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2007-10-03 19:31:45.0 -0700
+++ linux-2.6/include/linux/mm.h2007-10-03 19:31:51.0 -0700
@@ -295,6 +295,8 @@ static inline int get_page_unless_zero(s
 }
 
 void *vmalloc_address(struct page *);
+struct page *vmalloc_to_page(void *addr);
+struct page *vmalloc_nth_page(struct page *page, int n);
 
 static inline struct page *compound_head(struct page *page)
 {
@@ -338,27 +340,34 @@ void split_page(struct page *page, unsig
  */
 typedef void compound_page_dtor(struct page *);
 
+static inline struct page *compound_nth_page(struct page *page, int n)
+{
+   if (likely(!PageVcompound(page)))
+   return page + n;
+   return vmalloc_nth_page(page, n);
+}
+
 static inline void set_compound_page_dtor(struct page *page,
compound_page_dtor *dtor)
 {
-   page[1].lru.next = (void *)dtor;
+   compound_nth_page(page, 1)->lru.next = (void *)dtor;
 }
 
 static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
 {
-   return (compound_page_dtor *)page[1].lru.next;
+   return (compound_page_dtor *)compound_nth_page(page, 1)->lru.next;
 }
 
 static inline int compound_order(struct page *page)
 {
if (!PageHead(page))
return 0;
-   return (unsigned long)page[1].lru.prev;
+   return (unsigned long)compound_nth_page(page, 1)->lru.prev;
 }
 
 static inline void set_compound_order(struct page *page, unsigned long order)
 {
-   page[1].lru.prev = (void *)order;
+   compound_nth_page(page, 1)->lru.prev = (void *)order;
 }
 
 /*
Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-10-03 19:31:45.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-10-03 19:31:51.0 -0700
@@ -541,6 +541,16 @@ void *vmalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc);
 
+/*
+ * Given a pointer to the first page struct:
+ * Determine a pointer to the nth page.
+ */
+struct page *vmalloc_nth_page(struct page *page, int n)
+{
+   return vmalloc_to_page(page_address(page) + n * PAGE_SIZE);
+}
+EXPORT_SYMBOL(vmalloc_nth_page);
+
 /**
  * vmalloc_user - allocate zeroed virtually contiguous memory for userspace
  * @size: allocation size
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c  2007-10-03 19:31:51.0 -0700
+++ linux-2.6/mm/page_alloc.c   2007-10-03 19:32:45.0 -0700
@@ -274,7 +274,7 @@ static void prep_compound_page(struct pa
set_compound_order(page, order);
__SetPageHead(page);
for (i = 1; i < nr_pages; i++) {
-   struct page *p = page + i;
+   struct page *p = compound_nth_page(page, i);
 
__SetPageTail(p);
p->first_page = page;
@@ -289,17 +289,23 @@ static void destroy_compound_page(struct
if (unlikely(compound_order(page) != order))
bad_page(page);
 
-   if (unlikely(!PageHead(page)))
-   bad_page(page);
-   __ClearPageHead(page);
for (i = 1; i < nr_pages; i++) {
-   struct page *p = page + i;
+   struct page *p = compound_nth_page(page,  i);
 
if (unlikely(!PageTail(p) |
(p->first_page != page)))
bad_page(page);
__ClearPageTail(p);
}
+
+   /*
+* The PageHead is important since it determines how operations on
+* a compound page have to be performed. We can only tear the head
+* down after all the tail pages are done.
+*/
+   if (unlikely(!PageHead(page)))
+   bad_page(page);
+   __ClearPageHead(page);
 }
 
 static inline void prep_zero_page(struct page *page, int order, gfp_t 
gfp_flags)
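
The two lookup paths of compound_nth_page() can be shown with a small userspace model. It is a sketch with assumed names: struct toy_cpage, its vcompound flag and its vmap array are hypothetical, and the array lookup stands in for the page-table walk done by vmalloc_nth_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cpage {
	bool vcompound;			/* head of a virtual compound page */
	struct toy_cpage **vmap;	/* constituent pages, in mapping order */
};

/*
 * Physically contiguous compound: page structs are adjacent, so the
 * nth page is plain pointer arithmetic. Virtual compound: the pages are
 * scattered, so the nth one has to be looked up.
 */
static struct toy_cpage *toy_compound_nth_page(struct toy_cpage *head, int n)
{
	if (!head->vcompound)
		return head + n;
	return head->vmap[n];
}
```

Code that wrote page[1] directly would silently touch an unrelated page struct in the virtual case, which is why the patch routes those accesses through the helper.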

-- 

[02/18] vunmap: return page array passed on vmap()

2007-10-03 Thread Christoph Lameter

Make vunmap return the page array that was used at vmap. This is useful
if one has no structures to track the page array but simply stores the
virtual address somewhere. The disposition of the page array can be
decided upon after vunmap. vfree() may now also be used instead of
vunmap which will release the page array after vunmap'ping it.

As noted by Kamezawa: The same subsystem that provides the page array
to vmap must use its own method to dispose of the page array.

If vfree() is called to free the page array then the page array must either
be

1. Allocated via the slab allocator

2. Allocated via vmalloc but then VM_VPAGES must have been passed at
   vmap to specify that a vfree is needed.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/vmalloc.h |2 +-
 mm/vmalloc.c|   32 ++--
 2 files changed, 23 insertions(+), 11 deletions(-)

Index: linux-2.6/include/linux/vmalloc.h
===
--- linux-2.6.orig/include/linux/vmalloc.h  2007-10-03 16:19:29.0 -0700
+++ linux-2.6/include/linux/vmalloc.h   2007-10-03 16:19:41.0 -0700
@@ -49,7 +49,7 @@ extern void vfree(void *addr);
 
 extern void *vmap(struct page **pages, unsigned int count,
unsigned long flags, pgprot_t prot);
-extern void vunmap(void *addr);
+extern struct page **vunmap(void *addr);
 
 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-10-03 16:19:35.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-10-03 16:20:15.0 -0700
@@ -152,6 +152,7 @@ int map_vm_area(struct vm_struct *area, 
unsigned long addr = (unsigned long) area->addr;
unsigned long end = addr + area->size - PAGE_SIZE;
int err;
+   area->pages = *pages;
 
BUG_ON(addr >= end);
pgd = pgd_offset_k(addr);
@@ -162,6 +163,8 @@ int map_vm_area(struct vm_struct *area, 
break;
} while (pgd++, addr = next, addr != end);
flush_cache_vmap((unsigned long) area->addr, end);
+
+   area->nr_pages = *pages - area->pages;
return err;
 }
 EXPORT_SYMBOL_GPL(map_vm_area);
@@ -318,17 +321,18 @@ struct vm_struct *remove_vm_area(void *a
return v;
 }
 
-static void __vunmap(void *addr, int deallocate_pages)
+static struct page **__vunmap(void *addr, int deallocate_pages)
 {
struct vm_struct *area;
+   struct page **pages;
 
if (!addr)
-   return;
+   return NULL;
 
if ((PAGE_SIZE-1) & (unsigned long)addr) {
printk(KERN_ERR "Trying to vfree() bad address (%p)\n", addr);
WARN_ON(1);
-   return;
+   return NULL;
}
 
area = remove_vm_area(addr);
@@ -336,29 +340,30 @@ static void __vunmap(void *addr, int dea
printk(KERN_ERR "Trying to vfree() nonexistent vm area (%p)\n",
addr);
WARN_ON(1);
-   return;
+   return NULL;
}
 
+   pages = area->pages;
debug_check_no_locks_freed(addr, area->size);
 
if (deallocate_pages) {
int i;
 
for (i = 0; i < area->nr_pages; i++) {
-   struct page *page = area->pages[i];
+   struct page *page = pages[i];
 
BUG_ON(!page);
__free_page(page);
}
 
if (area->flags & VM_VPAGES)
-   vfree(area->pages);
+   vfree(pages);
else
-   kfree(area->pages);
+   kfree(pages);
}
 
kfree(area);
-   return;
+   return pages;
 }
 
 /**
@@ -387,10 +392,10 @@ EXPORT_SYMBOL(vfree);
  *
  * Must not be called in interrupt context.
  */
-void vunmap(void *addr)
+struct page **vunmap(void *addr)
 {
BUG_ON(in_interrupt());
-   __vunmap(addr, 0);
+   return __vunmap(addr, 0);
 }
 EXPORT_SYMBOL(vunmap);
 
@@ -403,6 +408,13 @@ EXPORT_SYMBOL(vunmap);
  *
  * Maps @count pages from @pages into contiguous kernel virtual
  * space.
+ *
+ * The page array may be freed via vfree() on the virtual address
+ * returned. In that case the page array must be allocated via
+ * the slab allocator. If the page array was allocated via
+ * vmalloc then VM_VPAGES must be specified in the flags. There is
+ * no support for vfree() to free a page array allocated via the
+ * page allocator.
  */
 void *vmap(struct page **pages, unsigned int count,
unsigned long flags, pgprot_t prot)
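
Note for readers of this patch: the following is a minimal user-space
sketch of the ownership idea, not the kernel code. With vunmap() returning
the page array, the caller that supplied the array to vmap() gets it back
and decides how to free it. All names here (struct vm_area_model,
model_vmap, model_vunmap) are hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct page { int id; };

struct vm_area_model {
	struct page **pages;	/* page array handed to the mapping */
	unsigned int nr_pages;
};

/* Model of vmap(): remember the caller's page array in the area. */
static struct vm_area_model *model_vmap(struct page **pages, unsigned int count)
{
	struct vm_area_model *area = malloc(sizeof(*area));

	if (!area)
		return NULL;
	area->pages = pages;
	area->nr_pages = count;
	return area;
}

/*
 * Model of the patched vunmap(): tear down the mapping and hand the
 * page array back, so the caller decides how to free it.
 */
static struct page **model_vunmap(struct vm_area_model *area)
{
	struct page **pages;

	if (!area)
		return NULL;
	pages = area->pages;
	free(area);
	return pages;
}
```

A caller would keep the returned pointer and release the array with
whichever allocator it originally used for it.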


[03/18] vmalloc_address(): Determine vmalloc address from page struct

2007-10-03 Thread Christoph Lameter

Sometimes we need to figure out which vmalloc address is in use
for a certain page struct. There is no easy way to figure out
the vmalloc address from the page struct. Simply search through
the kernel page tables to find the address. Use sparingly.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/mm.h |2 +
 mm/vmalloc.c   |   79 +
 2 files changed, 81 insertions(+)
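
For illustration only, a user-space model of the linear search described
above: walk a small two-level table, skip absent lower-level tables, and
return the first virtual address whose entry holds the wanted pfn. The
table sizes, MODEL_VSTART and every model_* name are assumptions made for
this sketch; the real code walks pgd/pud/pmd/pte levels.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE   4096UL
#define PTES_PER_TABLE    4		/* tiny on purpose */
#define TABLES            3
#define MODEL_VSTART      0x100000UL	/* hypothetical start of the range */
#define PFN_NONE          0UL		/* 0 marks a non-present entry here */

/* One leaf table: pfn per slot, PFN_NONE if nothing is mapped. */
struct leaf_table { unsigned long pfn[PTES_PER_TABLE]; };

/* Top level: NULL means the whole leaf table is absent and is skipped. */
struct top_table { struct leaf_table *leaf[TABLES]; };

/* Scan one leaf table; return the matching virtual address or 0. */
static unsigned long scan_leaf(const struct leaf_table *lt,
			       unsigned long addr, unsigned long pfn)
{
	for (int i = 0; i < PTES_PER_TABLE; i++, addr += MODEL_PAGE_SIZE)
		if (lt->pfn[i] != PFN_NONE && lt->pfn[i] == pfn)
			return addr;
	return 0;
}

/* Linear search over the whole range, skipping absent leaf tables. */
static unsigned long model_vaddr_of(const struct top_table *top,
				    unsigned long pfn)
{
	unsigned long addr = MODEL_VSTART;

	for (int t = 0; t < TABLES;
	     t++, addr += PTES_PER_TABLE * MODEL_PAGE_SIZE) {
		unsigned long n;

		if (!top->leaf[t])
			continue;
		n = scan_leaf(top->leaf[t], addr, pfn);
		if (n)
			return n;
	}
	return 0;
}
```

The cost is proportional to the size of the searched range, which is why
the description says to use the function sparingly.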

Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-10-03 16:20:15.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-10-03 16:20:48.0 -0700
@@ -840,3 +840,82 @@ void free_vm_area(struct vm_struct *area
kfree(area);
 }
 EXPORT_SYMBOL_GPL(free_vm_area);
+
+
+/*
+ * Determine vmalloc address from a page struct.
+ *
+ * Linear search through all ptes of the vmalloc area.
+ */
+static unsigned long vaddr_pte_range(pmd_t *pmd, unsigned long addr,
+   unsigned long end, unsigned long pfn)
+{
+   pte_t *pte;
+
+   pte = pte_offset_kernel(pmd, addr);
+   do {
+   pte_t ptent = *pte;
+   if (pte_present(ptent) && pte_pfn(ptent) == pfn)
+   return addr;
+   } while (pte++, addr += PAGE_SIZE, addr != end);
+   return 0;
+}
+
+static inline unsigned long vaddr_pmd_range(pud_t *pud, unsigned long addr,
+   unsigned long end, unsigned long pfn)
+{
+   pmd_t *pmd;
+   unsigned long next;
+   unsigned long n;
+
+   pmd = pmd_offset(pud, addr);
+   do {
+   next = pmd_addr_end(addr, end);
+   if (pmd_none_or_clear_bad(pmd))
+   continue;
+   n = vaddr_pte_range(pmd, addr, next, pfn);
+   if (n)
+   return n;
+   } while (pmd++, addr = next, addr != end);
+   return 0;
+}
+
+static inline unsigned long vaddr_pud_range(pgd_t *pgd, unsigned long addr,
+   unsigned long end, unsigned long pfn)
+{
+   pud_t *pud;
+   unsigned long next;
+   unsigned long n;
+
+   pud = pud_offset(pgd, addr);
+   do {
+   next = pud_addr_end(addr, end);
+   if (pud_none_or_clear_bad(pud))
+   continue;
+   n = vaddr_pmd_range(pud, addr, next, pfn);
+   if (n)
+   return n;
+   } while (pud++, addr = next, addr != end);
+   return 0;
+}
+
+void *vmalloc_address(struct page *page)
+{
+   pgd_t *pgd;
+   unsigned long next, n;
+   unsigned long addr = VMALLOC_START;
+   unsigned long pfn = page_to_pfn(page);
+
+   pgd = pgd_offset_k(VMALLOC_START);
+   do {
+   next = pgd_addr_end(addr, VMALLOC_END);
+   if (pgd_none_or_clear_bad(pgd))
+   continue;
+   n = vaddr_pud_range(pgd, addr, next, pfn);
+   if (n)
+   return (void *)n;
+   } while (pgd++, addr = next, addr < VMALLOC_END);
+   return NULL;
+}
+EXPORT_SYMBOL(vmalloc_address);
+
Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h   2007-10-03 16:19:27.0 -0700
+++ linux-2.6/include/linux/mm.h2007-10-03 16:20:48.0 -0700
@@ -294,6 +294,8 @@ static inline int get_page_unless_zero(s
return atomic_inc_not_zero(&page->_count);
 }
 
+void *vmalloc_address(struct page *);
+
 static inline struct page *compound_head(struct page *page)
 {
if (unlikely(PageTail(page)))

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[05/18] Page flags: Add PageVcompound()

2007-10-03 Thread Christoph Lameter

Add another page flag that can be used to figure out if a compound
page is virtually mapped. The mark is necessary since we have to know
when freeing pages if we have to destroy a virtual mapping. No additional
flag is consumed through the use of PG_swapcache together with PG_compound
(similar to PageHead() and PageTail()).

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/page-flags.h |   18 ++
 1 file changed, 18 insertions(+)
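
A small user-space sketch of the two-bit test may help when reading the
macros below. The bit numbers are assumptions chosen for the example, not
the kernel's values, and the model_* names are hypothetical.

```c
#include <stdbool.h>

/* Bit numbers are assumptions for this sketch, not the kernel's values. */
enum { MODEL_PG_compound = 14, MODEL_PG_swapcache = 15 };

#define MODEL_VCOMPOUND_MASK \
	((1UL << MODEL_PG_compound) | (1UL << MODEL_PG_swapcache))

struct page_model { unsigned long flags; };

/* True only when both bits are set, mirroring the "== mask" comparison. */
static bool model_page_vcompound(const struct page_model *page)
{
	return (page->flags & MODEL_VCOMPOUND_MASK) == MODEL_VCOMPOUND_MASK;
}

static void model_set_vcompound(struct page_model *page)
{
	page->flags |= MODEL_VCOMPOUND_MASK;
}

/* Clears the compound bit as well, as the helper in the patch does. */
static void model_clear_vcompound(struct page_model *page)
{
	page->flags &= ~MODEL_VCOMPOUND_MASK;
}
```

A page with only the compound bit set does not pass the test; both bits
are required.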

Index: linux-2.6/include/linux/page-flags.h
===
--- linux-2.6.orig/include/linux/page-flags.h   2007-10-03 19:31:51.0 -0700
+++ linux-2.6/include/linux/page-flags.h        2007-10-03 19:34:37.0 -0700
@@ -248,6 +248,24 @@ static inline void __ClearPageTail(struc
 #define __SetPageHead(page)	__SetPageCompound(page)
 #define __ClearPageHead(page)	__ClearPageCompound(page)
 
+/*
+ * PG_swapcache is used in combination with PG_compound to indicate
+ * that a compound page was allocated via vmalloc.
+ */
+#define PG_vcompound_mask ((1L << PG_compound) | (1L << PG_swapcache))
+#define PageVcompound(page)	((page->flags & PG_vcompound_mask) \
+				== PG_vcompound_mask)
+
+static inline void __SetPageVcompound(struct page *page)
+{
+   page->flags |= PG_vcompound_mask;
+}
+
+static inline void __ClearPageVcompound(struct page *page)
+{
+   page->flags &= ~PG_vcompound_mask;
+}
+
 #ifdef CONFIG_SWAP
 #define PageSwapCache(page)	test_bit(PG_swapcache, &(page)->flags)
 #define SetPageSwapCache(page)	set_bit(PG_swapcache, &(page)->flags)


[00/18] Virtual Compound Page Support V2

2007-10-03 Thread Christoph Lameter

Allocations of larger pages are not reliable in Linux. If larger
pages have to be allocated then one faces various choices of allowing
graceful fallback or using vmalloc with a performance penalty due
to the use of a page table. Virtual Compound pages are
a simple solution out of this dilemma. If an allocation specifies
GFP_VFALLBACK then the page allocator will first attempt to satisfy
the request with physically contiguous memory. If that is not possible
then the page allocator will create a virtually contiguous memory
area for the caller. That way large allocations may perhaps be
considered "reliable" independent of the memory fragmentation situation.
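
As a rough user-space illustration of the fallback order described above
(contiguous first, virtual mapping only if that fails and the caller
allowed it): MODEL_VFALLBACK, the "fragmented" knob and the model_* names
are hypothetical, and malloc() merely stands in for both allocators.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_VFALLBACK 0x1u	/* hypothetical stand-in for GFP_VFALLBACK */

struct model_alloc {
	void *mem;
	bool virtually_mapped;	/* true when the fallback path was taken */
};

/* Pretend contiguous allocator: fails when told memory is fragmented. */
static void *model_alloc_contig(size_t size, bool fragmented)
{
	return fragmented ? NULL : malloc(size);
}

/* Try contiguous memory first; fall back only if the caller allowed it. */
static struct model_alloc model_alloc_pages(size_t size, unsigned int flags,
					    bool fragmented)
{
	struct model_alloc a = { model_alloc_contig(size, fragmented), false };

	if (!a.mem && (flags & MODEL_VFALLBACK)) {
		a.mem = malloc(size);	/* stands in for a virtual mapping */
		a.virtually_mapped = a.mem != NULL;
	}
	return a;
}
```

Without the flag the request simply fails under fragmentation, which is
the situation callers have to handle by hand today.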

This means that memory with optimal performance is used when available.
We are currently gradually introducing methods to reduce memory
fragmentation. The better these methods become, the lower the
chance that fallback will occur.

Fallback is rare in particular on machines with contemporary memory
sizes of 1G or more. It seems to take special load situations that
pin a lot of memory and systems with low memory in order to get
system memory so fragmented that the fallback scheme must kick in.

There is therefore a compile time option to switch on fallback for
testing purposes. Virtually mapped memory may behave differently
and the CONFIG_FALLBACK_ALWAYS option will ensure that the code is
tested to deal with virtual memory.

The patchset then addresses a series of issues in the current code
through the use of fallbacks:

- Fallback for x86_64 stack allocations. The default stack size
  is 8k which requires an order 1 allocation.

- Removes the manual fallback to vmalloc for sparsemem
  through the use of GFP_VFALLBACK.

- Uses a compound page for the wait table in the zone thereby
  avoiding having to go through a page table to get to the
  data structures used for waiting on events in pages.

- Allows fallback for the order 2 allocation in the crypto
  subsystem.

- Allows fallback for the caller table used by SLUB when determining
  the call sites for slab caches for sysfs output.

- Allows a configurable stack size on x86_64 (up to 32k).
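
The sizes and orders in the list above are related by a simple power-of-two
rule. A sketch of that arithmetic, assuming 4 KiB pages (model_get_order is
a hypothetical helper written for this note, not the kernel's get_order()):

```c
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumes 4 KiB pages, as on x86_64 */

/* Smallest order such that (page size << order) >= size. */
static unsigned int model_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = (size_t)1 << MODEL_PAGE_SHIFT;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Under that assumption an 8k stack is an order 1 request, an order 2
request covers 16k, and a 32k stack is an order 3 request.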

More uses are possible by simply adding GFP_VFALLBACK to the page
flags or by converting vmalloc calls to regular page allocator calls.

It is likely that we have had to avoid the use of larger memory areas
because of the reliability issues. The patch may simplify future coding
of handling large memory areas because these issues are taken care of
by the page allocator. For HPC uses we constantly have to deal with
demands for larger and larger memory areas to speed up various loads.

Additional patches exist to enable SLUB and the Large Blocksize Patchset
to use these fallbacks.

The patchset is also available via git from the largeblock git tree via

git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git vcompound

V1->V2
- Remove some cleanup patches and the SLUB patches from this set.
- Transparent vcompound support through page_address() and
  virt_to_head_page().
- Additional use cases.
- Factor the code better for an easier read
- Add configurable stack size.
- Follow up on various suggestions made for V1

RFC->V1
- Complete support for all compound functions for virtual compound pages
  (including the compound_nth_page() necessary for LBS mmap support)
- Fix various bugs
- Fix i386 build


[01/18] vmalloc: clean up page array indexing

2007-10-03 Thread Christoph Lameter

The page array is repeatedly indexed both in vunmap and vmalloc_area_node().
Add a temporary variable to make it easier to read (and easier to patch
later).

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/vmalloc.c |   16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)
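
The pattern this cleanup introduces (hold the page in a local, store it in
the array only once it is known to be valid, trim nr_pages on failure) can
be sketched in user space as follows. The fail_at parameter and the model_*
names are hypothetical and exist only so the error path can be exercised.

```c
#include <stdlib.h>

struct page_model { int unused; };

struct area_model {
	struct page_model **pages;
	unsigned int nr_pages;
};

/*
 * Fill area->pages one entry at a time. Returns 0 on success, -1 on
 * failure with nr_pages trimmed to the number of entries that were
 * actually allocated.
 */
static int model_fill_pages(struct area_model *area, unsigned int fail_at)
{
	for (unsigned int i = 0; i < area->nr_pages; i++) {
		struct page_model *page;

		page = (i == fail_at) ? NULL : malloc(sizeof(*page));
		if (!page) {
			area->nr_pages = i;	/* only i entries are valid */
			return -1;
		}
		area->pages[i] = page;
	}
	return 0;
}

/* Free exactly the entries recorded in nr_pages. */
static void model_free_pages(struct area_model *area)
{
	for (unsigned int i = 0; i < area->nr_pages; i++) {
		struct page_model *page = area->pages[i];

		free(page);
	}
	area->nr_pages = 0;
}
```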

Index: linux-2.6/mm/vmalloc.c
===
--- linux-2.6.orig/mm/vmalloc.c 2007-10-02 09:26:16.0 -0700
+++ linux-2.6/mm/vmalloc.c  2007-10-02 21:35:34.0 -0700
@@ -345,8 +345,10 @@ static void __vunmap(void *addr, int dea
int i;
 
for (i = 0; i < area->nr_pages; i++) {
-   BUG_ON(!area->pages[i]);
-   __free_page(area->pages[i]);
+   struct page *page = area->pages[i];
+
+   BUG_ON(!page);
+   __free_page(page);
}
 
if (area->flags & VM_VPAGES)
@@ -450,15 +452,19 @@ void *__vmalloc_area_node(struct vm_stru
}
 
for (i = 0; i < area->nr_pages; i++) {
+   struct page *page;
+
if (node < 0)
-   area->pages[i] = alloc_page(gfp_mask);
+   page = alloc_page(gfp_mask);
else
-   area->pages[i] = alloc_pages_node(node, gfp_mask, 0);
-   if (unlikely(!area->pages[i])) {
+   page = alloc_pages_node(node, gfp_mask, 0);
+
+   if (unlikely(!page)) {
/* Successfully allocated i pages, free them in __vunmap() */
area->nr_pages = i;
goto fail;
}
+   area->pages[i] = page;
}
 
if (map_vm_area(area, prot, &pages))


Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

2007-10-03 Thread Eric W. Biederman

Loic Prylli <[EMAIL PROTECTED]> writes:

> Even if the INTx line is not raised, you cannot rely on the device to
> retain memory of an interrupt triggered while MSI is disabled, and
> expect it to fire it in MSI form later when MSI is reenabled.

Sure.  My expectation is if we happened to hit such a narrow window
the irq would simply be dropped.

>
>> If you have a mask bit implemented you are required to be
>> able to refire it after the msi is enabled. 
>
>
>
> Indeed the masking case is well-defined by the spec (including the
> operation of the pending bits). And my subject was definitely restricted
> to devices without that masking capability.

Right.  And INTx has such a pending bit as well.  I guess I figured
if MSI was enabled transferring it over would be the obvious thing to
do.

> OK no-op was a bug, but using the enable-bit for temporary masking
> purposes still feels like a bug. I am afraid the only safe solution
> might be to prohibit any operation that absolutely requires masking if
> real masking is not available. Maybe the set_affinity method should
> simply be disabled for devices not supporting masking (unless there is an
> option of doing it without masking for instance by guaranteeing only one
> word of the MSI capability is changed).

It's worth looking at, I think that happens in the common case.

Of course it might even make sense simply to refuse to enable MSI
if there is not a masking capability present.
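
To make that concrete, here is a sketch of the check such a policy would
rest on: testing the per-vector masking capable bit in the MSI Message
Control word. The bit positions follow the PCI spec layout of that word;
the msi_may_retarget() policy helper is hypothetical and not existing
kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Message Control bits of the MSI capability. */
#define MSI_FLAGS_ENABLE   0x0001u	/* MSI enable */
#define MSI_FLAGS_64BIT    0x0080u	/* 64-bit address capable */
#define MSI_FLAGS_MASKBIT  0x0100u	/* per-vector masking capable */

static bool msi_has_mask_bit(uint16_t msg_ctrl)
{
	return (msg_ctrl & MSI_FLAGS_MASKBIT) != 0;
}

/*
 * Hypothetical policy helper: permit operations that need masking
 * (for example retargeting the vector) only when real masking exists.
 */
static bool msi_may_retarget(uint16_t msg_ctrl)
{
	return msi_has_mask_bit(msg_ctrl);
}
```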

>> The PCI spec requires disabling/masking the msi when reprogramming it.
>> So as a general rule we can not do better. 
>
>
>
> Do you have a reference for that requirement? The spec only vaguely
> associates MSI programming with "configuration", but I haven't found any
> explicit indication that it should not work.

I would have to look it up again but it said that the result is only
defined in the case when it is disabled/masked, when I looked a couple
of months ago.

>> I suspect what needs to happen is a spec search to verify that the
>> current linux behavior is at least reasonable within the spec.
>>   
>
>
> I don't see how you can disable MSI through the control bit (which is
> equivalent to switching the device to INTx whether or not the INTx
> disable bit is set in PCI_COMMAND) in the middle of operations, not tell
> the driver, and not risk losing interrupts (unless you rely on much
> more than the spec).

I will relook.  My impression is that bit is defined as MSI enable.
Not mode switch.   Although myrinet has clearly implemented it as
mode switch.

>> I don't want to break anyones hardware, but at the same time I want us
>> to be careful and in spec for the default case.
>>
>>   
>
>
> The interrupt while doing set_affinity masking would certainly cause a
> problem for the device we use (MSI-enable switch between INTx and MSI
> mode, and both interrupts are not acked the same way assuming they would
> even be delivered to the driver), but I got some new data: upon further
> examination, the lost interrupts we have seen seem in fact to be caused
> at a different time:
> - the problem is the mask_ack_irq() done in handle_edge_irq() when a
> new interrupt arrives before the IRQ_INPROGRESS bit is cleared at the end
> of the function.
>
> Again here, switching MSI-off during hot operation breaks the interrupt
> accounting and handshaking between our driver and device. At least this
> case might be easier to handle, it seems safe to not mask there (when
> some proven masking is not available).

Interesting.  So an irq fires before the driver has finished
processing the last instance of the irq.  This is very close to a
screaming irq and something we may actually want to deal with.

Eric

Re: File corruption when using kernels 2.6.18+

2007-10-03 Thread Linus Torvalds

On Wed, 3 Oct 2007, Robert Hancock wrote:
> 
> Erratum 97: 128-Bit Streaming Stores May Cause Coherency Failure

The Intel-optimized memcpy doesn't use the SSE registers, just regular 
32-bit integer nontemporal stores (movnti). The reason is that the SSE 
state save is too expensive to be worth it.
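
For readers unfamiliar with the instruction, a user-space sketch of a copy
loop built on 32-bit non-temporal stores is shown here. It is not the
kernel's memcpy; copy_words_nt is a hypothetical name, and the code falls
back to ordinary stores when SSE2 is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si32; pulls in _mm_sfence as well */
#endif

/*
 * Copy n 32-bit words. With SSE2 each store is a non-temporal movnti
 * that bypasses the cache; otherwise ordinary stores are used.
 */
static void copy_words_nt(int32_t *dst, const int32_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
		_mm_stream_si32((int *)&dst[i], (int)src[i]);
#else
		dst[i] = src[i];
#endif
	}
#if defined(__SSE2__)
	_mm_sfence();	/* order the streaming stores before later accesses */
#endif
}
```

Only general-purpose registers are involved, so no SSE register state has
to be saved around the copy.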

So it's not that. Also, considering that it was a single-bit error in all 
the cases I saw, I wouldn't expect it to be a cache coherency problem, 
which I'd expect to corrupt a whole cacheline or possibly at least a whole 
access.

That said, bit corruption can be just about anything. It's certainly not 
impossible that it's a CPU bug.

But my first guess would be slightly dodgy motherboard, possibly coupled 
with a chipset that simply isn't very tolerant to any timing errors. If 
the motherboard traces to the DDR aren't impedance-matched, or if the 
traces don't have the same length, or if the capacitors that are supposed 
to handle spikes in burst current aren't up to snuff, you'll just get 
noisy lines.

And at some point, noisy lines means that you go from reliable operation 
to "oh, that bit didn't make it correctly".

Lowering the front-side bus frequency or altering the memory timings can 
help (ie doing things like running DDR-333 at DDR-266). Making sure that 
your power supply isn't even close to its limits is good. And choosing a 
motherboard and chipsets from a reliable manufacturer is more than a good 
idea.

The reason why it's interesting that the errors seemed to happen in the 
same byte-lane is that I think it's common policy to route data lines on 
the same layer, and matching trace length per group is very important, 
because you do signal clocking per-group, afaik. But on the other hand, 
multiple layers on the board are expensive, so people try to minimize 
them, and maybe you end up routing through a via to another layer - which 
then makes timing and capacitance harder.

Or there aren't ground lines close enough, or the data lines are too close 
to other lines and you get cross-talk etc etc.

No, I've not done board design, and I don't know what I'm talking about, 
but look at the interesting zig-zagging the data (and address) lines often 
do on the board. It often looks totally crazy ("why doesn't that line just 
go straight?"), but the thing is that the groups all need to have the same 
length, but the pins are all at different points, so you can't make the 
lines straight, or some of them would be much shorter than others.

And if something is border-line, it may work all of the time - *until* you 
hit specific patterns that cause lots of lines to wiggle around, and then 
a capacitor won't handle the extra current draw from switching, or 
cross-talk between lines hits you, and what used to work doesn't work any 
more.

I wish we all had ECC memory. That gets rid of a lot of worries.

Linus

Re: [patch take 2][Intel-IOMMU] Fix for IOMMU early crash

2007-10-03 Thread Benjamin Herrenschmidt


> > Why don't you use the new struct dev_archdata mechanism ? That's what I
> > use on powerpc to provide optional iommu linkage to any device in the
> > system.
> Good one. I will certainly try out your idea and will update the list
> tomorrow.

The advantage is that it lets you completely isolate the iommu code
from any dependency on PCI, which means you can implement DMA ops
support for various platform devices or other fancy things. Maybe not
the most useful in x86-land, but still ;-)

Ben.



Re: File corruption when using kernels 2.6.18+

2007-10-03 Thread Robert Hancock


Linus Torvalds wrote:
> On Wed, 3 Oct 2007, Pekka Enberg wrote:
> > On 10/3/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > > I would bet that the reason the intel-optimized memcpy triggers this is
> > > that the non-temporal stores just means that you go out directly on the
> > > bus, and it probably just shows a weakness in the chipset or bus that
> > > doesn't show with the normal cacheline accesses.
> >
> > But that should show up with memtest too, no?
>
> Not unless memtest uses non-temporal stores with the same (or similar)
> access patterns.
>
> The thing is, the CPU cache hides a *lot* of activity from the chipset,
> and changes the access patterns radically.
>
> With normal cached accesses, you'd normally see just the "fill cacheline"
> and "write out cacheline" pattern. With movnt, you'd see non-cacheline
> accesses to memory. If the chipset was tested under mostly normal loads,
> the movnt cases have been getting a lot less coverage.
>
> Now, I do agree that it certainly *can* be a CPU bug too.  I doubt it,
> though.
>
> I'd check the power supply (brownouts cause random corruption, and it
> might have a "peak power pattern" thing to it), and it's worth re-seating
> any DIMM's etc. And it's definitely worth going into the BIOS setup screen
> and making sure that nothing is even close to debatable (ie take RAM
> timings down to non-aggressive levels, make sure bus frequencies and
> multipliers are not even close to borderline, etc etc).


I didn't see what CPU this was, but there was this nasty erratum on some 
Athlon 64/Opteron processors. I was trying to debug a problem someone 
else mentioned a while ago (and which I could duplicate on my system) 
where doing huge memsets in userspace (which glibc uses non-temporal 
stores for) repeatedly would cause a system lockup or crash. Amazingly 
enough after I upgraded the CPU from my old Athlon 64 3500+ to a new X2 
4200+ the problem went away..


At the time I looked into whether this workaround could be applied in 
the kernel if the BIOS failed to, but it seemed that accesses to the MSR 
they mentioned failed, so I don't know what the story is..


from 
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25759.pdf 



Erratum 97: 128-Bit Streaming Stores May Cause Coherency Failure
Description: Under a specific set of internal pipeline conditions, stale 
data may be left in the L1 cache when a 128-bit streaming store (MOVNT*) 
to a writeback (WB) memory type misses in the L1 data cache and both L1 
and L2 TLBs.

Potential Effect on System
Memory coherence failures leading to unpredictable operation.
Suggested Workaround
BIOS should set DC_CFG.DIS_CNV_WC_SSO (bit 3 of MSR 0xC001_1022). The 
performance effects of setting this bit are limited to streaming stores 
to the write-combining (WC) memory type, a case expected to rarely occur 
in actual usage. No loss of performance occurs in the general case (WB 
memory type).

This workaround must not be applied to processors prior to revision C0.
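
Expressed as bit arithmetic, the workaround quoted above amounts to the
following sketch. It only computes the new register value; it does not
read or write any MSR. The rev_c0_or_later flag and the function name are
hypothetical, standing in for whatever revision check the caller has.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_DC_CFG            0xC0011022u	/* MSR number from the erratum */
#define DC_CFG_DIS_CNV_WC_SSO (1ULL << 3)	/* bit 3 per the erratum text */

/*
 * Return the DC_CFG value with the workaround bit set. The value is left
 * untouched for revisions before C0, where the erratum says the
 * workaround must not be applied.
 */
static uint64_t erratum97_apply(uint64_t dc_cfg, bool rev_c0_or_later)
{
	if (!rev_c0_or_later)
		return dc_cfg;
	return dc_cfg | DC_CFG_DIS_CNV_WC_SSO;
}
```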

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: [PATCH 03/33] task containersv11 add tasks file interface

2007-10-03 Thread Paul Jackson

One more cgroup code review detail ...

The following is evidence of some more stale comments in
kernel/cpuset.c.  Some routines which used to be in that file, but
which are now reimplemented in cgroups, are still named in cpuset.c
comments:

$ grep -E 'cpuset_rmdir|cpuset_exit|cpuset_fork' kernel/cpuset.c
 * knows that the cpuset won't be removed, as cpuset_rmdir() needs
 * The fork and exit callbacks cpuset_fork() and cpuset_exit(), don't
 * critical pieces of code here.  The exception occurs on cpuset_exit(),
 *the_top_cpuset_hack in cpuset_exit(), which sets an exiting tasks

The downside of my writing too many comments ... it's more of a
maintenance burden on those changing the code ;).

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401

Re: [PATCH 03/33] task containersv11 add tasks file interface

2007-10-03 Thread Paul Menage

On 10/3/07, Paul Jackson <[EMAIL PROTECTED]> wrote:
>
> I can't say for sure, but I suspect that if cgroups had always
> been cgroups (short for control groups), then these local 'cont'
> variables would have a different name.

Oh, absolutely. I just refrained from changing them in the rename
since the name was sort of still relevant and it made the change
simpler.

Paul

Re: [PATCH 03/33] task containersv11 add tasks file interface

2007-10-03 Thread Paul Jackson

> >  - There are many instances of the local variable 'cont', referring
> >to a struct cgroup pointer.  I presume the spelling 'cont' is a
> >holdover from the time when we called these containers.
> 
> Yes, and since cgroup is short for "control group", "cont" still
> seemed like a reasonable abbreviation. (And made the automatic
> renaming much simpler).

The following will change all 'cont' words to your choice (I doubt
you want to use '' as I did here) in cgroup.c:

sed -i -r 's/(\W|^)cont(\W|$)/\1\2/g' kernel/cgroup.c

I can't say for sure, but I suspect that if cgroups had always
been cgroups (short for control groups), then these local 'cont'
variables would have a different name.  One can often, as in this
case, find some justification for most any name.  The question is
which name is most quickly and easily understood.

... yes ... I'm a stickler for names ... sorry.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401

[PATCH] Code style fix for open_exec

2007-10-03 Thread Casey Dahlin


From d2a6c5d29dc34cfea892124ab72b4eb55d2f8a80 Mon Sep 17 00:00:00 2001
From: Casey Dahlin <[EMAIL PROTECTED]>
Date: Wed, 3 Oct 2007 22:01:49 -0400
Subject: [PATCH] Code style fix for open_exec

Fix a horribly mangled 5 level indent and severe abuse of goto in the
open_exec function.

Signed-off-by: Casey Dahlin <[EMAIL PROTECTED]>

diff --git a/fs/exec.c b/fs/exec.c
index c21a8cc..d73da5a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -676,32 +676,32 @@ struct file *open_exec(const char *name)
 	struct file *file;
 
 	err = path_lookup_open(AT_FDCWD, name, LOOKUP_FOLLOW, &nd, FMODE_READ|FMODE_EXEC);
+
+	if (err) return ERR_PTR(err);
+
+	file = ERR_PTR(-EACCES);
+	if ((nd.mnt->mnt_flags & MNT_NOEXEC) ||
+	    !S_ISREG(nd.dentry->d_inode->i_mode))
+		goto fail;
+
+	err = vfs_permission(&nd, MAY_EXEC);
 	file = ERR_PTR(err);
+	if (err) goto fail;
 
-	if (!err) {
-		struct inode *inode = nd.dentry->d_inode;
-		file = ERR_PTR(-EACCES);
-		if (!(nd.mnt->mnt_flags & MNT_NOEXEC) &&
-		    S_ISREG(inode->i_mode)) {
-			int err = vfs_permission(&nd, MAY_EXEC);
-			file = ERR_PTR(err);
-			if (!err) {
-				file = nameidata_to_filp(&nd, O_RDONLY);
-				if (!IS_ERR(file)) {
-					err = deny_write_access(file);
-					if (err) {
-						fput(file);
-						file = ERR_PTR(err);
-					}
-				}
-out:
-				return file;
-			}
-		}
-		release_open_intent(&nd);
-		path_release(&nd);
+	file = nameidata_to_filp(&nd, O_RDONLY);
+	if (IS_ERR(file)) return file;
+
+	err = deny_write_access(file);
+	if (err) {
+		fput(file);
+		file = ERR_PTR(err);
 	}
-	goto out;
+
+	return file;
+fail:
+	release_open_intent(&nd);
+	path_release(&nd);
+	return file;
 }
 
 EXPORT_SYMBOL(open_exec);

Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

2007-10-03 Thread Fengguang Wu

On Wed, Oct 03, 2007 at 12:41:19PM +1000, David Chinner wrote:
> On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote:
> > On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> > > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > > > wbc.pages_skipped = 0;
> > > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > > > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > > > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > > > /* Wrote less than expected */
> > > > -   congestion_wait(WRITE, HZ/10);
> > > > -   if (!wbc.encountered_congestion)
> > > > +   if (wbc.encountered_congestion || wbc.more_io)
> > > > +   congestion_wait(WRITE, HZ/10);
> > > > +   else
> > > > break;
> > > > }
> > > 
> > > Why do you call congestion_wait() if there is more I/O to issue?  If
> > > we have a fast filesystem, this might cause the device queues to
> > > fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> > > will have trouble keeping the queues full, right?
> > 
> > You mean slow writers and fast RAID? That would be exactly the case
> > these patches try to improve.
> 
> I mean any writers and a fast block device (raid or otherwise).
> 
> > This patchset makes kupdate/background writeback more responsible,
> > so that if (avg-write-speed < device-capabilities), the dirty data are
> > synced timely, and we don't have to go for balance_dirty_pages().
> 
> Sure, but I'm asking about the effect of the patches on the
> (avg-write-speed == device-capabilities) case. I agree that
> they are necessary for timely syncing of data but I'm trying
> to understand what effect they have on the normal write case
> (i.e. keeping the disk at full write throughput).

OK, I guess this is the focus of all your questions: why should we sleep
in congestion_wait() and possibly hurt the write throughput? I'll try
to summarize:

- congestion_wait() is necessary
Besides device congestion, there may be other blockades we have to
wait on, e.g. temporary page locks, NFS/journal issues (I guess).

- congestion_wait() is called only when necessary
congestion_wait() will only be called when we see blockades:
        if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
                congestion_wait(WRITE, HZ/10);
        }
So in the normal case, it may well write 128MB of data without any waiting.

- congestion_wait() won't hurt write throughput
When not congested, congestion_wait() will be woken up on each write
completion. Note that MAX_WRITEBACK_PAGES=1024 and
/sys/block/sda/queue/max_sectors_kb=512 (for me),
which means we are given the chance to sync 4MB for every 512KB written,
which means we are able to submit write IOs 8 times faster than the
device capability. congestion_wait() is a magical timer :-)
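
The 4MB and 8x figures above can be checked with a few lines of
arithmetic. This sketch assumes 4 KiB pages; the helper names are made up
for the example.

```c
#define MODEL_PAGE_KB          4	/* assumes 4 KiB pages */
#define MAX_WRITEBACK_PAGES_N  1024	/* value quoted in the mail */

/* KiB that one writeback pass may submit. */
static unsigned long batch_kb(void)
{
	return (unsigned long)MAX_WRITEBACK_PAGES_N * MODEL_PAGE_KB;
}

/* How many times larger one batch is than one completed request. */
static unsigned long submit_ratio(unsigned long max_sectors_kb)
{
	return batch_kb() / max_sectors_kb;
}
```

With max_sectors_kb=512 this gives 4096 KiB per batch and a ratio of 8.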

> > So for your question of queue depth, the answer is: the queue length
> > will not build up in the first place. 
> 
> Which queue are you talking about here? The block device queue?

Yes, the elevator's queues.


Re: Serial ATA does not find partitions (Hitachi HD, new? ATI controller) where old SATA works

2007-10-03 Thread Tejun Heo

Hernan G Solari wrote:
>>>   netconsole, pretty nice debugging system... but (yes, there is always
>>> a but) it does not get to run.
>>>   the method was well implemented, adding the acpi=off it sends the
>>> information to the receiving machine (I can even see passing a
>>> netconsole probing message in the machine under testing), but without
>>> turning off acpi it does not reach the point of loading and,
>>> consequently, it does not send a byte to the receiving machine.
>>>
>>> Hence, result: empty output.
>>>
>>>
>> Hmmm.. Are netconsole support and network driver built into the kernel?
>>
> yes, they are

If you can set up a serial console, it would be better.  If not, can you
please take a photo of the crash and post it?

-- 
tejun

Re: Point of gpl-only modules (flame)

2007-10-03 Thread Robert Hancock


Jimmy wrote:
> I know I'll be getting hell for this, I must be a masochist.
>
> Anyway, I've been trying to figure out what purpose the gpl-only code
> serves.
> What good comes out of disabling people from probing modules that do not
> have a gpl-compatible license?

Who is disabling anything?

> Of course, I would love to see more hardware manufacturers release either
> full specs, or GPL'd drivers, and I'm sure it will happen, in time.
> But until then, why are people wasting time writing code to inhibit
> those who do not agree with them on licensing?
>
> It seems pretty childish to try and force some license on people,
> imagine trying to install firefox on Windows Vista, an error-dialog box
> appears:
> "This application has been denied access to the Windows API as its
> license is not compatible with the Microsoft Philosophy" ?
>
> Now, I don't want to waste clock cycles on executing code that serves no
> purpose but restraining me from using my $1500 gfx card as intended, so
> will me removing that crap from the source result in somebody trying to
> obfuscate it to a point where neither of us know what is what?
>
> Also, how about a list of PROS, explain to me what's so cool about it?

The kernel gets marked as tainted when you load proprietary modules
because with no source code available there is no way to determine what
kind of badness the code may have done to break the kernel. Bug reports
from tainted kernels are generally given fairly little weight.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: [Patch / 002](memory hotplug) Callback function to create kmem_cache_node.

2007-10-03 Thread Yasunori Goto

> On Wed, 3 Oct 2007, Yasunori Goto wrote:
> 
> > > 
> > > That would work. But it would be better to shrink the cache first. The 
> > > first 2 slabs on a node may be empty and the shrinking will remove those. 
> > > If you do not shrink then the code may falsely assume that there are 
> > > objects on the node.
> > 
> > I'm sorry, but I don't think I understand what you mean... :-(
> > Could you explain more? 
> > 
> > Which slabs should be shrunk? kmem_cache_node and kmem_cache_cpu?
> 
> The slab for which you are trying to set the kmem_cache_node pointer to 
> NULL needs to be shrunk.
>  
> > I think kmem_cache_cpu should be disabled by cpu hotplug,
> > not memory/node hotplug. Basically, cpu should be offlined before
> > memory offline on the node.
> 
> Hmmm.. Ok for cpu hotplug you could simply disregard the per cpu 
> structure if the per cpu slab was flushed first.
> 
> However, the per node structure may hold slabs with no objects even after 
> all objects were removed on a node. These need to be flushed by calling
> kmem_cache_shrink() on the slab cache.
> 
> On the other hand: If you can guarantee that they will not be used and 
> that no objects are in them and that you can recover the pages used in 
> different ways then zapping the per node pointer like that is okay.

Thanks for your advice. I'll reconsider and fix my patches.

Bye.

-- 
Yasunori Goto 



Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi

2007-10-03 Thread Fengguang Wu

On Wed, Oct 03, 2007 at 01:46:52PM +0100, richard kennedy wrote:
> On Tue, 2007-10-02 at 10:00 +0800, Fengguang Wu wrote:
> > ---
> >  mm/page-writeback.c |5 +
> >  1 file changed, 5 insertions(+)
> > 
> > --- linux-2.6.22.orig/mm/page-writeback.c
> > +++ linux-2.6.22/mm/page-writeback.c
> > @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a
> > pages_written += write_chunk - wbc.nr_to_write;
> > if (pages_written >= write_chunk)
> > break;  /* We've done our duty */
> > +   if (list_empty(&mapping->host->i_sb->s_dirty) &&
> > +   list_empty(&mapping->host->i_sb->s_io) &&
> > +   nr_reclaimable + global_page_state(NR_WRITEBACK) <=
> > +   dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT)))
> > +   break;
> > }
> > congestion_wait(WRITE, HZ/10);
> > }
> 
> I've been testing 2.6.23-rc9 + this patch all morning but have just seen
> a lockup. As usual it happened just after a large file copy finished
> and while nr_dirty is still large. I'm sorry to say I didn't have a
> serial console running so I don't have any other info. I will try again
> and see if I can capture some more data.
> 
> I did notice that at the beginning of my tests the dirty blocks are
> written back more quickly than usual
> 
> nr_dirty count after the copy finished and then 60 seconds later :-
> after copy   +60 seconds
> 73520        0
> 73533        0
> 68554        1
>  
> but after several iterations of my testcase & just before the lockup
> 68560        57165
> 71974        62896
>  
> which is about the same as an unpatched kernel.

Hi Richard,

Thank you for the testing.

However, my patch is kind of a duplicated effort. I was taking the
'do it if simple' attitude.  I can continue to improve it if you
really want it. Otherwise I'd recommend that you test the coming
2.6.24-rc1 or backport the -mm writeback patches to 2.6.23 and
test it there. Peter has done a good job on it.

Fengguang


Re: + add-documentation-w1w1-masters-00-index.patch added to -mm tree

2007-10-03 Thread Rob Landley

On Wednesday 03 October 2007 4:38:49 pm Randy Dunlap wrote:
> On Wed, 03 Oct 2007 14:17:33 -0700 [EMAIL PROTECTED] wrote:
> > The patch titled
> >  Add Documentation/{w1,w1/masters}/00-INDEX
> > has been added to the -mm tree.  Its filename is
> >  add-documentation-w1w1-masters-00-index.patch
> >
> > *** Remember to use Documentation/SubmitChecklist when testing your code
> > ***
> >
> > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to
> > find out what to do about this
> >
> > --
> > Subject: Add Documentation/{w1,w1/masters}/00-INDEX
> > From: Rob Landley <[EMAIL PROTECTED]>
> >
> > Two 00-INDEX files under Documentation/w1
> >
> > Signed-off-by: Rob Landley <[EMAIL PROTECTED]>
> > Cc: Evgeniy Polyakov <[EMAIL PROTECTED]>
> > Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> > ---
> >
> >
> > diff -puN /dev/null Documentation/w1/00-INDEX
> > --- /dev/null
> > +++ a/Documentation/w1/00-INDEX
> > @@ -0,0 +1,8 @@
> > +00-INDEX
> > +   - This file
> > +masters/
> > +   - Individual chips providing 1-wire busses.
> > +w1.generic
> > +   - The 1-wire (w1) bus
> > +w1.netlink
> > +   - Userspace communication protocol over connector [1].
> > diff -puN /dev/null Documentation/w1/masters/00-INDEX
> > --- /dev/null
> > +++ a/Documentation/w1/masters/00-INDEX
> > @@ -0,0 +1,6 @@
> > +00-INDEX
> > +   - This file
> > +ds2482
> > +   - The Maixm/Dallas Semiconductor DS2482 provides 1-wire busses.
> > +ds2490
> > +   - The Maixm/Dallas Semiconductor DS2490 builds USB <-> W1 bridges.
>
>   Maxim (2 times)

That typo was cut and pasted from the "Description" section of both files.
(Lines 18 and 13, respectively.)  :(

Attached is an updated version that spells it "Maxim" and also fixes the typos
in the source files, if that helps...

>
> Was this patch posted to a mailing list?  if so, which one?
> I didn't see it.

LKML on Saturday.
http://lkml.org/lkml/2007/9/29/168

My pending patches are all at http://landley.net/kdocs/make/patches although 
I'm waiting for the current batch to work through before posting more.

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.
From: Rob Landley <[EMAIL PROTECTED]>

Two 00-INDEX files under Documentation/w1 plus typo fixes.

Signed-off-by: Rob Landley <[EMAIL PROTECTED]>
---

 Documentation/w1/masters/ds2482 |2 +-
 Documentation/w1/masters/ds2490 |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff -r de183081194a Documentation/w1/masters/ds2482
--- a/Documentation/w1/masters/ds2482	Tue Oct 02 18:00:56 2007 +
+++ b/Documentation/w1/masters/ds2482	Wed Oct 03 20:28:05 2007 -0500
@@ -15,7 +15,7 @@ Description
 Description
 ---
 
-The Maixm/Dallas Semiconductor DS2482 is a I2C device that provides
+The Maxim/Dallas Semiconductor DS2482 is a I2C device that provides
 one (DS2482-100) or eight (DS2482-800) 1-wire busses.
 
 
diff -r de183081194a Documentation/w1/masters/ds2490
--- a/Documentation/w1/masters/ds2490	Tue Oct 02 18:00:56 2007 +
+++ b/Documentation/w1/masters/ds2490	Wed Oct 03 20:28:05 2007 -0500
@@ -10,7 +10,7 @@ Description
 Description
 ---
 
-The Maixm/Dallas Semiconductor DS2490 is a chip
+The Maxim/Dallas Semiconductor DS2490 is a chip
 which allows to build USB <-> W1 bridges.
 
 DS9490(R) is a USB <-> W1 bus master device
--- /dev/null	2007-04-23 10:59:00.0 -0500
+++ hg/Documentation/w1/00-INDEX	2007-10-03 20:26:38.0 -0500
@@ -0,0 +1,8 @@
+00-INDEX
+	- This file
+masters/
+	- Individual chips providing 1-wire busses.
+w1.generic
+	- The 1-wire (w1) bus
+w1.netlink
+	- Userspace communication protocol over connector [1].
--- /dev/null	2007-04-23 10:59:00.0 -0500
+++ hg/Documentation/w1/masters/00-INDEX	2007-10-03 20:26:55.0 -0500
@@ -0,0 +1,6 @@
+00-INDEX
+	- This file
+ds2482
+	- The Maxim/Dallas Semiconductor DS2482 provides 1-wire busses.
+ds2490
+	- The Maxim/Dallas Semiconductor DS2490 builds USB <-> W1 bridges.

Re: File corruption when using kernels 2.6.18+

2007-10-03 Thread Hiro Yoshioka

Hi,

From: Linus Torvalds <[EMAIL PROTECTED]>
> On Wed, 3 Oct 2007, Pekka Enberg wrote:
> > 
> > On 10/3/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > > I would bet that the reason the intel-optimized memcpy triggers this is
> > > that the non-temporal stores just means that you go out directly on the
> > > bus, and it probably just shows a weakness in the chipset or bus that
> > > doesn't show with the normal cacheline accesses.
> > 
> > But that should show up with memtest too, no?
> 
> Not unless memtest uses non-temporal stores with the same (or similar) 
> access patterns.
> 
> The thing is, the CPU cache hides a *lot* of activity from the chipset, 
> and changes the access patterns radically. 
> 
> With normal cached accesses, you'd normally see just the "fill cacheline" 
> and "write out cacheline" pattern. With movnt, you'd see non-cacheline 
> accesses to memory. If the chipset was tested under mostly normal loads, 
> the movnt cases have been getting a lot less coverage.

I'm not so sure whether it is a chipset bug or not.

The movnt instructions do have WC (write combining) semantics and
bypass the hardware cache when storing the data.

http://www.intel.com/products/processor/manuals/index.htm

Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 1: Basic Architecture

Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3A: System Programming Guide
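
To make the difference in access pattern concrete, here is a minimal
userspace sketch that fills a buffer once with non-temporal stores and once
with ordinary cached stores. It is not the kernel's intel-optimized memcpy
from this thread; it assumes an x86 CPU with SSE2 and a 16-byte-aligned
buffer, and the function names are made up for illustration.

```c
#include <emmintrin.h>	/* SSE2: _mm_stream_si128(), _mm_set1_epi32() */
#include <xmmintrin.h>	/* _mm_sfence() */
#include <stddef.h>
#include <stdint.h>

/*
 * Fill dst with a 32-bit pattern using non-temporal (movntdq) stores.
 * dst must be 16-byte aligned and nbytes a multiple of 16.
 */
static void fill_nontemporal(void *dst, uint32_t pattern, size_t nbytes)
{
	__m128i v = _mm_set1_epi32((int)pattern);
	__m128i *p = (__m128i *)dst;
	size_t i;

	for (i = 0; i < nbytes / 16; i++)
		_mm_stream_si128(p + i, v);	/* goes through the WC buffers, not the cache */

	/* Order the write-combined stores before anything that follows. */
	_mm_sfence();
}

/* The same fill through the normal cached path, for comparison. */
static void fill_cached(uint32_t *dst, uint32_t pattern, size_t nbytes)
{
	size_t i;

	for (i = 0; i < nbytes / sizeof(*dst); i++)
		dst[i] = pattern;
}
```

Both functions leave the same bytes in memory; only the traffic the chipset
sees on the way there differs.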

Thanks in advance,
  Hiro

Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

2007-10-03 Thread Loic Prylli

On 10/3/2007 5:49 PM, Eric W. Biederman wrote:
> Loic Prylli <[EMAIL PROTECTED]> writes:
>
>   
>> Hi,
>>
>> We observe a problem with MSI since kernel 2.6.21 where interrupts would
>> randomly stop working. We have tracked it down to the new
>> msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not
>> providing a "native" MSI mask, it was a no-op before, and now it
>> disables MSI in the MSI-ctl register which according to the PCI spec is
>> interpreted as reverting the device to legacy interrupts. If such a
>> device tries to generate a new interrupt during the "masked" window, the
>> device will try a legacy interrupt which is generally
>> ignored/never-acked and causes interrupts to no longer work for the
>> device/driver combination (even after the enable bit is restored).
>> 
>
> We should also be leaving the INTx irqs disabled.  So no irq
> should be generated.
>   

Even if the INTx line is not raised, you cannot rely on the device to
retain memory of an interrupt triggered while MSI is disabled, and
expect it to fire it in MSI form later when MSI is reenabled.  The
PCI spec does not provide any implicit or explicit guarantee about the
MSI enable flag that would allow it to be used for temporary masking
without running the risk of losing such interrupts. Moreover, even if
you eventually call the interrupt handler to recover a lost interrupt,
having switched the device to INTx mode (whether or not the INTx line
was forced down with the corresponding pci-command bit) without
informing the driver can (and will in our case) break interrupt
handshaking because MSI and INTx interrupts are not acked in the same
way (INTx requires an extra step that we don't do for MSI and that the
device will still expect unless going through driver init again).

> If you have a mask bit implemented you are required to be
> able to refire it after the msi is enabled. 

Indeed the masking case is well-defined by the spec (including the
operation of the pending bits). And my subject was definitely restricted
to devices without that masking capability.

>  I don't recall
> the requirements for when both intx and msi irqs are both
> disabled.  Intuitively I would expect no irq message to
> be generated, and at most the card would need to be polled
> manually to recognize a device event happened.
>
> Certainly firing an irq and having it get completely lost is
> unfortunate, and a major pain if you are trying to use the
> card.
>
> As for the previous no-op behavior that was a bug.
>   

OK, the no-op was a bug, but using the enable bit for temporary masking
purposes still feels like a bug. I am afraid the only safe solution
might be to prohibit any operation that absolutely requires masking if
real masking is not available. Maybe the set_affinity method should
simply be disabled for devices not supporting masking (unless there is an
option of doing it without masking, for instance by guaranteeing only one
word of the MSI capability is changed).
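
A minimal sketch of that policy, as a pure function of the 16-bit MSI
Message Control word. The bit positions follow the MSI capability layout in
the PCI spec; the macro and function names are local to this sketch, and the
"refuse set_affinity" rule is only the suggestion from the paragraph above,
not what any kernel actually does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of the MSI Message Control register (names local to this sketch). */
#define MSI_CTL_ENABLE   0x0001	/* MSI enable */
#define MSI_CTL_64BIT    0x0080	/* 64-bit address capable */
#define MSI_CTL_MASKBIT  0x0100	/* per-vector masking capable */

/* True if the device implements the optional mask/pending registers. */
static bool msi_has_native_mask(uint16_t msgctl)
{
	return (msgctl & MSI_CTL_MASKBIT) != 0;
}

/*
 * Hypothetical policy: only allow operations that need masking (such as
 * retargeting the message) when real masking exists.  Clearing
 * MSI_CTL_ENABLE is deliberately not treated as a substitute.
 */
static bool msi_affinity_allowed(uint16_t msgctl)
{
	return msi_has_native_mask(msgctl);
}
```

For example, a control word of 0x0181 (enabled, 64-bit, maskable) would be
allowed, while 0x0081 (enabled, 64-bit, no mask bits) would not.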

>   
>> Is there anything apart from irq migration that strongly requires
>> masking? Is it possible to do the irq migration without masking?
>> 
>
> enable_irq/disable_irq.  Although we can get away with a software
> emulation there and those are only needed if the driver calls them.
>   

I don't think there is a problem here, no sane driver would depend on
receiving edge interrupts triggered while irqs were explicitly disabled.

> The PCI spec requires disabling/masking the msi when reprogramming it.
> So as a general rule we can not do better. 

Do you have a reference for that requirement? The spec only vaguely
associates MSI programming with "configuration", but I haven't found any
explicit indication that it should not work.

>  Further because we are
> writing to multiple pci config registers the only way we can safely
> reprogram the message is with the msi disabled/masked on the card in
> some fashion.
>   

That's indeed a show-stopper.

> I suspect what needs to happen is a spec search to verify that the
> current linux behavior is at least reasonable within the spec.
>   

I don't see how you can disable MSI through the control bit (which is
equivalent to switching the device to INTx whether or not the INTx
disable bit is set in PCI_COMMAND) in the middle of operations, not tell
the driver, and not risk losing interrupts (unless you rely on much
more than the spec).

> I don't want to break anyones hardware, but at the same time I want us
> to be careful and in spec for the default case.
>
>   

An interrupt during the set_affinity masking would certainly cause a
problem for the device we use (MSI-enable switches between INTx and MSI
mode, and the two kinds of interrupt are not acked the same way, assuming
they would even be delivered to the driver), but I got some new data: upon
further examination, the lost interrupts we have seen seem in fact to be
caused at a different time:
- the problem is the  mask_ack_irq() done in

Re: [patch take 2][Intel-IOMMU] Fix for IOMMU early crash

2007-10-03 Thread Keshavamurthy, Anil S

On Thu, Oct 04, 2007 at 11:19:33AM +1000, Benjamin Herrenschmidt wrote:
> > Index: 2.6-mm/include/linux/pci.h
> > ===
> > --- 2.6-mm.orig/include/linux/pci.h 2007-10-03 13:48:20.0 -0700
> > +++ 2.6-mm/include/linux/pci.h  2007-10-03 13:49:08.0 -0700
> > @@ -195,6 +195,7 @@
> >  #ifdef CONFIG_PCI_MSI
> > struct list_head msi_list;
> >  #endif
> > +   void*iommu_private; /* hook for IOMMU specific extension */
> >  };
> 
> I'm not a fan of this. That would imply that iommu stuff is specific to
> PCI or that sort of thing.
> 
> Why don't you use the new struct dev_archdata mechanism ? That's what I
> use on powerpc to provide optional iommu linkage to any device in the
> system.
Good one. I will certainly try out your idea and will update the list
tomorrow.

-Anil

Re: [patch take 2][Intel-IOMMU] Fix for IOMMU early crash

2007-10-03 Thread Benjamin Herrenschmidt

> Index: 2.6-mm/include/linux/pci.h
> ===
> --- 2.6-mm.orig/include/linux/pci.h   2007-10-03 13:48:20.0 -0700
> +++ 2.6-mm/include/linux/pci.h2007-10-03 13:49:08.0 -0700
> @@ -195,6 +195,7 @@
>  #ifdef CONFIG_PCI_MSI
>   struct list_head msi_list;
>  #endif
> + void*iommu_private; /* hook for IOMMU specific extension */
>  };

I'm not a fan of this. That would imply that iommu stuff is specific to
PCI or that sort of thing.

Why don't you use the new struct dev_archdata mechanism ? That's what I
use on powerpc to provide optional iommu linkage to any device in the
system.

Ben.



Re: [PATCH] Version 4 (2.6.23-rc8-mm2) Smack: Simplified Mandatory Access Control Kernel

2007-10-03 Thread Casey Schaufler


--- Al Viro <[EMAIL PROTECTED]> wrote:

> On Wed, Oct 03, 2007 at 03:23:15PM -0700, Casey Schaufler wrote:
> > 1. Create /moldy at "_"
> > 2. For each label you care about
> >2a. Create /moldy/
> >2b. Set the label of /moldy/ to 
> > 3. ln -s /smack/tmp /tmp
> 
> > 1. Create /moldy at "_"
> > 2. For each label you care about
> >2a. Create /moldy/
> >2b. Set the label of /moldy/ to 
> > 3. ln -s /smack/tmp.link /tmp
>   4. mount --bind /moldy /smack/tmp
> or add
> /moldy /smack/tmp none bind,rw 0 2
> to /etc/fstab (same effect as (4))
> 
> Compare with your variant; the difference is in one argument of ln(1) and
> one additional line in rc script or /etc/fstab.  Sorry, but I don't buy
> the "extra setup complexity" argument at all.

What I'm confused about is how that results in a process labeled "foo"
getting a different /tmp from a process labeled "bar".

I guess I'll have to review your first post.

> > It's the content of a symlink, and that can be just about anything
> > and is not required to point to anything, which is one reason why
> > I made that choice. If you don't have a /tmp, or can't write to the
> > /tmp that exists, or have a /tmp that's a dangling symlink under
> > any circumstances you may have an issue. That's true regardless of
> > the presence or absence of /smack. All of the traditional mechanisms
> > for dealing with /tmp in a chrooted or namespaced environment remain.
> 
> It's not about symlink pointing to /smack/; it's about the place
> where /smack/ itself points to.  And _that_ can bloody well be
> different in different chroots.

Which is completely OK with me.

> Look, if you allow to change where it goes, you certainly allow different
> prefices on different boxen; moreover, admin can change it freely according
> to his layout on given box.  OTOH, you _can't_ have it different in different
> chroots and changing it in one will affect all of them.  See why that's a
> problem?

Yes, I can see that could be an unexpected behavior.

> > It's in a symlink on the filesystem, and it doesn't have to be an
> > absolute pathname, although since it's a symlink and the semantics
> > for a symlink allow that to be absolute, relative, or dangling I
> > don't see any reason to restrict it from being absolute.
> 
> Fixed-contents symlink (with or without variable tail - it's irrelevant
> here) is a bloody wrong tool for that kind of fs for the reasons described
> above.

I do not understand where the concept of a fixed-contents symlink
comes from. Yes, "tmp" is initialized to "/moldy/", but that can
be changed by writing to /smack/links. Please help me understand
what you mean by fixed-contents symlinks.

> And if you go for "prefix should point to location on the same fs"
> you can trivially configure the rest in userland (one line describing a
> binding), leaving the kernel-side stuff with something like "userland
> can ask for a pair of symlink and directory, having symlink resolve
> to directory + " instead of your "userland can ask for a symlink
> resolving to  + ".  And _that_ is chroot-neutral - you don't
> need to do any extra work...
>  
> > Could allowing multiple distinct mounts and symlink assignments
> > of /smackfs address those issues?
> 
> ... like that one.  Leave it to normal userland mechanisms; it's a matter
> of a single line in whatever script you are using to set chroot up and it
> involves _way_ fewer caveats.
> 
> That said, Alan's point still stands - if you don't get processes changing
> context back and forth, you don't need anything at all - we already have
> all we need for that kind of setups (and no, selinux is not involved ;-).


Casey Schaufler
[EMAIL PROTECTED]

RE: Network slowdown due to CFS

2007-10-03 Thread Rusty Russell

On Mon, 2007-10-01 at 09:49 -0700, David Schwartz wrote:
> > * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> >
> > > BTW, it looks like risky to criticise sched_yield too much: some
> > > people can misinterpret such discussions and stop using this at all,
> > > even where it's right.
> 
> > Really, i have never seen a _single_ mainstream app where the use of
> > sched_yield() was the right choice.
> 
> It can occasionally be an optimization. You may have a case where you can do
> something very efficiently if a lock is not held, but you cannot afford to
> wait for the lock to be released. So you check the lock, if it's held, you
> yield and then check again. If that fails, you do it the less optimal way
> (for example, dispatching it to a thread that *can* afford to wait).

This used to be true, and still is if you want to be portable.  But the
point of futexes was precisely to attack this use case: whereas
sched_yield() says "I'm waiting for something, but I won't tell you
what" the futex ops tell the kernel what you're waiting for.

While a futex op is slightly slower than sched_yield(),
futexes win in so many cases that we haven't found a benchmark where
yield wins.  Yield-lose cases include:
1) There are other unrelated processes that yield() ends up queueing
   behind.
2) The process you're waiting for doesn't conveniently sleep as soon as
   it releases the lock, so you wait for longer than intended.
3) You race between the yield and the lock being dropped.

In summary: spin N times & futex seems optimal.  The value of N depends
on the number of CPUs in the machine and other factors, but N=1 has
shown itself pretty reasonable.
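
A minimal sketch of "spin N times & futex", assuming Linux, C11 atomics and
the raw futex syscall. The three-state lock word is the usual textbook futex
mutex, not something taken from this thread, and the type and function names
are invented for the example; SPIN_TRIES is the N from the text.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_TRIES 1	/* "N": how often to retry before sleeping */

/* state: 0 = unlocked, 1 = locked, 2 = locked and someone may be asleep */
typedef struct {
	atomic_int state;
} spin_futex_lock;

static long futex(atomic_int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void sfl_lock(spin_futex_lock *l)
{
	int i, c;

	/* Phase 1: try N times in userspace, no syscall if we get it. */
	for (i = 0; i < SPIN_TRIES; i++) {
		c = 0;
		if (atomic_compare_exchange_strong(&l->state, &c, 1))
			return;
	}

	/* Phase 2: mark the lock contended and tell the kernel what we wait for. */
	c = atomic_exchange(&l->state, 2);
	while (c != 0) {
		/* Sleeps only if state is still 2, so a racing unlock is not missed. */
		futex(&l->state, FUTEX_WAIT, 2);
		c = atomic_exchange(&l->state, 2);
	}
}

static void sfl_unlock(spin_futex_lock *l)
{
	/* Enter the kernel only if a waiter may be sleeping. */
	if (atomic_exchange(&l->state, 0) == 2)
		futex(&l->state, FUTEX_WAKE, 1);
}
```

Unlike the yield loop, a waiter here sleeps on the lock word itself, so it
is woken when that lock is released rather than whenever the scheduler gets
around to it, and the check in FUTEX_WAIT closes the race from case 3.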

Hope that helps,
Rusty.


Re: [PATCH 5/5] lguest: loading bzImage directly

2007-10-03 Thread H. Peter Anvin


Rusty Russell wrote:
> On Wed, 2007-10-03 at 10:37 +0100, Chris Malley wrote:
> > Hi guys
> > 
> > Would it not be clearer to #include  and use 
> > the relevant named members of struct setup_header / struct boot_params
> > rather than the hard-coded values 0x202, 0x1F1, 0x214 ?
> 
> Yes, but unfortunately bootparam.h wasn't designed to be included from
> userspace.
> 
> The patch would look like this, but it makes me wonder if it'd be better
> to put all these user-exposed types in bootparam.h and have the other
> headers include them.  hpa?

I don't have a strong preference either way, but I think what you have 
here is fine.


-hpa


Re: [PATCH 5/5] lguest: loading bzImage directly

2007-10-03 Thread Rusty Russell

On Wed, 2007-10-03 at 10:37 +0100, Chris Malley wrote:
> Hi guys
> 
> Would it not be clearer to #include  and use 
> the relevant named members of struct setup_header / struct boot_params
> rather than the hard-coded values 0x202, 0x1F1, 0x214 ?

Yes, but unfortunately bootparam.h wasn't designed to be included from
userspace.

The patch would look like this, but it makes me wonder if it'd be better
to put all these user-exposed types in bootparam.h and have the other
headers include them.  hpa?

diff -r 6bb527d113a8 include/asm-i386/Kbuild
--- a/include/asm-i386/Kbuild   Wed Oct 03 13:49:31 2007 +1000
+++ b/include/asm-i386/Kbuild   Thu Oct 04 09:53:08 2007 +1000
@@ -6,7 +6,10 @@ header-y += msr-index.h
 header-y += msr-index.h
 header-y += ptrace-abi.h
 header-y += ucontext.h
+header-y += bootparam.h
 
+unifdef-y += e820.h
+unifdef-y += ist.h
 unifdef-y += msr.h
 unifdef-y += mtrr.h
 unifdef-y += vm86.h
diff -r 6bb527d113a8 include/asm-i386/bootparam.h
--- a/include/asm-i386/bootparam.h  Wed Oct 03 13:49:31 2007 +1000
+++ b/include/asm-i386/bootparam.h  Thu Oct 04 09:45:12 2007 +1000
@@ -10,82 +10,82 @@
 #include 
 
 struct setup_header {
-   u8  setup_sects;
-   u16 root_flags;
-   u32 syssize;
-   u16 ram_size;
-   u16 vid_mode;
-   u16 root_dev;
-   u16 boot_flag;
-   u16 jump;
-   u32 header;
-   u16 version;
-   u32 realmode_swtch;
-   u16 start_sys;
-   u16 kernel_version;
-   u8  type_of_loader;
-   u8  loadflags;
+   __u8setup_sects;
+   __u16   root_flags;
+   __u32   syssize;
+   __u16   ram_size;
+   __u16   vid_mode;
+   __u16   root_dev;
+   __u16   boot_flag;
+   __u16   jump;
+   __u32   header;
+   __u16   version;
+   __u32   realmode_swtch;
+   __u16   start_sys;
+   __u16   kernel_version;
+   __u8type_of_loader;
+   __u8loadflags;
 #define LOADED_HIGH(1<<0)
 #define KEEP_SEGMENTS  (1<<6)
 #define CAN_USE_HEAP   (1<<7)
-   u16 setup_move_size;
-   u32 code32_start;
-   u32 ramdisk_image;
-   u32 ramdisk_size;
-   u32 bootsect_kludge;
-   u16 heap_end_ptr;
-   u16 _pad1;
-   u32 cmd_line_ptr;
-   u32 initrd_addr_max;
-   u32 kernel_alignment;
-   u8  relocatable_kernel;
-   u8  _pad2[3];
-   u32 cmdline_size;
-   u32 hardware_subarch;
-   u64 hardware_subarch_data;
+   __u16   setup_move_size;
+   __u32   code32_start;
+   __u32   ramdisk_image;
+   __u32   ramdisk_size;
+   __u32   bootsect_kludge;
+   __u16   heap_end_ptr;
+   __u16   _pad1;
+   __u32   cmd_line_ptr;
+   __u32   initrd_addr_max;
+   __u32   kernel_alignment;
+   __u8relocatable_kernel;
+   __u8_pad2[3];
+   __u32   cmdline_size;
+   __u32   hardware_subarch;
+   __u64   hardware_subarch_data;
 } __attribute__((packed));
 
 struct sys_desc_table {
-   u16 length;
-   u8  table[14];
+   __u16 length;
+   __u8  table[14];
 };
 
 struct efi_info {
-   u32 _pad1;
-   u32 efi_systab;
-   u32 efi_memdesc_size;
-   u32 efi_memdesc_version;
-   u32 efi_memmap;
-   u32 efi_memmap_size;
-   u32 _pad2[2];
+   __u32 _pad1;
+   __u32 efi_systab;
+   __u32 efi_memdesc_size;
+   __u32 efi_memdesc_version;
+   __u32 efi_memmap;
+   __u32 efi_memmap_size;
+   __u32 _pad2[2];
 };
 
 /* The so-called "zeropage" */
 struct boot_params {
struct screen_info screen_info; /* 0x000 */
struct apm_bios_info apm_bios_info; /* 0x040 */
-   u8  _pad2[12];  /* 0x054 */
+   __u8  _pad2[12];/* 0x054 */
struct ist_info ist_info;   /* 0x060 */
-   u8  _pad3[16];  /* 0x070 */
-   u8  hd0_info[16];   /* obsolete! */ /* 0x080 */
-   u8  hd1_info[16];   /* obsolete! */ /* 0x090 */
+   __u8  _pad3[16];/* 0x070 */
+   __u8  hd0_info[16]; /* obsolete! */ /* 0x080 */
+   __u8  hd1_info[16]; /* obsolete! */ /* 0x090 */
struct sys_desc_table sys_desc_table;   /* 0x0a0 */
-   u8  _pad4[144]; /* 0x0b0 */
+   __u8  _pad4[144];   /* 0x0b0 */
struct edid_info edid_info; /* 0x140 */
struct efi_info efi_info;   /* 0x1c0 */
-   u32 alt_mem_k;  /* 0x1e0 */
-   u32 scratch;/* Scratch field! *//* 0x1e4 */
-   u8  e820_entries;   /* 0x1e8 */
-   u8  eddbuf_entries; /* 0x1e9 */
-   u8
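
As a rough illustration of the named-member approach, the sketch below
copies the first fields of struct setup_header from the patch above into a
userspace struct (with uint*_t standing in for the kernel's __u* types) and
derives the three hard-coded offsets with offsetof(). It assumes, per the
x86 boot protocol, that the header starts at 0x1f1; the struct name and the
HDR_OFF() macro are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Leading fields of struct setup_header, in the order shown in the patch. */
struct setup_header_prefix {
	uint8_t  setup_sects;
	uint16_t root_flags;
	uint32_t syssize;
	uint16_t ram_size;
	uint16_t vid_mode;
	uint16_t root_dev;
	uint16_t boot_flag;
	uint16_t jump;
	uint32_t header;
	uint16_t version;
	uint32_t realmode_swtch;
	uint16_t start_sys;
	uint16_t kernel_version;
	uint8_t  type_of_loader;
	uint8_t  loadflags;
	uint16_t setup_move_size;
	uint32_t code32_start;
} __attribute__((packed));

/* Assumption: setup_header begins at this offset in the boot sector. */
#define SETUP_HEADER_BASE 0x1f1

/* Offset of a named field, counted from the start of the boot sector. */
#define HDR_OFF(field) \
	(SETUP_HEADER_BASE + offsetof(struct setup_header_prefix, field))
```

With the packed layout, HDR_OFF(setup_sects), HDR_OFF(header) and
HDR_OFF(code32_start) come out as 0x1f1, 0x202 and 0x214, i.e. the three
magic numbers Chris asked about.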

Re: What's slated for inclusion in 2.6.24-rc1 from the NFS client git tree...

2007-10-03 Thread Jeff Garzik


Trond Myklebust wrote:
> Aside from the usual updates from Chuck for NFS-over-IPv6 (still
> incomplete) and a number of bugfixes for the text-based mount code, the
> main news in the NFS tree is the merging of support for the NFS/RDMA
> client code from Tom Talpey and the NetApp New England (NANE) team.
> 
> We also have the 64-bit inode support from RedHat/Peter Staubach.

The marketroids compel me to say:  It is Red Hat, not RedHat  :)

Jeff, looking forward to NFSv4 over IPv6



Re: [patch] fix the softlockup watchdog to actually work

2007-10-03 Thread Yinghai Lu

On 7/17/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> * Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
>
> > Ingo Molnar wrote:
> > > Subject: softlockup: fix Xen bogosity
> > > From: Ingo Molnar <[EMAIL PROTECTED]>
> > >
> > > this Xen related commit:
> > >
> >
> > Well, not just Xen.  It relates to any virtual environment: kvm,
> > lguest, vmi, xen...  (Not that they all implement a measure of
> > unstolen time.)
> >
> > How about a more descriptive patch title, along the lines of
> > "softlockup watchdog: fix rate limiting"?
>
> uhm, the problem was that it did not work _at all_, not something about
> 'rate limiting'. Yes, i got quite a bit grumpy when i found this,
> because you completely broke the softlockup watchdog via a pretty
> intrusive commit and you apparently didnt even do a minimal check
> whether its functionality was preserved! Updated patch for Andrew/Linus
> and for -stable attached.
>
> Ingo
>
> ->
> Subject: fix the softlockup watchdog to actually work
> From: Ingo Molnar <[EMAIL PROTECTED]>
>
> this Xen related commit:
>
>commit 966812dc98e6a7fcdf759cbfa0efab77500a8868
>Author: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
>Date:   Tue May 8 00:28:02 2007 -0700
>
>Ignore stolen time in the softlockup watchdog
>
> broke the softlockup watchdog to never report any lockups. (!)
>
> print_timestamp defaults to 0, this makes the following condition
> always true:
>
> if (print_timestamp < (touch_timestamp + 1) ||
>
> and we'll in essence never report soft lockups.
>
> apparently the functionality of the soft lockup watchdog was never
> actually tested with that patch applied ...
>
> [this is -stable material too.]
>
> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
> ---
>  kernel/softlockup.c |7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> Index: linux/kernel/softlockup.c
> ===
> --- linux.orig/kernel/softlockup.c
> +++ linux/kernel/softlockup.c
> @@ -79,10 +79,11 @@ void softlockup_tick(void)
> print_timestamp = per_cpu(print_timestamp, this_cpu);
>
> /* report at most once a second */
> -   if (print_timestamp < (touch_timestamp + 1) ||
> -   did_panic ||
> -   !per_cpu(watchdog_task, this_cpu))
> +   if ((print_timestamp >= touch_timestamp &&
> +   print_timestamp < (touch_timestamp + 1)) ||
> +   did_panic || !per_cpu(watchdog_task, this_cpu)) {
> return;
> +   }
>
> /* do not print during early bootup: */
> if (unlikely(system_state != SYSTEM_RUNNING)) {
> -

how about

diff --git a/kernel/softlockup.c b/kernel/softlockup.c
index 708d488..bbc0292 100644
--- a/kernel/softlockup.c
+++ b/kernel/softlockup.c
@@ -80,7 +80,7 @@ void softlockup_tick(void)
print_timestamp = per_cpu(print_timestamp, this_cpu);

/* report at most once a second */
-   if (print_timestamp < (touch_timestamp + 1) ||
+   if (((touch_timestamp - print_timestamp) < 1) ||
did_panic ||
!per_cpu(watchdog_task, this_cpu))


YH

RE: 2.6.23-rc9 boot failure (megaraid?)

2007-10-03 Thread FUJITA Tomonori

On Wed, 3 Oct 2007 17:32:55 -0600
"Patro, Sumant" <[EMAIL PROTECTED]> wrote:

>  
> 
> > -Original Message-
> > From: FUJITA Tomonori [mailto:[EMAIL PROTECTED] 
> > Sent: Tuesday, October 02, 2007 5:01 PM
> > To: [EMAIL PROTECTED]
> > Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> > linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; 
> > [EMAIL PROTECTED]; Patro, Sumant; DL-MegaRAID 
> > Linux; [EMAIL PROTECTED]
> > Subject: Re: 2.6.23-rc9 boot failure (megaraid?)
> > 
> > On Tue, 02 Oct 2007 15:38:13 -0500
> > James Bottomley <[EMAIL PROTECTED]> wrote:
> > 
> > > On Tue, 2007-10-02 at 20:15 +0200, Adrian Bunk wrote:
> > > > Cc's added, the complete bug report is at
> > > >   http://lkml.org/lkml/2007/10/2/243
> > > > 
> > > > On Tue, Oct 02, 2007 at 12:48:26PM -0400, Burton Windle wrote:
> > > > > 2.6.23-rc9 fails to boot for me; 2.6.22.9 works fine.
> > > > >
> > > > > System is a Dell Poweredge with PERC 2/DC with RAID1 volume.
> > > > >...
> > > > 
> > > > Thanks for your report.
> > > > 
> > > > Diff'ing the dmesg's shows:
> > > > 
> > > > <--  snip  -->
> > > > 
> > > >  scsi0: scanning scsi channel 4 [P0] for physical devices.
> > > >  scsi0: scanning scsi channel 5 [P1] for physical devices.
> > > >  st: Version 20070203, fixed bufsize 32768, s/g segs 256 -sd 
> > > > 0:0:0:0: [sda] 17547264 512-byte hardware sectors (8984 MB)
> > > > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512.
> > > > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB)
> > > >  sd 0:0:0:0: [sda] Write Protect is off  sd 0:0:0:0: [sda] Asking 
> > > > for cache data failed  sd 0:0:0:0: [sda] Assuming drive 
> > cache: write 
> > > > through -sd 0:0:0:0: [sda] 17547264 512-byte hardware 
> > sectors (8984 
> > > > MB)
> > > > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512.
> > > > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB)
> > > >  sd 0:0:0:0: [sda] Write Protect is off  sd 0:0:0:0: [sda] Asking 
> > > > for cache data failed  sd 0:0:0:0: [sda] Assuming drive 
> > cache: write 
> > > > through
> > > >   sda: sda1
> > > > + sda: p1 exceeds device capacity
> > > > 
> > > > <--  snip  -->
> > > > 
> > > > -   case MEGA_BULK_DATA:
> > > > -   if (scb->cmd->use_sg == 0)
> > > > -   length = scb->cmd->request_bufflen;
> > > > -   else {
> > > > -   struct scatterlist *sgl =
> > > > -   (struct scatterlist 
> > *)scb->cmd->request_buffer;
> > > > -   length = sgl->length;
> > > > -   }
> > > > -   pci_unmap_page(adapter->dev, scb->dma_h_bulkdata,
> > > > -  length, scb->dma_direction);
> > > > -   break;
> > > > -
> > > 
> > > This is the problem piece I think.  We've reintroduced a 
> > very old bug:
> > > 
> > > commit 51c928c34fa7cff38df584ad01de988805877dba
> > > Author: James Bottomley <[EMAIL PROTECTED]>
> > > Date:   Sat Oct 1 09:38:05 2005 -0500
> > > 
> > > [SCSI] Legacy MegaRAID: Fix READ CAPACITY
> > > 
> > > Some Legacy megaraid cards can't actually cope with the 
> > scatter/gather
> > > version of the READ CAPACITY command (which is what we 
> > now send them
> > > since altering all SCSI internal I/O to go via the 
> > block layer).  Fix
> > > this (and a few other broken megaraid driver 
> > assumptions) by sending
> > > the non-sg version of the command if the sg list only 
> > has a single
> > > element.
> > > 
> > > Signed-off-by: James Bottomley <[EMAIL PROTECTED]>
> > > 
> > > So what we have to do is put back the check for use_sg == 1 
> > and send 
> > > that as a bulk transfer command.
> > 
> > Sorry about this. Can this fix the problem?
> > 
> > Thanks,
> > 
> > 
> > diff --git a/drivers/scsi/megaraid.c 
> > b/drivers/scsi/megaraid.c index 3907f67..da56163 100644
> > --- a/drivers/scsi/megaraid.c
> > +++ b/drivers/scsi/megaraid.c
> > @@ -1753,6 +1753,14 @@ mega_build_sglist(adapter_t *adapter, 
> > scb_t *scb, u32 *buf, u32 *len)
> >  
> > *len = 0;
> >  
> > +   if (scsi_sg_count(cmd) == 1 && !adapter->has_64bit_addr) {
> > +   sg = scsi_sglist(cmd);
> > +   scb->dma_h_bulkdata = sg_dma_address(sg);
> > +   *buf = (u32)scb->dma_h_bulkdata;
> > +   *len = sg_dma_len(sg);
> > +   return 0;
> > +   }
> > +
> > scsi_for_each_sg(cmd, sg, sgcnt, idx) {
> > if (adapter->has_64bit_addr) {
> > scb->sgl64[idx].address = sg_dma_address(sg);
> > 
> 
> 
> With this patch I see the correct logical disk size reported.
> Thanks.

Great, thanks for testing!

Can you try the following patch instead of the above patch?

http://marc.info/?l=linux-scsi&m=119137033016550&w=2


I know the changes are pretty trivial and it should work...

What's slated for inclusion in 2.6.24-rc1 from the NFS client git tree...

2007-10-03 Thread Trond Myklebust

Aside from the usual updates from Chuck for NFS-over-IPv6 (still
incomplete) and a number of bugfixes for the text-based mount code, the
main news in the NFS tree is the merging of support for the NFS/RDMA
client code from Tom Talpey and the NetApp New England (NANE) team.

We also have the 64-bit inode support from RedHat/Peter Staubach.

There is also the addition of a nfs_vm_page_mkwrite() method in order to
clean up the mmap() write code.

Finally, I've been working on a number of updates for the attribute
revalidation, having pulled apart most of the dentry and attribute
revalidation into separate variables. A number of fixes that address
existing bugs fell out of that review, which should hopefully result in
more efficient dcache behaviour...

The NFS client git tree can be found at

   git://git.linux-nfs.org/pub/linux/nfs-2.6.git

or on gitweb at

  http://linux-nfs.org/cgi-bin/gitweb.cgi?p=nfs-2.6.git;a=summary

Finally, a full set of patches may be found on

  http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc9/

Cheers
  Trond

---

Adrian Bunk (1):
  [2.6 patch] net/sunrpc/rpcb_clnt.c: make struct rpcb_program static

Christoph Hellwig (1):
  [NFS] [PATCH] nfs: tiny makefile cleanup

Chuck Lever (41):
  SUNRPC: Fix a signed v. unsigned comparison in rpcbind's XDR routines
  SUNRPC: Fix a signed v. unsigned comparison in net/sunrpc/xprtsock.c
  SUNRPC: Use standard macros for printing IP addresses
  SUNRPC: Free address buffers in a loop
  SUNRPC: Add hex-formatted address support to rpc_peeraddr2str()
  SUNRPC: Rename xs_format_peer_addresses
  SUNRPC: add a function to format IPv6 addresses
  SUNRPC: add support for IPv6 to the kernel's rpcbind client
  SUNRPC: Introduce support for setting the port number in IPv6 addresses
  SUNRPC: Rename xs_bind() to prepare for IPv6-specific bind method
  SUNRPC: create an IPv6-savvy mechanism for binding to a reserved port
  SUNRPC: Refactor a part of socket connect logic into a helper function
  SUNRPC: Rename IPv4 connect workers
  SUNRPC: create connect workers for IPv6
  SUNRPC: Add IPv6 address support to net/sunrpc/xprtsock.c
  SUNRPC: Add a helper for extracting the address using the correct type
  SUNRPC: Split xs_reclassify_socket into an IPv4 and IPv6 version
  SUNRPC: Add support for formatted universal addresses
  SUNRPC: Fix generation of universal addresses for
  SUNRPC: Only one dprintk is needed during client creation
  SUNRPC: fix a signed v. unsigned comparison nit in rpc_bind_new_program
  SUNRPC: Use correct argument type in memcpy()
  SUNRPC: Make sure server name is reasonable before trying to print it
  SUNRPC: Clean up in rpc_show_tasks
  SUNRPC: Make rpcb_decode_getaddr more picky about universal addresses
  SUNRPC: Retry bad rpcbind replies
  SUNRPC: Add a new error code for retry waiting for another binder
  SUNRPC: Split another new rpcbind retry error code from EACCES
  SUNRPC: RPC bind failures should be permanent for NULL requests
  NFS: Kernel mount client should use async bind
  NFS: Add new 'mountaddr=' mount option
  NFS: Convert printk's to dprintk's in fs/nfs/nfs?xdr.c
  LOCKD: Convert printk's to dprintk's in lockd XDR routines
  NFSD: Convert printk's to dprintk's in NFSD's nfs4xdr
  NFS: Verify server address before invoking in-kernel mount client
  NFS: Show "nointr" mount option
  SUNRPC: Fix bytes-per-op accounting for RPC over UDP
  NFS: Don't call nfs_renew_times() in nfs_dentry_iput()
  NFS: Eliminate nfs_renew_times()
  NFS: Eliminate nfs_refresh_verifier()
  SUNRPC: Use correct type in buffer length calculations

Fabio Olive Leite (1):
  Re: [NFS] [PATCH] Attribute timeout handling and wrapping u32 jiffies

J. Bruce Fields (2):
  nfs: add server port to rpc_pipe info file
  SUNRPC: Fix default hostname created in rpc_create()

James Lentini (1):
  [NFS] [PATCH] NFS: initialize default port in kernel mount client

Jeff Layton (1):
  [NFS] [PATCH] NFS: show addr=ipaddr in /proc/mounts rather than

Jesper Juhl (1):
  [23/37] Clean up duplicate includes in

Peter Staubach (1):
  64 bit ino support for NFS client

Trond Myklebust (56):
  NFS: Add the helper nfs_vm_page_mkwrite
  NFS: Clean up write code...
  NFS: Clean up nfs_writepages()
  VFS: Remove writeback_control->fs_private
  NFS: Clean up NFS writeback flush code
  NFS: Writeback optimisation
  NFS: Fall back to synchronous writes when a background write errors...
  SUNRPC: Convert rpc_pipefs to use the generic filesystem notification hooks
  NFSv4: Fix a bug in nfs4_validate_mount_data()
  NFS: Add a helper to extract the nfs_open_context from a struct file
  NFS: Replace file->private_data with calls to nfs_file_open_context()
  NFSv4: Simplify _nfs4_do_access()
  NFSv4: Make NFSv4 ACCESS

Re: [PATCH] net: fix race in process_backlog

2007-10-03 Thread David Miller

From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Wed, 3 Oct 2007 15:05:19 -0700

> On Wed, 03 Oct 2007 14:58:07 -0700 (PDT)
> David Miller <[EMAIL PROTECTED]> wrote:
> 
> > From: Peter Zijlstra <[EMAIL PROTECTED]>
> > Date: Wed, 03 Oct 2007 17:44:53 +0200
> > 
> > > Index: linux-2.6/net/core/dev.c
> > > ===
> > > --- linux-2.6.orig/net/core/dev.c
> > > +++ linux-2.6/net/core/dev.c
> > > @@ -2095,11 +2095,11 @@ static int process_backlog(struct napi_s
> > >  
> > >   local_irq_disable();
> > >   skb = __skb_dequeue(&queue->input_pkt_queue);
> > > - local_irq_enable();
> > >   if (!skb) {
> > > - napi_complete(napi);
> > > + __napi_complete(napi);
> > >   break;
> > >   }
> > > + local_irq_enable();
> > 
> > What re-enables interrupts in the !skb path?
> 
> This looks like a better fix. the irq_enable is needed in both cases.

Yep, applied, thanks Peter and Stephen.

RE: 2.6.23-rc9 boot failure (megaraid?)

2007-10-03 Thread Patro, Sumant

 

> -Original Message-
> From: FUJITA Tomonori [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, October 02, 2007 5:01 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; Patro, Sumant; DL-MegaRAID 
> Linux; [EMAIL PROTECTED]
> Subject: Re: 2.6.23-rc9 boot failure (megaraid?)
> 
> On Tue, 02 Oct 2007 15:38:13 -0500
> James Bottomley <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, 2007-10-02 at 20:15 +0200, Adrian Bunk wrote:
> > > Cc's added, the complete bug report is at
> > >   http://lkml.org/lkml/2007/10/2/243
> > > 
> > > On Tue, Oct 02, 2007 at 12:48:26PM -0400, Burton Windle wrote:
> > > > 2.6.23-rc9 fails to boot for me; 2.6.22.9 works fine.
> > > >
> > > > System is a Dell Poweredge with PERC 2/DC with RAID1 volume.
> > > >...
> > > 
> > > Thanks for your report.
> > > 
> > > Diff'ing the dmesg's shows:
> > > 
> > > <--  snip  -->
> > > 
> > >  scsi0: scanning scsi channel 4 [P0] for physical devices.
> > >  scsi0: scanning scsi channel 5 [P1] for physical devices.
> > >  st: Version 20070203, fixed bufsize 32768, s/g segs 256 -sd 
> > > 0:0:0:0: [sda] 17547264 512-byte hardware sectors (8984 MB)
> > > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512.
> > > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB)
> > >  sd 0:0:0:0: [sda] Write Protect is off  sd 0:0:0:0: [sda] Asking 
> > > for cache data failed  sd 0:0:0:0: [sda] Assuming drive 
> cache: write 
> > > through -sd 0:0:0:0: [sda] 17547264 512-byte hardware 
> sectors (8984 
> > > MB)
> > > +sd 0:0:0:0: [sda] Sector size 0 reported, assuming 512.
> > > +sd 0:0:0:0: [sda] 1 512-byte hardware sectors (0 MB)
> > >  sd 0:0:0:0: [sda] Write Protect is off  sd 0:0:0:0: [sda] Asking 
> > > for cache data failed  sd 0:0:0:0: [sda] Assuming drive 
> cache: write 
> > > through
> > >   sda: sda1
> > > + sda: p1 exceeds device capacity
> > > 
> > > <--  snip  -->
> > > 
> > > - case MEGA_BULK_DATA:
> > > - if (scb->cmd->use_sg == 0)
> > > - length = scb->cmd->request_bufflen;
> > > - else {
> > > - struct scatterlist *sgl =
> > > - (struct scatterlist 
> *)scb->cmd->request_buffer;
> > > - length = sgl->length;
> > > - }
> > > - pci_unmap_page(adapter->dev, scb->dma_h_bulkdata,
> > > -length, scb->dma_direction);
> > > - break;
> > > -
> > 
> > This is the problem piece I think.  We've reintroduced a 
> very old bug:
> > 
> > commit 51c928c34fa7cff38df584ad01de988805877dba
> > Author: James Bottomley <[EMAIL PROTECTED]>
> > Date:   Sat Oct 1 09:38:05 2005 -0500
> > 
> > [SCSI] Legacy MegaRAID: Fix READ CAPACITY
> > 
> > Some Legacy megaraid cards can't actually cope with the 
> scatter/gather
> > version of the READ CAPACITY command (which is what we 
> now send them
> > since altering all SCSI internal I/O to go via the 
> block layer).  Fix
> > this (and a few other broken megaraid driver 
> assumptions) by sending
> > the non-sg version of the command if the sg list only 
> has a single
> > element.
> > 
> > Signed-off-by: James Bottomley <[EMAIL PROTECTED]>
> > 
> > So what we have to do is put back the check for use_sg == 1 
> and send 
> > that as a bulk transfer command.
> 
> Sorry about this. Can this fix the problem?
> 
> Thanks,
> 
> 
> diff --git a/drivers/scsi/megaraid.c 
> b/drivers/scsi/megaraid.c index 3907f67..da56163 100644
> --- a/drivers/scsi/megaraid.c
> +++ b/drivers/scsi/megaraid.c
> @@ -1753,6 +1753,14 @@ mega_build_sglist(adapter_t *adapter, 
> scb_t *scb, u32 *buf, u32 *len)
>  
>   *len = 0;
>  
> + if (scsi_sg_count(cmd) == 1 && !adapter->has_64bit_addr) {
> + sg = scsi_sglist(cmd);
> + scb->dma_h_bulkdata = sg_dma_address(sg);
> + *buf = (u32)scb->dma_h_bulkdata;
> + *len = sg_dma_len(sg);
> + return 0;
> + }
> +
>   scsi_for_each_sg(cmd, sg, sgcnt, idx) {
>   if (adapter->has_64bit_addr) {
>   scb->sgl64[idx].address = sg_dma_address(sg);
> 


With this patch I see the correct logical disk size reported.
Thanks.

Sumant

Re: Decreasing stime running confuses top (was: top displaying 9999% CPU usage)

2007-10-03 Thread Frans Pop

On Wednesday 03 October 2007, you wrote:
> On Wed, Oct 03, 2007 at 09:27:41PM +0200, Frans Pop wrote:
> > On Wednesday 03 October 2007, you wrote:
> > > On Wed, 3 Oct 2007, Ilpo Järvinen wrote:
> > > > On Wed, 3 Oct 2007, Frans Pop wrote:
> > > > > The only change is in 2 consecutive columns: "2911 502" -> "2912
> > > > > 500". Is processor usage calculated from those? Can someone
> > > > > explain how?
> > > >
> > > > The latter seems to be utime ...decreasing. No wonder if
> > > > arithmetics will give strange results (probably top is using
> > > > unsigned delta?)...
> > >
> > > Hmm, minor miscounting from my side, stime seems more appropriate...
> >
> > So, is it normal that stime decreases sometimes or a kernel bug?
> > /me expects the last...
>
> Let me guess... Dual core AMD64 ?

Nope: Intel(R) Pentium(R) D CPU 3.20GHz

> I'm 99.99% sure that if you boot with "notsc", the problem disappears.

Not really. With that first test I did have:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

If I boot with 'notsc', I get:
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

But the problem is still exactly the same:
Oct 04 00:53:37 545 92
Oct 04 00:53:38 545 94
Oct 04 00:53:43 546 92  <--
Oct 04 00:53:49 547 94
Oct 04 00:53:54 549 93  <--
Oct 04 00:54:00 550 94

Some relevant lines from kernel log:
checking TSC synchronization [CPU#0 -> CPU#1]: passed  <--- Not there with 'notsc'
hpet0: at MMIO 0xfed0, IRQs 2, 8, 0
hpet0: 3 64-bit timers, 14318180 Hz
ACPI: RTC can wake from S4
Time: hpet clocksource has been installed.
hpet_resources: 0xfed0 is busy
Time: tsc clocksource has been installed.  <--- Not there with 'notsc'

> If so, you have one of those wonderful AMD64 with unsynced clock and
> without HPET to sync with. I wrote a simple program in the past to exhibit
> the problem. It would simply run "date +%s" in a busy loop and display
> each time it would change. Amazing. It could jump back and forth by up to
> 3 seconds!

I tried your script, but the clock runs perfectly. Never saw anything
other than a 1 second increment.

The following may well be relevant.
With 2.6.22 and early 2.6.23-rc kernels (rc3-rc6) I often had this in my
kernel log (see http://lkml.org/lkml/2007/9/16/45):
   checking TSC synchronization [CPU#0 -> CPU#1]:
   Measured 248 cycles TSC warp between CPUs, turning off TSC clock.
   Marking TSC unstable due to check_tsc_sync_source failed

Some boots the TSC synchronization would be OK, but I'd see ~2/3 failures.
Kernels before 2.6.22 did not have this problem.

However, checking my logs now I see that these messages have disappeared
since 2.6.23-rc7. Now the TSC synchronization check always passes.

I also tried with 2.6.22-6 and with that the jumping around is _not_
present. This was a boot where TSC synchronization failed, so with hpet
as clocksource.
Also, the numbers stay constant much longer and have bigger increments
(updates look to be once per minute?):
Oct 04 01:24:19 465 67
Oct 04 01:24:50 467 69
Oct 04 01:24:51 469 72
Oct 04 01:25:51 474 76
Oct 04 01:26:50 478 80

Cheers,
Frans Pop

Re: [PATCH 2/2 -mm] capabilities: introduce per-process capability bounding set (v4)

2007-10-03 Thread James Morris

On Wed, 3 Oct 2007, Serge E. Hallyn wrote:


> 
> Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>

Acked-by: James Morris <[EMAIL PROTECTED]>

-- 
James Morris
<[EMAIL PROTECTED]>

Re: [PATCH] Version 4 (2.6.23-rc8-mm2) Smack: Simplified Mandatory Access Control Kernel

2007-10-03 Thread Al Viro

On Wed, Oct 03, 2007 at 03:23:15PM -0700, Casey Schaufler wrote:
> 1. Create /moldy at "_"
> 2. For each label you care about
>2a. Create /moldy/
>2b. Set the label of /moldy/ to 
> 3. ln -s /smack/tmp /tmp

> 1. Create /moldy at "_"
> 2. For each label you care about
>2a. Create /moldy/
>2b. Set the label of /moldy/ to 
> 3. ln -s /smack/tmp.link /tmp
  4. mount --bind /moldy /smack/tmp
or add
/moldy /smack/tmp none bind,rw 0 2
to /etc/fstab (same effect as (4))

Compare with your variant; the difference is in one argument of ln(1) and
one additional line in rc script or /etc/fstab.  Sorry, but I don't buy
the "extra setup complexity" argument at all.

> It's the content of a symlink, and that can be just about anything
> and is not required to point to anything, which is one reason why
> I made that choice. If you don't have a /tmp, or can't write to the
> /tmp that exists, or have a /tmp that's a dangling symlink under
> any circumstances you may have an issue. That's true regardless of
> the presence or absence of /smack. All of the traditional mechanisms
> for dealing with /tmp in a chrooted or namespaced environment remain.

It's not about symlink pointing to /smack/; it's about the place
where /smack/ itself points to.  And _that_ can bloody well be
different in different chroots.

Look, if you allow to change where it goes, you certainly allow different
prefices on different boxen; moreover, admin can change it freely according
to his layout on given box.  OTOH, you _can't_ have it different in different
chroots and changing it in one will affect all of them.  See why that's a
problem?

> It's in a symlink on the filesystem, and it doesn't have to be an
> absolute pathname, although since it's a symlink and the semantics
> for a symlink allow that to be absolute, relative, or dangling I
> don't see any reason to restrict it from being absolute.

Fixed-contents symlink (with or without variable tail - it's irrelevant
here) is a bloody wrong tool for that kind of fs for the reasons described
above.  And if you go for "prefix should point to location on the same fs"
you can trivially configure the rest in userland (one line describing a
binding), leaving the kernel-side stuff with something like "userland
can ask for a pair of symlink and directory, having symlink resolve
to directory + " instead of your "userland can ask for a symlink
resolving to  + ".  And _that_ is chroot-neutral - you don't
need to do any extra work...

> Could allowing multiple distinct mounts and symlink assignments
> of /smackfs address those issues?

... like that one.  Leave it to normal userland mechanisms; it's a matter
of a single line in whatever script you are using to set chroot up and it
involves _way_ fewer caveats.

That said, Alan's point still stands - if you don't get processes changing
context back and forth, you don't need anything at all - we already have
all we need for that kind of setups (and no, selinux is not involved ;-).

Re: [PATCH] Fix blktrace setup 32-bit ioctl on 64-bit kernels

2007-10-03 Thread Arnd Bergmann

On Wednesday 03 October 2007, Arnd Bergmann wrote:
> Jens, I think the best overall solution would be to have a
> block/compat_ioctl.c file with all the compat handling for block
> devices moved over from fs/compat_ioctl.c, and done in a nicer way.
> If you agree, with this approach, I'd volunteer to come up with a
> patch.

Sometimes I find it hard to stop myself once I have the idea.
The patch below moves all block related ioctl conversion out
of fs/compat_ioctl.c into the compat_blkdev_ioctl() function.

Is that a direction we should be heading towards? If so, I
can do some testing and split this big patch into more
logical units for better review.

I also found a few interesting bugs in the process:
* BLKRASET is both ULONG_IOCTL and COMPATIBLE_IOCTL, but this should
  be entirely harmless
* BLKSECTGET writes 2 bytes normally, our compat_ version writes
  4 bytes!
* FDSETPRM32, FDDEFPRM32 and FDGETPRM32 are doing potentially dangerous
  stuff with kernel pointers. sparse actually warns about this, but
  from what I could see from floppy.c, the kernel always ignores these
  pointers. I did not check any other floppy drivers implementing the
  same calls.

Arnd <><

Index: linux-2.6/block/Makefile
===
--- linux-2.6.orig/block/Makefile
+++ linux-2.6/block/Makefile
@@ -11,3 +11,4 @@ obj-$(CONFIG_IOSCHED_DEADLINE)+= deadli
 obj-$(CONFIG_IOSCHED_CFQ)  += cfq-iosched.o
 
 obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
+obj-$(CONFIG_COMPAT)   += compat_ioctl.o
Index: linux-2.6/block/compat_ioctl.c
===
--- /dev/null
+++ linux-2.6/block/compat_ioctl.c
@@ -0,0 +1,772 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+static int compat_put_ushort(unsigned long arg, unsigned short val)
+{
+   return put_user(val, (unsigned short __user *)compat_ptr(arg));
+}
+
+static int compat_put_int(unsigned long arg, int val)
+{
+   return put_user(val, (compat_int_t __user *)compat_ptr(arg));
+}
+
+static int compat_put_long(unsigned long arg, long val)
+{
+   return put_user(val, (compat_long_t __user *)compat_ptr(arg));
+}
+
+static int compat_put_ulong(unsigned long arg, compat_ulong_t val)
+{
+   return put_user(val, (compat_ulong_t __user *)compat_ptr(arg));
+}
+
+static int compat_put_u64(unsigned long arg, u64 val)
+{
+   return put_user(val, (compat_u64 __user *)compat_ptr(arg));
+}
+
+struct compat_hd_geometry {
+   unsigned char heads;
+   unsigned char sectors;
+   unsigned short cylinders;
+   u32 start;
+};
+
+static int compat_hdio_getgeo(struct gendisk *disk, struct block_device *bdev,
+   struct compat_hd_geometry __user *ugeo)
+{
+   struct hd_geometry geo;
+   int ret;
+
+   if (!ugeo)
+   return -EINVAL;
+   if (!disk->fops->getgeo)
+   return -ENOTTY;
+
+   /*
+* We need to set the startsect first, the driver may
+* want to override it.
+*/
+   geo.start = get_start_sect(bdev);
+   ret = disk->fops->getgeo(bdev, &geo);
+   if (ret)
+   return ret;
+
+   ret = copy_to_user(ugeo, &geo, 4);
+   ret |= __put_user(geo.start, &ugeo->start);
+   if (ret)
+   ret = -EFAULT;
+
+   return ret;
+}
+
+static int compat_hdio_ioctl(struct inode *inode, struct file *file,
+   struct gendisk *disk, unsigned int cmd, unsigned long arg)
+{
+   mm_segment_t old_fs = get_fs();
+   unsigned long kval;
+   unsigned int __user *uvp;
+   int error;
+
+   set_fs(KERNEL_DS);
+   error = blkdev_driver_ioctl(inode, file, disk,
+   cmd, (unsigned long)(&kval));
+   set_fs(old_fs);
+
+   if (error == 0) {
+   uvp = compat_ptr(arg);
+   if(put_user(kval, uvp))
+   error = -EFAULT;
+   }
+   return error;
+}
+
+struct cdrom_read_audio32 {
+   union cdrom_addr    addr;
+   u8                  addr_format;
+   compat_int_t        nframes;
+   compat_caddr_t      buf;
+};
+
+struct cdrom_generic_command32 {
+   unsigned char   cmd[CDROM_PACKET_SIZE];
+   compat_caddr_t  buffer;
+   compat_uint_t   buflen;
+   compat_int_t    stat;
+   compat_caddr_t  sense;
+   unsigned char   data_direction;
+   compat_int_t    quiet;
+   compat_int_t    timeout;
+   compat_caddr_t  reserved[1];
+};
+
+static int cdrom_do_read_audio(struct inode *inode, struct file *file,
+   struct gendisk *disk, unsigned int cmd, unsigned long arg)
+{
+   struct cdrom_read_audio __user *cdread_audio;
+   struct cdrom_read_audio32 __user *cdread_audio32;
+   __u32 data;
+   void __user *datap;
+
+   cdread_audio =

Re: [PATCH 1/2 -mm] capabilities: define CONFIG_COMMONCAP

2007-10-03 Thread James Morris

On Wed, 3 Oct 2007, Serge E. Hallyn wrote:

> >From 54c70ca7671750fe8986451fae91d42107d0ca90 Mon Sep 17 00:00:00 2001
> From: Serge E. Hallyn <[EMAIL PROTECTED]>
> Date: Fri, 28 Sep 2007 10:33:33 -0500
> Subject: [PATCH 1/2 -mm] capabilities: define CONFIG_COMMONCAP
> 
> currently the compilation of commoncap.c is determined
> through Makefile logic.  So there is no single CONFIG
> variable which can be relied upon to know whether it
> will be compiled.
> 
> Define CONFIG_COMMONCAP to be true when lsm is not
> compiled in, or when the capability or rootplug modules
> are compiled.  These are the cases when commoncap is
> currently compiled.  Use this variable in security/Makefile
> to determine commoncap.c's compilation.
> 
> Apart from being a logic cleanup, this is needed by the
> upcoming cap_bset patch so that prctl can know whether
> PR_SET_BSET should be allowed.
> 
> Signed-off-by: Serge E. Hallyn <[EMAIL PROTECTED]>

Acked-by: James Morris <[EMAIL PROTECTED]>


-- 
James Morris
<[EMAIL PROTECTED]>

Re: + add-documentation-w1w1-masters-00-index.patch added to -mm tree

2007-10-03 Thread Randy Dunlap

On Wed, 03 Oct 2007 14:17:33 -0700 [EMAIL PROTECTED] wrote:

> 
> The patch titled
>  Add Documentation/{w1,w1/masters}/00-INDEX
> has been added to the -mm tree.  Its filename is
>  add-documentation-w1w1-masters-00-index.patch
> 
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 
> See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> out what to do about this
> 
> --
> Subject: Add Documentation/{w1,w1/masters}/00-INDEX
> From: Rob Landley <[EMAIL PROTECTED]>
> 
> Two 00-INDEX files under Documentation/w1
> 
> Signed-off-by: Rob Landley <[EMAIL PROTECTED]>
> Cc: Evgeniy Polyakov <[EMAIL PROTECTED]>
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> ---
> 
> 
> diff -puN /dev/null Documentation/w1/00-INDEX
> --- /dev/null
> +++ a/Documentation/w1/00-INDEX
> @@ -0,0 +1,8 @@
> +00-INDEX
> + - This file
> +masters/
> + - Individual chips providing 1-wire busses.
> +w1.generic
> + - The 1-wire (w1) bus
> +w1.netlink
> + - Userspace communication protocol over connector [1].
> diff -puN /dev/null Documentation/w1/masters/00-INDEX
> --- /dev/null
> +++ a/Documentation/w1/masters/00-INDEX
> @@ -0,0 +1,6 @@
> +00-INDEX
> + - This file
> +ds2482
> + - The Maixm/Dallas Semiconductor DS2482 provides 1-wire busses.
> +ds2490
> + - The Maixm/Dallas Semiconductor DS2490 builds USB <-> W1 bridges.

  Maxim (2 times)

Was this patch posted to a mailing list?  if so, which one?
I didn't see it.
And if not, please post all patches for review.


---
~Randy

Re: [git patches] net driver updates

2007-10-03 Thread David Miller

From: Jeff Garzik <[EMAIL PROTECTED]>
Date: Wed, 3 Oct 2007 14:39:16 -0400

> 
> Normally I wait a day or two between pushes, to queue up patches and
> also to avoid annoying my upstream :)  But this includes a couple fixes
> I felt should be upstreamed sooner rather than later.
> 
> Please pull from 'upstream' branch of
> master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git upstream

Pulled, thanks Jeff!

[PATCH 2/2 -mm] capabilities: introduce per-process capability bounding set (v4)

2007-10-03 Thread Serge E. Hallyn

From d93ecb90d82f9e2b7f48c74f5e6ed97cac3683c7 Mon Sep 17 00:00:00 2001
From: Serge Hallyn <[EMAIL PROTECTED]>
Date: Fri, 28 Sep 2007 10:33:56 -0500
Subject: [PATCH 2/2 -mm] capabilities: introduce per-process capability 
bounding set (v4)

The capability bounding set is a set beyond which capabilities
cannot grow.  Currently cap_bset is per-system.  It can be
manipulated through sysctl, but only init can add capabilities.
Root can remove capabilities.  By default it includes all caps
except CAP_SETPCAP.

This patch makes the bounding set per-process.  It is inherited
at fork from parent.  No one can add elements.  CAP_SYS_ADMIN is
required to remove them.  The reason for that is to stop an
unprivileged user from removing key capabilities, then running
a setuid root binary (or one with file capabilities) which
assumes it got all, not some, of the capabilities it needs
and fails unsafely partway through.  Perhaps a new capability
should be introduced rather than using CAP_SYS_ADMIN.  Or,
since I don't have a *concrete* example of an exploit, perhaps
any user should be able to drop capabilities from his capbset.

One example use of this is to start a safer container.  For
instance, until device namespaces or per-container device
whitelists are introduced, it is probably wise to take CAP_MKNOD
away from a container.
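
As a rough illustration of the rules described above (inherited at fork,
never grows, CAP_SYS_ADMIN needed to shrink), here is a minimal userspace
model.  It is only a sketch: `struct task_model`, `model_fork` and
`model_set_bset` are hypothetical names made up for this example and are
not part of the patch or of any kernel API.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a per-process bounding set. */
struct task_model {
    unsigned long bset;   /* one bit per capability */
    bool has_sys_admin;   /* stands in for a CAP_SYS_ADMIN check */
};

/* A child starts with a copy of its parent's bounding set. */
static struct task_model model_fork(const struct task_model *parent)
{
    return *parent;
}

/*
 * Ask for a new bounding set.  Returns 0 on success and -1 when the
 * request would add a bit (never allowed) or when it removes bits
 * without the privilege described in the text.
 */
static int model_set_bset(struct task_model *t, unsigned long newval)
{
    if (newval & ~t->bset)
        return -1;                      /* nobody may add elements */
    if (newval != t->bset && !t->has_sys_admin)
        return -1;                      /* removal needs CAP_SYS_ADMIN */
    t->bset = newval;
    return 0;
}
```

In this model a container launcher would call `model_set_bset()` once with
the CAP_MKNOD bit cleared, and every descendant created through
`model_fork()` would then be unable to get that bit back.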

The following hacky test program will get and set the bounding
set.  For instance:

./bset get
(lists capabilities in bset)
./bset strset cap_sys_admin
(starts shell with new bset)
(use capset, setuid binary, or binary with
file capabilities to try to increase caps)

===
bset.c:
===

unsigned long newval;
int cmd_getbcap;

char *captable[] = {
"cap_dac_override",
"cap_dac_read_search",
"cap_fowner",
"cap_fsetid",
"cap_kill",
"cap_setgid",
"cap_setuid",
"cap_setpcap",
"cap_linux_immutable",
"cap_net_bind_service",
"cap_net_broadcast",
"cap_net_admin",
"cap_net_raw",
"cap_ipc_lock",
"cap_ipc_owner",
"cap_sys_module",
"cap_sys_rawio",
"cap_sys_chroot",
"cap_sys_ptrace",
"cap_sys_pacct",
"cap_sys_admin",
"cap_sys_boot",
"cap_sys_nice",
"cap_sys_resource",
"cap_sys_time",
"cap_sys_tty_config",
"cap_mknod",
"cap_lease",
"cap_audit_write",
"cap_audit_control",
"cap_setfcap"};

char *inttocap(unsigned long v)
{
char *str = NULL;
int i;

str = malloc(1);
str[0] = '\0';
for (i=0; i<31; i++) {
if (v & (1UL << (i+1))) {
char *tmp = captable[i];
str = realloc(str, strlen(str)+2+strlen(tmp));
sprintf(str+strlen(str), ",%s", tmp);
}
}
return str;
}

int getbcap(void)
{
unsigned long bcap;
int ret;
unsigned int ver;

ret = prctl(PR_GET_CAPBSET, &ver, &bcap);
if (ret == -1)
perror("prctl");
if (ver != _LINUX_CAPABILITY_VERSION)
printf("wrong capability version: %u not %u\n",
ver, (unsigned int)_LINUX_CAPABILITY_VERSION);
printf("prctl get_bcap returned %lu (ret %d)\n", bcap, ret);
printf("that is %s\n", inttocap(bcap));
return ret;
}

int setbcap(unsigned long val)
{
int ret;

ret = prctl(PR_SET_CAPBSET, _LINUX_CAPABILITY_VERSION, val);
return ret;
}

int usage(char *me)
{
printf("Usage: %s get\n", me);
printf("   %s set \n", me);
printf("   %s strset capability_string\n", me);
printf(" capability_string is for instance:\n");
printf(" cap_sys_admin,cap_mknod,cap_dac_override\n");
return 1;
}

unsigned long captoint(char *cap)
{
if (strcmp(cap, "cap_dac_override") == 0)
return 1;
else if (strcmp(cap, "cap_dac_read_search") == 0)
return 2;
else if (strcmp(cap, "cap_fowner") == 0)
return 3;
else if (strcmp(cap, "cap_fsetid") == 0)
return 4;
else if (strcmp(cap, "cap_kill") == 0)
return 5;
else if (strcmp(cap, "cap_setgid") == 0)
return 6;
else if (strcmp(cap, "cap_setuid") == 0)
return 7;
else if (strcmp(cap, "cap_setpcap") == 0)
return 8;
else if (strcmp(cap, "cap_linux_immutable") == 0)
return 9;
else if (strcmp(cap, "cap_net_bind_service") == 0)
return 10;
else if (strcmp(cap, "cap_net_broadcast") == 0)
return 11;
else if (strcmp(cap,

[PATCH 1/2 -mm] capabilities: define CONFIG_COMMONCAP

2007-10-03 Thread Serge E. Hallyn

From 54c70ca7671750fe8986451fae91d42107d0ca90 Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <[EMAIL PROTECTED]>
Date: Fri, 28 Sep 2007 10:33:33 -0500
Subject: [PATCH 1/2 -mm] capabilities: define CONFIG_COMMONCAP

Currently the compilation of commoncap.c is determined
through Makefile logic.  So there is no single CONFIG
variable which can be relied upon to know whether it
will be compiled.

Define CONFIG_COMMONCAP to be true when lsm is not
compiled in, or when the capability or rootplug modules
are compiled.  These are the cases when commoncap is
currently compiled.  Use this variable in security/Makefile
to determine commoncap.c's compilation.

Apart from being a logic cleanup, this is needed by the
upcoming cap_bset patch so that prctl can know whether
PR_SET_BSET should be allowed.

Signed-off-by: Serge E. Hallyn <[EMAIL PROTECTED]>
---
 security/Kconfig  |4 
 security/Makefile |9 +++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/security/Kconfig b/security/Kconfig
index 8086e61..02b33fa 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -103,6 +103,10 @@ config SECURITY_ROOTPLUG
  
  If you are unsure how to answer this question, answer N.
 
+config COMMONCAP
+   bool
+   default !SECURITY || SECURITY_CAPABILITIES || SECURITY_ROOTPLUG
+
 source security/selinux/Kconfig
 
 endmenu
diff --git a/security/Makefile b/security/Makefile
index ef87df2..781 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -5,14 +5,11 @@
 obj-$(CONFIG_KEYS) += keys/
 subdir-$(CONFIG_SECURITY_SELINUX)  += selinux
 
-# if we don't select a security model, use the default capabilities
-ifneq ($(CONFIG_SECURITY),y)
-obj-y  += commoncap.o
-endif
+obj-$(CONFIG_COMMONCAP)+= commoncap.o
 
 # Object file lists
 obj-$(CONFIG_SECURITY) += security.o dummy.o inode.o
 # Must precede capability.o in order to stack properly.
 obj-$(CONFIG_SECURITY_SELINUX) += selinux/built-in.o
-obj-$(CONFIG_SECURITY_CAPABILITIES)+= commoncap.o capability.o
-obj-$(CONFIG_SECURITY_ROOTPLUG)+= commoncap.o root_plug.o
+obj-$(CONFIG_SECURITY_CAPABILITIES)+= capability.o
+obj-$(CONFIG_SECURITY_ROOTPLUG)+= root_plug.o
-- 
1.5.1.1.GIT


Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

2007-10-03 Thread Eric W. Biederman

Benjamin Herrenschmidt <[EMAIL PROTECTED]> writes:
>
> Well, yes and no ... A valid option here would be to use soft-masking,
> which is possible because MSIs are edge interrupts. That is, basically,
> when masked, just ignore them and set IRQF_PENDING, and when unmasked,
> replay (which can be done with softirq if there is no HW mechanism for
> that). The genirq code contains all the necessary infrastructure for
> doing that stuff, it's fairly trivial, and would probably avoid stepping
> in HW lalaland (how much do you bet HW generally get that masking thing
> wrong ?)

Well.  If people actually use it I suspect it will work ok.  The
circuitry is quite simple so as long as people get their requirements
straight we should be fine.  Which is why I tried to get everything
working as well as we could sooner rather than later.  Of course
drivers are free not to call anything that would cause the irq
to be masked.

That said the current disable_irq and enable_irq path is using the
IRQF_PENDING infrastructure on x86.  So the only time this comes up
is for irq migration.

Eric

Re: [PATCH] Version 4 (2.6.23-rc8-mm2) Smack: Simplified Mandatory Access Control Kernel

2007-10-03 Thread Casey Schaufler

--- Al Viro <[EMAIL PROTECTED]> wrote:

> On Wed, Oct 03, 2007 at 12:51:08PM -0700, Casey Schaufler wrote:
> > > > Because you throw "simple" out the window when you require userland
> > > > assistance to perform this function.
> > > 
> > > Any more than having /tmp replaced with a symlink?
> > 
> > Yes. By the way, there's nothing that really requires that you
> > use a /smack symlink if you don't want to. /tmp can still be a
> > real directory, a mount point, a symlink to /var/tmp, or whatever
> > else you want it to be if that suits your needs better. For the
> > simplest scenarios /tmp -> /smack/tmp -> /moldy/ has every
> > other scheme I've seen thoroughly beaten.
> 
> And your point is?  If you don't use it, you get exact same complexity
> in both setups.

Thank you for your patience. Let me see if I can get my point across.

The intended Smack scenario:

1. Create /moldy at "_"
2. For each label you care about
   2a. Create /moldy/
   2b. Set the label of /moldy/ to 
3. ln -s /smack/tmp /tmp

All processes are now redirected into the appropriate place
regardless of how they come into being. It doesn't matter if
the "session" starts from busybox, login, sshd, xdm, crontab,
or out of an init script.

> > > _What_ userland intervention?  Mounting stuff under /smack/tmp and not
> > > under your /moldy?
> > 
> > Who said anything about mounting under /moldy? I never did.
> 
> Sigh...  So put the binding into fstab and be done with that.

Are you suggesting that /smack/tmp.link below is a mount point,
and that appropriate directories get mounted there? 

1. Create /moldy at "_"
2. For each label you care about
   2a. Create /moldy/
   2b. Set the label of /moldy/ to 
   2c. mount --bind /moldy/ /smack/tmp.link/
3. ln -s /smack/tmp.link /tmp

> > > Having /tmp replaced with symlink to /smack/tmp.link instead
> > > of replacing it with a symlink to /smack/tmp?
> > > 
> > > Absolute paths in that kind of thing are _wrong_.  You know where the
> > > things are on your fs.  You don't know if anything else will be visible,
> > > let alone whether it will be at the same place in all chroots or
> > > namespaces.  And no, you _can't_ make sure that fs is visible only in one
> > > place.  No fs can or has any business even trying.
> > 
> > Is the objection that there is a default value coded in?
> 
> Right now the main objection is about your lack of ability to read.

Now you sound like my daughter. :-)

> Which
> part of "it can be mounted in different chroots/namespaces, therefore
> having absolute paths doesn't work" is too hard to understand?

It's the content of a symlink, and that can be just about anything
and is not required to point to anything, which is one reason why
I made that choice. If you don't have a /tmp, or can't write to the
/tmp that exists, or have a /tmp that's a dangling symlink under
any circumstances you may have an issue. That's true regardless of
the presence or absence of /smack. All of the traditional mechanisms
for dealing with /tmp in a chrooted or namespaced environment remain.

> No, it's not about having a default.

Nuts. That would have made addressing your concern easy.

> It's about keeping an absolute pathname in virtual fs,

It's in a symlink on the filesystem, and it doesn't have to be an
absolute pathname, although since it's a symlink and the semantics
for a symlink allow that to be absolute, relative, or dangling I
don't see any reason to restrict it from being absolute.

> having all instances autosoddingmatically sharing it _and_
> having change attempt in any instance automatically affect all of them.
> If you have that kind of sharing, don't pretend that your mechanism really
> allows absolute pathnames.

Could allowing multiple distinct mounts and symlink assignments
of /smackfs address those issues? I think it would, but as you pointed
out earlier, my lack of ability to read may be clouding my understanding.

Thank you.

Casey Schaufler
[EMAIL PROTECTED]

Re: [PATCH] usbhid: report descriptor fix for MacBook JIS keyboard

2007-10-03 Thread Jiri Kosina

On Thu, 4 Oct 2007, Tomoya Adachi wrote:

> This patch fixes the problem that the Japanese MacBook doesn't recognize 
> some keys like '\'(yen, or backslash), '|'(pipe), and '_'(underscore). 
> The cause is that the MacBook JIS keyboard (jp106) sends a wrong report 
> descriptor. It says "logical maximum = 0x65", so Keyboard.0089 is 
> mapped to Key.Unknown, while it should be accepted as Key.Yen.

Hi Tomoya,

thanks a lot for debugging and fixing this. I have applied your patch to 
my tree.

-- 
Jiri Kosina
SUSE Labs

Re: [PATCH] net: fix race in process_backlog

2007-10-03 Thread Stephen Hemminger

On Wed, 03 Oct 2007 14:58:07 -0700 (PDT)
David Miller <[EMAIL PROTECTED]> wrote:

> From: Peter Zijlstra <[EMAIL PROTECTED]>
> Date: Wed, 03 Oct 2007 17:44:53 +0200
> 
> > Index: linux-2.6/net/core/dev.c
> > ===
> > --- linux-2.6.orig/net/core/dev.c
> > +++ linux-2.6/net/core/dev.c
> > @@ -2095,11 +2095,11 @@ static int process_backlog(struct napi_s
> >  
> > local_irq_disable();
> > skb = __skb_dequeue(&queue->input_pkt_queue);
> > -   local_irq_enable();
> > if (!skb) {
> > -   napi_complete(napi);
> > +   __napi_complete(napi);
> > break;
> > }
> > +   local_irq_enable();
> 
> What re-enables interrupts in the !skb path?

This looks like a better fix. The irq_enable is needed in both cases.

--- a/net/core/dev.c2007-09-27 07:19:10.0 -0700
+++ b/net/core/dev.c2007-10-03 15:03:54.0 -0700
@@ -2077,12 +2077,14 @@ static int process_backlog(struct napi_s
 
local_irq_disable();
skb = __skb_dequeue(&queue->input_pkt_queue);
-   local_irq_enable();
if (!skb) {
-   napi_complete(napi);
+   __napi_complete(napi);
+   local_irq_enable();
break;
}
 
+   local_irq_enable();
+
dev = skb->dev;
 
netif_receive_skb(skb);
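
To make the shape of the fix above easier to see, here is a small userspace
model of the loop.  It is a sketch under assumptions, not kernel code: the
interrupt flag is a plain variable, `struct model_queue` and the `model_*`
helpers are hypothetical, and I am assuming the race in question is a packet
being queued between the "queue is empty" check and the completion step.

```c
#include <stdbool.h>

static bool irqs_enabled = true;   /* models the CPU interrupt flag */

static void model_irq_disable(void) { irqs_enabled = false; }
static void model_irq_enable(void)  { irqs_enabled = true; }

/* Hypothetical stand-in for the per-CPU backlog state. */
struct model_queue {
    int  pending;     /* packets waiting in the input queue */
    bool scheduled;   /* true while the poll routine is scheduled */
};

/* Drain up to 'quota' packets; returns how many were processed. */
static int model_process_backlog(struct model_queue *q, int quota)
{
    int work = 0;

    do {
        model_irq_disable();
        if (q->pending == 0) {
            /*
             * Complete while interrupts are still off, then turn
             * them back on before leaving the loop.
             */
            q->scheduled = false;
            model_irq_enable();
            break;
        }
        q->pending--;          /* dequeue one packet */
        model_irq_enable();    /* process it with interrupts on */
        work++;
    } while (work < quota);

    return work;
}
```

Whichever way the loop exits in this model, `irqs_enabled` is true again
afterwards, which is the property the earlier version of the patch lost on
the empty-queue path.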


-- 
Stephen Hemminger <[EMAIL PROTECTED]>

Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

2007-10-03 Thread Benjamin Herrenschmidt


> We should also be leaving the INTx irqs disabled.  So no irq
> should be generated.
> 
> If you have a mask bit implemented you are required to be
> able to refire it after the msi is enabled.  I don't recall
> the requirements for when both intx and msi irqs are both
> disabled.  Intuitively I would expect no irq message to
> be generated, and at most the card would need to be polled
> manually to recognize a device event happened.
> 
> Certainly firing an irq and having it get completely lost is
> unfortunate, and a major pain if you are trying to use the
> card.
> 
> As for the previous no-op behavior that was a bug.

Well, yes and no ... A valid option here would be to use soft-masking,
which is possible because MSIs are edge interrupts. That is, basically,
when masked, just ignore them and set IRQF_PENDING, and when unmasked,
replay (which can be done with softirq if there is no HW mechanism for
that). The genirq code contains all the necessary infrastructure for
doing that stuff, it's fairly trivial, and would probably avoid stepping
in HW lalaland (how much do you bet HW generally get that masking thing
wrong ?)
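
For readers who have not seen soft-masking before, the idea can be sketched
in a few lines of userspace C.  This is only a model of the behaviour
described above, not the genirq implementation: `struct soft_irq` and the
`soft_irq_*` functions are hypothetical names, and the `pending` flag merely
plays the role that IRQF_PENDING plays in the text.

```c
#include <stdbool.h>

/* Hypothetical model of one edge-triggered interrupt line. */
struct soft_irq {
    bool masked;
    bool pending;     /* an edge arrived while masked */
    int  delivered;   /* how many times the handler ran */
};

static void soft_irq_handle(struct soft_irq *irq)
{
    irq->delivered++;
}

/* The device signals an edge. */
static void soft_irq_raise(struct soft_irq *irq)
{
    if (irq->masked) {
        irq->pending = true;    /* remember it, do not run the handler */
        return;
    }
    soft_irq_handle(irq);
}

static void soft_irq_mask(struct soft_irq *irq)
{
    irq->masked = true;
}

static void soft_irq_unmask(struct soft_irq *irq)
{
    irq->masked = false;
    if (irq->pending) {
        irq->pending = false;
        soft_irq_handle(irq);   /* replay the remembered edge */
    }
}
```

Note that several edges arriving while masked collapse into a single replay
in this model, which is acceptable for edge interrupts only if the handler
checks the device for all outstanding work.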

> The PCI spec requires disabling/masking the msi when reprogramming it.
> So as a general rule we can not do better.  Further because we are
> writing to multiple pci config registers the only way we can safely
> reprogram the message is with the msi disabled/masked on the card in
> some fashion.

Hrm... all right, that will be an issue, so migration needs real
masking.

> I suspect what needs to happen is a spec search to verify that the
> current linux behavior is at least reasonable within the spec.
> 
> Once we have verified that the generic code can not do better.
> We can look at work-arounds.   One possibility is for the generic
> code to provide some overrides for the methods for masking and
> reading/writing to a msi message.
> 
> I don't want to break anyones hardware, but at the same time I want us
> to be careful and in spec for the default case.

Ben.



Re: [PATCH] net: fix race in process_backlog

2007-10-03 Thread David Miller

From: Peter Zijlstra <[EMAIL PROTECTED]>
Date: Wed, 03 Oct 2007 17:44:53 +0200

> Index: linux-2.6/net/core/dev.c
> ===
> --- linux-2.6.orig/net/core/dev.c
> +++ linux-2.6/net/core/dev.c
> @@ -2095,11 +2095,11 @@ static int process_backlog(struct napi_s
>  
>   local_irq_disable();
>   skb = __skb_dequeue(&queue->input_pkt_queue);
> - local_irq_enable();
>   if (!skb) {
> - napi_complete(napi);
> + __napi_complete(napi);
>   break;
>   }
> + local_irq_enable();

What re-enables interrupts in the !skb path?

[PATCH] usbhid: report descriptor fix for MacBook JIS keyboard

2007-10-03 Thread Tomoya Adachi

From: Tomoya Adachi <[EMAIL PROTECTED]>

This patch fixes the problem that the Japanese MacBook doesn't recognize some
keys like '\'(yen, or backslash), '|'(pipe), and '_'(underscore).

The cause is that the MacBook JIS keyboard (jp106) sends a wrong report
descriptor.  It says "logical maximum = 0x65", so Keyboard.0089 is mapped to
Key.Unknown, while it should be accepted as Key.Yen.
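
The effect of the descriptor fix can be tried out in userspace with the
sketch below.  The offsets (53 and 59) and the values (0x65, 0xe7) are taken
from the patch; the function names are hypothetical, and `usage_in_range` is
a deliberately simplified stand-in for what the HID parser does with the
logical maximum, not a copy of it.

```c
#include <stddef.h>

/*
 * Patch the two logical-maximum bytes if they carry the wrong value.
 * Returns 1 when the descriptor was changed, 0 otherwise.
 */
static int fixup_jis_logical_max(unsigned char *rdesc, size_t rsize)
{
    if (rsize >= 60 && rdesc[53] == 0x65 && rdesc[59] == 0x65) {
        rdesc[53] = rdesc[59] = 0xe7;
        return 1;
    }
    return 0;
}

/* Simplified model: a usage is accepted only up to the logical maximum. */
static int usage_in_range(unsigned int usage, unsigned int logical_max)
{
    return usage <= logical_max;
}
```

With the original maximum of 0x65, usage 0x89 falls outside the accepted
range in this model; after the fixup raises the maximum to 0xe7 it is inside.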

Signed-off-by: Tomoya Adachi <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.23-rc9/Documentation/dontdiff linux-2.6.23-rc9/drivers/hid/usbhid/hid-quirks.c linux-2.6.23-rc9-patched/drivers/hid/usbhid/hid-quirks.c
--- linux-2.6.23-rc9/drivers/hid/usbhid/hid-quirks.c 2007-10-04 02:52:01.0 +0900
+++ linux-2.6.23-rc9-patched/drivers/hid/usbhid/hid-quirks.c 2007-10-04 06:03:45.0 +0900
@@ -619,6 +619,8 @@ static const struct hid_rdesc_blacklist 
{ USB_VENDOR_ID_CYPRESS, USB_DEVICE_ID_CYPRESS_BARCODE_1, HID_QUIRK_RDESC_SWAPPED_MIN_MAX },
{ USB_VENDOR_ID_CYPRESS, USB_DEVICE_ID_CYPRESS_BARCODE_2, HID_QUIRK_RDESC_SWAPPED_MIN_MAX },
 
+   { USB_VENDOR_ID_APPLE, USB_DEVICE_ID_APPLE_GEYSER4_JIS, HID_QUIRK_RDESC_MACBOOK_JIS },
+
{ 0, 0 }
 };
 
@@ -927,6 +929,18 @@ static void usbhid_fixup_cypress_descrip
printk(KERN_INFO "Fixing up Cypress report descriptor\n");
 }
 
+/*
+ * MacBook JIS keyboard has wrong logical maximum
+ */
+static void usbhid_fixup_macbook_descriptor(unsigned char *rdesc, int rsize)
+{
+   if (rsize >= 60 && rdesc[53] == 0x65
+   && rdesc[59] == 0x65) {
+   printk(KERN_INFO "Fixing up MacBook JIS keyboard report descriptor\n");
+   rdesc[53] = rdesc[59] = 0xe7;
+   }
+}
+
 
 static void __usbhid_fixup_report_descriptor(__u32 quirks, char *rdesc, unsigned rsize)
 {
@@ -941,6 +955,9 @@ static void __usbhid_fixup_report_descri
 
if (quirks & HID_QUIRK_RDESC_PETALYNX)
usbhid_fixup_petalynx_descriptor(rdesc, rsize);
+
+   if (quirks & HID_QUIRK_RDESC_MACBOOK_JIS)
+   usbhid_fixup_macbook_descriptor(rdesc, rsize);
 }
 
 /**
diff -uprN -X linux-2.6.23-rc9/Documentation/dontdiff linux-2.6.23-rc9/include/linux/hid.h linux-2.6.23-rc9-patched/include/linux/hid.h
--- linux-2.6.23-rc9/include/linux/hid.h 2007-10-04 02:52:03.0 +0900
+++ linux-2.6.23-rc9-patched/include/linux/hid.h 2007-10-04 06:00:57.0 +0900
@@ -285,6 +285,7 @@ struct hid_item {
 #define HID_QUIRK_RDESC_LOGITECH   0x0002
 #define HID_QUIRK_RDESC_SWAPPED_MIN_MAX0x0004
 #define HID_QUIRK_RDESC_PETALYNX   0x0008
+#define HID_QUIRK_RDESC_MACBOOK_JIS0x0010
 
 /*
  * This is the global environment of the parser. This information is

Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

2007-10-03 Thread Eric W. Biederman

Loic Prylli <[EMAIL PROTECTED]> writes:

> Hi,
>
> We observe a problem with MSI since kernel 2.6.21 where interrupts would
> randomly stop working. We have tracked it down to the new
> msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not
> providing a "native" MSI mask, it was a no-op before, and now it
> disables MSI in the MSI-ctl register which according to the PCI spec is
> interpreted as reverting the device to legacy interrupts. If such a
> device try to generate a new interrupt during the "masked" window, the
> device will try a legacy interrupt which is generally
> ignored/never-acked and cause interrupts to no longer work for the
> device/driver combination (even after the enable bit is restored).

We should also be leaving the INTx irqs disabled.  So no irq
should be generated.

If you have a mask bit implemented you are required to be
able to refire it after the msi is enabled.  I don't recall
the requirements for when both intx and msi irqs are both
disabled.  Intuitively I would expect no irq message to
be generated, and at most the card would need to be polled
manually to recognize a device event happened.

Certainly firing an irq and having it get completely lost is
unfortunate, and a major pain if you are trying to use the
card.

As for the previous no-op behavior that was a bug.

> Is there anything apart from irq migration that strongly requires
> masking? Is it possible to do the irq migration without masking?

enable_irq/disable_irq.  Although we can get away with a software
emulation there and those are only needed if the driver calls them.

The PCI spec requires disabling/masking the msi when reprogramming it.
So as a general rule we can not do better.  Further because we are
writing to multiple pci config registers the only way we can safely
reprogram the message is with the msi disabled/masked on the card in
some fashion.

I suspect what needs to happen is a spec search to verify that the
current linux behavior is at least reasonable within the spec.

Once we have verified that the generic code cannot do better,
we can look at work-arounds.  One possibility is for the generic
code to provide some overrides for the methods for masking and
reading/writing to a msi message.

I don't want to break anyone's hardware, but at the same time I want us
to be careful and in spec for the default case.

Eric

Re: [PATCH 3/8] scsi: megaraid_sas - add module param max_sectors, cmd_per_lun

2007-10-03 Thread Randy Dunlap

On Mon, 01 Oct 2007 11:51:48 -0400 bo yang wrote:

> Adding module parameters to configure max sectors per request & # of cmds
> per lun.
> 
> Signed-off-by: Bo Yang <[EMAIL PROTECTED]>
> 
> ---
>  drivers/scsi/megaraid/megaraid_sas.c |   68 -
>  drivers/scsi/megaraid/megaraid_sas.h |2
>  2 files changed, 68 insertions(+), 2 deletions(-)
> 
> diff -uprN linux-2.6.22_orig/drivers/scsi/megaraid/megaraid_sas.c 
> linux-2.6.22_new/drivers/scsi/megaraid/megaraid_sas.c
> --- linux-2.6.22_orig/drivers/scsi/megaraid/megaraid_sas.c2007-10-01 
> 00:14:29.0 -0700
> +++ linux-2.6.22_new/drivers/scsi/megaraid/megaraid_sas.c 2007-10-01 
> 02:15:16.0 -0700
> @@ -62,6 +62,23 @@ MODULE_PARM_DESC(fast_load,
>   "megasas: Faster loading of the driver, skips physical devices! "\
>   "(default = 0)");
>  
> +/*
> + * Number of sectors per IO command will be set in megasas_init_mfi
> + * if user does not provide
> + */
> +static unsigned int max_sectors;
> +module_param_named(max_sectors, max_sectors, int, 0);
> +MODULE_PARM_DESC(max_sectors,
> + "Maximum number of sectors per IO command");

Are you sure that you want these parameters hidden (permission = 0)
instead of readable via sysfs?

(same applies to the fast_load parameter patch also)


> +/*
> + * Number of cmds per logical unit
> + */
> +static unsigned int cmd_per_lun = MEGASAS_DEFAULT_CMD_PER_LUN;
> +module_param_named(cmd_per_lun, cmd_per_lun, int, 0);
> +MODULE_PARM_DESC(cmd_per_lun,
> + "Maximum number of commands per logical unit (default=128)");
> +
>  MODULE_LICENSE("GPL");
>  MODULE_VERSION(MEGASAS_VERSION);
>  MODULE_AUTHOR("[EMAIL PROTECTED]");


---
~Randy

[PATCH] hugetlb: Fix pool resizing corner case V2

2007-10-03 Thread Adam Litke


Changes in V2:
 - Removed now unnecessary check as suggested by Ken Chen

When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
are careful to keep enough pages around to satisfy reservations.  But the
calculation is flawed for the following scenario:

Action                          Pool Counters (Total, Free, Resv)
======                          =================================
Set pool to 1 page              1 1 0
Map 1 page MAP_PRIVATE          1 1 0
Touch the page to fault it in   1 0 0
Set pool to 3 pages             3 2 0
Map 2 pages MAP_SHARED          3 2 2
Set pool to 2 pages             2 1 2 <-- Mistake, should be 3 2 2
Touch the 2 shared pages        2 0 1 <-- Program crashes here

The last touch above will terminate the process due to lack of huge pages.

This patch corrects the calculation so that it factors in pages being used
for private mappings.  Andrew, this is a standalone fix suitable for
mainline.  It is also now corrected in my latest dynamic pool resizing
patchset which I will send out soon.
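
The arithmetic behind the fix can be checked with a small standalone sketch.
The two expressions are the old and new `max()` arguments from the patch;
everything else (`struct pool_counters`, `shrink_target_old`,
`shrink_target_new`) is a hypothetical wrapper written for this example.

```c
/* Hypothetical snapshot of the three counters used in the scenario. */
struct pool_counters {
    unsigned long total;   /* nr_huge_pages */
    unsigned long free;    /* free_huge_pages */
    unsigned long resv;    /* resv_huge_pages */
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
    return a > b ? a : b;
}

/* Old floor: only outstanding reservations are protected. */
static unsigned long shrink_target_old(const struct pool_counters *p,
                                       unsigned long count)
{
    return max_ul(count, p->resv);
}

/* New floor: reservations plus pages already in use (total - free). */
static unsigned long shrink_target_new(const struct pool_counters *p,
                                       unsigned long count)
{
    return max_ul(count, p->resv + p->total - p->free);
}
```

Plugging in the state just before the shrink in the scenario (Total 3,
Free 2, Resv 2) with a requested size of 2: the old floor is max(2, 2) = 2,
while the new one is max(2, 2 + 3 - 2) = 3, so the pool stays at 3 pages.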

Signed-off-by: Adam Litke <[EMAIL PROTECTED]>
Acked-by: Ken Chen <[EMAIL PROTECTED]>
---

 mm/hugetlb.c |8 +++-
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 84c795e..b6b3b64 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -224,14 +224,14 @@ static void try_to_free_low(unsigned long count)
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
+   if (count >= nr_huge_pages)
+   return;
if (PageHighMem(page))
continue;
list_del(&page->lru);
update_and_free_page(page);
free_huge_pages--;
free_huge_pages_node[page_to_nid(page)]--;
-   if (count >= nr_huge_pages)
-   return;
}
}
 }
@@ -247,11 +247,9 @@ static unsigned long set_max_huge_pages(unsigned long count)
if (!alloc_fresh_huge_page())
return nr_huge_pages;
}
-   if (count >= nr_huge_pages)
-   return nr_huge_pages;
 
spin_lock(&hugetlb_lock);
-   count = max(count, resv_huge_pages);
+   count = max(count, resv_huge_pages + nr_huge_pages - free_huge_pages);
try_to_free_low(count);
while (count < nr_huge_pages) {
struct page *page = dequeue_huge_page(NULL, 0);

Re: [-mm patch] unexport noautodma

2007-10-03 Thread Bartlomiej Zolnierkiewicz

On Sunday 09 September 2007, Adrian Bunk wrote:
> noautodma can now be unexported.
> 
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

applied

Re: pgd_none_or_clear_bad strangeness?

2007-10-03 Thread Matt Mackall

On Wed, Oct 03, 2007 at 07:18:23PM +0100, Hugh Dickins wrote:
> On Wed, 3 Oct 2007, Nick Piggin wrote:
> > On Tue, Oct 02, 2007 at 05:20:03PM -0500, Matt Mackall wrote:
> > > In lib/pagewalk.c, I've been using the various forms of
> > > {pgd,pud,pmd}_none_or_clear_bad while walking page tables as that
> > > seemed the canonical way to do things. Lately (eg with -rc7-mm1),
> > > these have been triggering messages like "bad pgd 0x01e3" and causing
> > > nasty double faults. It appears this is actually triggered at the pmd
> > > level (mm/memory.c:116), though it appears to produce the wrong
> > > message.
> 
> I guess the "wrong message" is an artifact of pud/pmd folding;
> but I get too confused by the different levels myself to want to
> think more about it - I'll just assume it's "right" somehow ;)
> 
> > > 
> > > Has something changed here? I'm pretty sure this used to work! Is this
> 
> I don't know of anything changing here, sorry.
> 
> > > not a kosher thing to do? Does it make any sense I'd repeatedly run
> > > into a bad pmd in the middle of bash's page table right after boot?
> > > The simple _none variant seems to work, but I worry that it's papering
> > > over a real problem.
> > 
> > No, I think that should be the right thing to do for userspace pages.
> > You're not walking into a hugetlb area or a kernel mapping are you?
> > (the bad pgd: line could be important... 0x01e3 would be a linear kernel
> > mapping I think?).
> 
> I should have spent more time reading Nick's reply and less time trying
> to work it out for myself!  Yes, that's the conclusion I came to, for
> some reason you're now going beyond the user vmas and walking into the
> linear kernel mapping, which has _PAGE_GLOBAL and _PAGE_PSE bits set.

Indeed, that's precisely what's happening. I'm walking one page past
the end of userspace. 

And the reason is I changed my walker from using for loops to do/while
loops at Nick's insistence, so start==end no longer gets noticed
immediately. This also explains why the bug doesn't manifest in
lguest: no PSE mappings.
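
The difference between the two loop shapes is easy to reproduce outside the
kernel.  The sketch below just counts iterations; it is not the pagewalk
code, and `MODEL_PAGE_SIZE`, `steps_for_loop` and `steps_do_while` are
hypothetical names.

```c
#define MODEL_PAGE_SIZE 4096UL

/* for loop: the bound is checked before the first iteration. */
static int steps_for_loop(unsigned long start, unsigned long end)
{
    int steps = 0;

    for (unsigned long addr = start; addr < end; addr += MODEL_PAGE_SIZE)
        steps++;
    return steps;
}

/* do/while: the body runs once before the bound is ever checked. */
static int steps_do_while(unsigned long start, unsigned long end)
{
    int steps = 0;
    unsigned long addr = start;

    do {
        steps++;
        addr += MODEL_PAGE_SIZE;
    } while (addr < end);
    return steps;
}
```

For a non-empty range both versions take the same number of steps, but with
start == end the for loop takes none while the do/while version still visits
one page, so a caller of the do/while form has to reject the empty range
itself.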

Thanks, guys!

-- 
Mathematics is the supreme nostalgia of our time.

Re: [PATCH -mm] IrCOMM discovery indication simplification

2007-10-03 Thread Andrew Morton

On Mon, 1 Oct 2007 02:29:51 +0300
Samuel Ortiz <[EMAIL PROTECTED]> wrote:

> Hi Andrew,
> 
> Every IrCOMM socket is registered with the discovery subsystem, so we don't
> need to loop over all of them for every discovery event. We just need to
> do it for the registered IrCOMM socket.
> 
> Would you please consider this patch for -mm inclusion ?

Sure.  I don't merge ircomm patches directly so I added this to my
to-send-to-davem queue.

> From: Ryan Reading <[EMAIL PROTECTED]>
> Signed-off-by: Samuel Ortiz <[EMAIL PROTECTED]>

Please put the From: right at the start of the changelog, not at the end
like this, thanks.



Re: [PATCH -mm] intel-iommu sg chaining support

2007-10-03 Thread Keshavamurthy, Anil S

On Wed, Oct 03, 2007 at 02:12:03PM -0700, Andrew Morton wrote:
> On Mon, 1 Oct 2007 09:12:56 -0700
> "Keshavamurthy, Anil S" <[EMAIL PROTECTED]> wrote:
> 
> > On Sat, Sep 29, 2007 at 05:16:38AM -0700, FUJITA Tomonori wrote:
> > > 
> > >x86_64 defines ARCH_HAS_SG_CHAIN. So if IOMMU implementations don't
> > >support sg chaining, we will get data corruption.
> > >Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
> > 
> > Acked-by: Anil S Keshavamurthy <[EMAIL PROTECTED]>
> > 
> 
> Am I correct in believing that this patch is needed only when the
> chaining patches which are presently in git-block are combined with the
> intel-iommu work which is presently in -mm?
Yes, that is correct. 

-Anil

[patch take 2][Intel-IOMMU] Fix for IOMMU early crash

2007-10-03 Thread Keshavamurthy, Anil S

Subject: [patch][Intel-IOMMU] Fix for IOMMU early crash

The sysdata field of struct pci_dev is heavily overloaded, and the
IOMMU code is currently broken because it depends on this field.

This patch introduces a new field in struct pci_dev to
hold per-device IOMMU private data.

Signed-off-by: Anil S Keshavamurthy <[EMAIL PROTECTED]>

---
 drivers/pci/intel-iommu.c |   22 +++---
 include/linux/pci.h   |1 +
 2 files changed, 12 insertions(+), 11 deletions(-)

Index: 2.6-mm/drivers/pci/intel-iommu.c
===
--- 2.6-mm.orig/drivers/pci/intel-iommu.c   2007-10-03 13:48:18.0 -0700
+++ 2.6-mm/drivers/pci/intel-iommu.c    2007-10-03 13:48:41.0 -0700
@@ -1348,7 +1348,7 @@
list_del(&info->link);
list_del(&info->global);
if (info->dev)
-   info->dev->sysdata = NULL;
+   info->dev->iommu_private = NULL;
spin_unlock_irqrestore(&device_domain_lock, flags);
 
detach_domain_for_dev(info->domain, info->bus, info->devfn);
@@ -1361,7 +1361,7 @@
 
 /*
  * find_domain
- * Note: we use struct pci_dev->sysdata stores the info
+ * Note: we use struct pci_dev->iommu_private stores the info
  */
 struct dmar_domain *
 find_domain(struct pci_dev *pdev)
@@ -1369,7 +1369,7 @@
struct device_domain_info *info;
 
/* No lock here, assumes no domain exit in normal case */
-   info = pdev->sysdata;
+   info = pdev->iommu_private;
if (info)
return info->domain;
return NULL;
@@ -1519,7 +1519,7 @@
}
list_add(&info->link, &domain->devices);
list_add(&info->global, &device_domain_list);
-   pdev->sysdata = info;
+   pdev->iommu_private = info;
spin_unlock_irqrestore(&device_domain_lock, flags);
return domain;
 error:
@@ -1579,7 +1579,7 @@
 static inline int iommu_prepare_rmrr_dev(struct dmar_rmrr_unit *rmrr,
struct pci_dev *pdev)
 {
-   if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+   if (pdev->iommu_private == DUMMY_DEVICE_DOMAIN_INFO)
return 0;
return iommu_prepare_identity_map(pdev, rmrr->base_address,
rmrr->end_address + 1);
@@ -1595,7 +1595,7 @@
int ret;
 
for_each_pci_dev(pdev) {
-   if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO ||
+   if (pdev->iommu_private == DUMMY_DEVICE_DOMAIN_INFO ||
!IS_GFX_DEVICE(pdev))
continue;
printk(KERN_INFO "IOMMU: gfx device %s 1-1 mapping\n",
@@ -1836,7 +1836,7 @@
int prot = 0;
 
BUG_ON(dir == DMA_NONE);
-   if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+   if (pdev->iommu_private == DUMMY_DEVICE_DOMAIN_INFO)
return virt_to_bus(addr);
 
domain = get_valid_domain_for_dev(pdev);
@@ -1900,7 +1900,7 @@
unsigned long start_addr;
struct iova *iova;
 
-   if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+   if (pdev->iommu_private == DUMMY_DEVICE_DOMAIN_INFO)
return;
domain = find_domain(pdev);
BUG_ON(!domain);
@@ -1974,7 +1974,7 @@
size_t size = 0;
void *addr;
 
-   if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+   if (pdev->iommu_private == DUMMY_DEVICE_DOMAIN_INFO)
return;
 
domain = find_domain(pdev);
@@ -2032,7 +2032,7 @@
unsigned long start_addr;
 
BUG_ON(dir == DMA_NONE);
-   if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+   if (pdev->iommu_private == DUMMY_DEVICE_DOMAIN_INFO)
return intel_nontranslate_map_sg(hwdev, sg, nelems, dir);
 
domain = get_valid_domain_for_dev(pdev);
@@ -2234,7 +2234,7 @@
for (i = 0; i < drhd->devices_cnt; i++) {
if (!drhd->devices[i])
continue;
-   drhd->devices[i]->sysdata = DUMMY_DEVICE_DOMAIN_INFO;
+   drhd->devices[i]->iommu_private = DUMMY_DEVICE_DOMAIN_INFO;
}
}
 }
Index: 2.6-mm/include/linux/pci.h
===
--- 2.6-mm.orig/include/linux/pci.h 2007-10-03 13:48:20.0 -0700
+++ 2.6-mm/include/linux/pci.h  2007-10-03 13:49:08.0 -0700
@@ -195,6 +195,7 @@
 #ifdef CONFIG_PCI_MSI
struct list_head msi_list;
 #endif
+   void *iommu_private; /* hook for IOMMU specific extension */
 };
 
 extern struct pci_dev *alloc_pci_dev(void);
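
The idea behind the patch can be sketched in user space as follows. This is a simplified model under stated assumptions, not the driver code: struct fake_pci_dev, struct fake_domain_info, FAKE_DUMMY_DOMAIN_INFO and fake_find_domain() are hypothetical stand-ins, and the sentinel is modelled as the address of a static object purely for illustration. A dedicated pointer holds the per-device IOMMU data, and a distinguished sentinel marks devices that bypass translation, so the overloaded sysdata field is never touched.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct fake_domain_info {
	int domain_id;
};

struct fake_pci_dev {
	void *sysdata;        /* owned by arch/bus code, left untouched */
	void *iommu_private;  /* reserved for the IOMMU driver */
};

/* Sentinel meaning "do not translate this device". Using the address of
 * a static object is an assumption made for this sketch. */
static struct fake_domain_info fake_dummy_info;
#define FAKE_DUMMY_DOMAIN_INFO ((void *)&fake_dummy_info)

/* Returns the domain id, -1 if the device bypasses the IOMMU, or
 * -2 if no domain has been attached yet. */
static int fake_find_domain(const struct fake_pci_dev *pdev)
{
	const struct fake_domain_info *info;

	if (pdev->iommu_private == FAKE_DUMMY_DOMAIN_INFO)
		return -1;
	info = pdev->iommu_private;
	return info ? info->domain_id : -2;
}
```

Because the IOMMU state lives in its own field, other code remains free to store whatever it wants in sysdata without confusing the lookup.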

Re: 2.6.23-rc7-mm1 AHCI ATA errors -- won't boot

2007-10-03 Thread Jeff Garzik


Berck E. Nash wrote:

Greetings,

I get a few million of these on boot-- the system never actually boots.
Works fine in 2.6.23-rc7.

[   50.456012] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[   50.462484] ata2.00: irq_stat 0x4001
[   50.466441] ata2.00: cmd e5/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
[   50.466442]  res 51/04:00:01:01:80/00:00:00:00:00/a0 Emask 0x1 (device error)
[   50.481914] ata2.00: status: {DRDY ERR }
[   50.485876] ata2.00: error: {ABRT }
[   50.489533] ata2.00: configured for UDMA/133
[   50.493839] ata2: EH complete


FWIW I haven't had time to debug this, so I'm going to simply revert the 
patch, and make sure it does not make it into 2.6.24.


Jeff




[ANNOUNCE] GIT 1.5.3.4

2007-10-03 Thread Junio C Hamano

The latest maintenance release GIT 1.5.3.4 is available at the
usual places:

  http://www.kernel.org/pub/software/scm/git/

  git-1.5.3.4.tar.{gz,bz2}  (tarball)
  git-htmldocs-1.5.3.4.tar.{gz,bz2} (preformatted docs)
  git-manpages-1.5.3.4.tar.{gz,bz2} (preformatted docs)
  RPMS/$arch/git-*-1.5.3.4-1.$arch.rpm  (RPM)

GIT v1.5.3.4 Release Notes
==========================

Fixes since v1.5.3.3
--------------------

 * Change to "git-ls-files" in v1.5.3.3 that was introduced to support
   partial commit of removal better had a segfaulting bug, which was
   diagnosed and fixed by Keith and Carl.

 * Performance improvements for rename detection have been backported
   from the 'master' branch.

 * "git-for-each-ref --format='%(numparent)'" was not working
   correctly at all, and --format='%(parent)' was not working for
   merge commits.

 * Sample "post-receive-hook" incorrectly sent out push
   notification e-mails marked as "From: " the committer of the
   commit that happened to be at the tip of the branch that was
   pushed, not from the person who pushed.

 * "git-remote" did not exit non-zero status upon error.

 * "git-add -i" did not respond very well to EOF from tty nor
   bogus input.

 * "git-rebase -i" squash subcommand incorrectly made the
   author of later commit the author of resulting commit,
   instead of taking from the first one in the squashed series.

 * "git-stash apply --index" was not documented.

 * autoconfiguration learned that the "ar" command is found as "gar" on
   some systems.



Changes since v1.5.3.3 are as follows:

Andy Parkins (1):
  post-receive-hook: Remove the From field from the generated email header so that the pusher's name is used

Carl Worth (1):
  Add test case for ls-files --with-tree

Federico Mena Quintero (4):
  Say when --track is useful in the git-checkout docs.
  Add documentation for --track and --no-track to the git-branch docs.
  Note that git-branch will not automatically checkout the new branch
  Make git-pull complain and give advice when there is nothing to merge with

Jari Aalto (1):
  git-remote: exit with non-zero status after detecting errors.

Jean-Luc Herren (2):
  git-add--interactive: Allow Ctrl-D to exit
  git-add--interactive: Improve behavior on bogus input

Jeff King (1):
  diffcore-rename: cache file deltas

Johan Herland (1):
  Mention 'cpio' dependency in INSTALL

Johannes Schindelin (2):
  rebase -i: squash should retain the authorship of the _first_ commit
  Fix typo in config.txt

Junio C Hamano (5):
  Whip post 1.5.3.3 maintenance series into shape.
  git-commit: initialize TMP_INDEX just to be sure.
  for-each-ref: fix %(numparent) and %(parent)
  rename diff_free_filespec_data_large() to diff_free_filespec_blob()
  GIT 1.5.3.4

Keith Packard (1):
  Must not modify the_index.cache as it may be passed to realloc at some point.

Miklos Vajna (1):
  git stash: document apply's --index switch

Robert Schiele (1):
  the ar tool is called gar on some systems

Steffen Prohaska (1):
  fixed link in documentation of diff-options


Re: [PATCH -mm] intel-iommu sg chaining support

2007-10-03 Thread Andrew Morton

On Mon, 1 Oct 2007 09:12:56 -0700
"Keshavamurthy, Anil S" <[EMAIL PROTECTED]> wrote:

> On Sat, Sep 29, 2007 at 05:16:38AM -0700, FUJITA Tomonori wrote:
> > 
> >x86_64 defines ARCH_HAS_SG_CHAIN. So if IOMMU implementations don't
> >support sg chaining, we will get data corruption.
> >Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
> 
> Acked-by: Anil S Keshavamurthy <[EMAIL PROTECTED]>
> 

Am I correct in believing that this patch is needed only when the
chaining patches which are presently in git-block are combined with the
intel-iommu work which is presently in -mm?

MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

2007-10-03 Thread Loic Prylli

Hi,

We observe a problem with MSI since kernel 2.6.21 where interrupts
randomly stop working. We have tracked it down to the new
msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not
providing a "native" MSI mask, it was a no-op before, and now it
disables MSI in the MSI-ctl register, which according to the PCI spec is
interpreted as reverting the device to legacy interrupts. If such a
device tries to generate a new interrupt during the "masked" window, it
will raise a legacy interrupt, which is generally ignored/never-acked
and causes interrupts to stop working for the device/driver
combination (even after the enable bit is restored).
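
The failure mode can be illustrated with a toy model. This is only a sketch under stated assumptions: struct toy_dev and both functions are hypothetical, and the model reduces the device to a few flags rather than reproducing the kernel's msi_set_mask_bit or real PCI config accesses.

```c
#include <stdbool.h>

/* Toy model of a device's interrupt path; all names are hypothetical. */
struct toy_dev {
	bool has_mask_bit;   /* device implements a native MSI mask */
	bool msi_enabled;    /* MSI enable bit in the MSI control register */
	bool masked;         /* native mask bit, used only if has_mask_bit */
	bool pending;        /* MSI held back while masked */
	int msi_delivered;   /* MSIs that reached the handler */
	int intx_raised;     /* legacy interrupts that nobody services */
};

/* "Mask" as described in the report: without a native mask bit, MSI
 * itself is switched off for the duration of the masked window. */
static void toy_set_mask(struct toy_dev *d, bool mask)
{
	if (d->has_mask_bit) {
		d->masked = mask;
		if (!mask && d->pending) {
			d->pending = false;
			d->msi_delivered++;   /* deferred MSI sent on unmask */
		}
	} else {
		d->msi_enabled = !mask;
	}
}

static void toy_raise_irq(struct toy_dev *d)
{
	if (!d->msi_enabled)
		d->intx_raised++;     /* falls back to a legacy interrupt */
	else if (d->masked)
		d->pending = true;    /* remembered until unmasked */
	else
		d->msi_delivered++;
}
```

In this model, a device without a native mask that raises an interrupt inside the masked window ends up with intx_raised == 1 and msi_delivered == 0, while a device with a native mask delivers the deferred MSI once it is unmasked.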


Is there anything apart from irq migration that strongly requires
masking? Is it possible to do the irq migration without masking?



Loic


Re: [PATCH] Version 4 (2.6.23-rc8-mm2) Smack: Simplified Mandatory Access Control Kernel

2007-10-03 Thread Casey Schaufler


--- Alan Cox <[EMAIL PROTECTED]> wrote:

> > An embedded system that does not have user logins but that does
> > have applications that require separation, perhaps a mobile communication
> > device with application download capability, is just one example
> > where the smack symlink implementation provides the required
> > function without requiring application support.
> 
> I don't see what is such a problem here. For your mobile example you'd
> give the application download side its own /tmp via mount.
> 
> It's actually better that this is done in user space, as it's more flexible
> that way and can be tuned arbitrarily to meet interesting or bizarre real
> world cases.

I admit to being impressed by the wide variety of mount options
currently available. In many cases this will be the best approach.
/tmp is a typical use for a smack symlink, but not the only one.


Casey Schaufler
[EMAIL PROTECTED]

Re: [Samba] 2.6.22/realtek bug in hardware, any kernel work-around?

2007-10-03 Thread Francois Romieu

Justin Piszcz <[EMAIL PROTECTED]> :
[...]

The bug is fixed in 2.6.23-rc9. Try it.

-- 
Ueimor

[PATCH 8/8] scsi: megaraid_sas - Update version and changelog

2007-10-03 Thread bo yang

Update version and changelog. 
Updated "LSI Logic" to new name "LSI"

Signed-off-by: Bo Yang <[EMAIL PROTECTED]>

---
 Documentation/scsi/ChangeLog.megaraid_sas |  160 
 drivers/scsi/megaraid/megaraid_sas.c  |   10 -
 drivers/scsi/megaraid/megaraid_sas.h  |8 -
 3 files changed, 169 insertions(+), 9 deletions(-)

diff -uprN linux-2.6.22_orig/Documentation/scsi/ChangeLog.megaraid_sas linux-2.6.22_new/Documentation/scsi/ChangeLog.megaraid_sas
--- linux-2.6.22_orig/Documentation/scsi/ChangeLog.megaraid_sas 2007-10-01 00:03:59.0 -0700
+++ linux-2.6.22_new/Documentation/scsi/ChangeLog.megaraid_sas  2007-10-01 00:03:59.0 -0700
@@ -1,3 +1,163 @@
+1 Release Date: Thur. Sep. 27 10:09:32 PDT 2007 -
+   (emaild-id:[EMAIL PROTECTED])
+   Sumant Patro
+   Bo Yang 
+
+2 Current Version : 00.00.03.16-rc1
+3 Older Version   : 00.00.03.15
+
+1. Increased MFI_POLL_TIMEOUT_SECS to 60 seconds from 10. FW may take a max of 60 seconds to
+   respond to the INIT cmd.
+
+1 Release Date: Fri. Sep. 07 16:30:43 PST 2007 -
+   (emaild-id:[EMAIL PROTECTED])
+   Sumant Patro
+   Bo Yang 
+
+2 Current Version : 00.00.03.15
+3 Older Version   : 00.00.03.14
+
+1. Added module parameter "poll_mode_io" to support for "polling" (reduced interrupt operation).
+   In this mode, IO completion interrupts are delayed. At the end of initiating IOs,
+   the driver schedules for cmd completion if there are pending cmds to be completed.
+   A timer-based interrupt has also been added to prevent IO completion processing from
+   being delayed indefinitely in the case that no new IOs are initiated.
+
+1 Release Date: Fri. Sep. 07 16:30:43 PST 2007 -
+   (emaild-id:[EMAIL PROTECTED])
+   Sumant Patro
+   Bo Yang 
+
+2 Current Version : 00.00.03.14
+3 Older Version   : 00.00.03.13
+
+1. Setting the max_sectors_per_req based on max SGL supported by the FW. Prior versions calculated
+   this value from controller info (max_sectors_1, max_sectors_2). For certain controllers/FW,
+   this was resulting in a value greater than max SGL supported by the FW. Issue was first
+   reported by users running LUKS+XFS with megaraid_sas.
+   Thanks to RB for providing the logs and duplication steps that helped to get to the root
+   cause of the issue.
+2. Increased MFI_POLL_TIMEOUT_SECS to 60 seconds from 10. FW may take a max of 60 seconds to
+   respond to the INIT cmd.
+
+1 Release Date: Fri. June. 15 16:30:43 PST 2007 -
+   (emaild-id:[EMAIL PROTECTED])
+   Sumant Patro
+   Bo Yang 
+
+2 Current Version : 00.00.03.13
+3 Older Version   : 00.00.03.12
+
+1. Added the megasas_reset_timer routine to intercept cmd timeout and throttle io.
+
+On Fri, 2007-03-16 at 16:44 -0600, James Bottomley wrote:
+It looks like megaraid_sas at least needs this to throttle its commands
+> as they begin to time out.  The code keeps the existing transport
+> template use of eh_timed_out (and allows the transport to override the
+> host if they both have this callback).
+> 
+> James
+
+1 Release Date: Sat May. 12 16:30:43 PST 2007 -
+   (emaild-id:[EMAIL PROTECTED])
+   Sumant Patro
+   Bo Yang 
+
+2 Current Version : 00.00.03.12
+3 Older Version   : 00.00.03.11
+
+1.  When MegaSAS driver receives reset call from OS, driver waits in reset
+routine for max 3 minutes for all pending command completion. Now driver will
+call completion routine every 5 seconds from the reset routine instead of
+waiting for depending on cmd completion from isr path.
+
+1 Release Date: Mon Apr. 30 10:25:52 PST 2007 -
+   (emaild-id:[EMAIL PROTECTED])
+   Sumant Patro
+   Bo Yang 
+
+2 Current Version : 00.00.03.11
+3 Older Version   : 00.00.03.09
+
+   1. Following module parameters added -
+   fast_load: Faster loading of the driver, skips physical devices scanning thereby
+   reducing the time to load driver.
+   cmd_per_lun: Maximum number of commands per logical unit
+   max_sectors: Maximum number of sectors per IO command
+   2. Memory Manager for IOCTL removed for 2.6 kernels.
+  pci_alloc_consistent replaced by dma_alloc_coherent. With this 
+  change there is no need of memory manager in the driver code
+
+   On Wed, 2007-02-07 at 13:30 -0800, Andrew Morton wrote:
+   > I suspect all this horror is due to stupidity in the DMA API.
+   >
+   > pci_alloc_consistent() just goes and assumes GFP_ATOMIC, whereas
+   > the caller (megasas_mgmt_fw_ioctl) would have been perfectly happy
+   > to use GFP_KERNEL.
+   >
+

[PATCH] RCU torture update for preemption

2007-10-03 Thread Steven Rostedt

Paul,

I ran your original preemption test of RCU torture, and after several
minutes, my preempt boost patch had one Preemption stall.  I then
disabled preemption boosting, and ran the preempt torture again, and it
seemed to never stall.  Something seemed strange, so I took a look.

Looks like you have a single thread that will run at max prio that runs
for 10 secs and then sleeps again. This thread seems to only push rcu
readers around. But it doesn't seem to do much else. That is a good test
to see if RCU readers can handle being pushed around, but it doesn't
test preemption boosting.

To do that, I modified the test to create CPUS-1 preempt boost hogs (or
1 if it is UP). But instead of putting it at max prio, I set it to the
lowest RT prio of 1. This way it's still at a higher priority than the
readers. I also switched the writers to run at 1+n where n increases for
every fake writer there is.
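
The priority layout described above can be written down as a small sketch. The helper names are hypothetical and do not appear in the patch, and numbering the fake writers from 0 is an assumption; the patch itself only sets sched_priority to 1 + id.

```c
/* Lowest real-time priority, used for the preempt hogs. */
#define SKETCH_HOG_PRIO 1

/* Number of preempt hogs: one fewer than the CPU count, but at least
 * one so that a uniprocessor system still gets a hog. */
static int sketch_nr_preempt_hogs(int ncpus)
{
	return ncpus > 1 ? ncpus - 1 : 1;
}

/* Fake writer n (assumed to count from 0) runs at RT priority 1 + n,
 * so every writer is at or above the hogs and above the readers. */
static int sketch_fakewriter_prio(int n)
{
	return SKETCH_HOG_PRIO + n;
}
```

On a 4-CPU machine this gives three hogs at priority 1, with fake writers at priorities 1, 2, 3 and so on.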

Without preempt boosting, after a couple of minutes I had 83 preemption
stalls.  When I turned my boosting back on, after several minutes (still
running as I type this) it has no preemption stalls.

This seems to be a good test for RCU preemption boosting.

-- Steve

PS. I got rid of your rcu_preeempt_task for rcu_preempt_tasks ;-)

(No the above is _not_ a typo)

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

Index: linux-2.6.23-rc9-rt1/kernel/rcutorture.c
===
--- linux-2.6.23-rc9-rt1.orig/kernel/rcutorture.c
+++ linux-2.6.23-rc9-rt1/kernel/rcutorture.c
@@ -54,6 +54,7 @@ MODULE_AUTHOR("Paul E. McKenney rtort_rcu, rcu_torture_cb);
 }
 
-static struct task_struct *rcu_preeempt_task;
 static unsigned long rcu_torture_preempt_errors;
 
 static int rcu_torture_preempt(void *arg)
@@ -274,7 +276,7 @@ static int rcu_torture_preempt(void *arg
time_t gcstart;
struct sched_param sp;
 
-   sp.sched_priority = MAX_RT_PRIO - 1;
+   sp.sched_priority = 1;
err = sched_setscheduler(current, SCHED_RR, &sp);
if (err != 0)
printk(KERN_ALERT "rcu_torture_preempt() priority err: %d\n",
@@ -297,24 +299,43 @@ static int rcu_torture_preempt(void *arg
 static long rcu_preempt_start(void)
 {
long retval = 0;
+   int i;
 
-   rcu_preeempt_task = kthread_run(rcu_torture_preempt, NULL,
-   "rcu_torture_preempt");
-   if (IS_ERR(rcu_preeempt_task)) {
-   VERBOSE_PRINTK_ERRSTRING("Failed to create preempter");
-   retval = PTR_ERR(rcu_preeempt_task);
-   rcu_preeempt_task = NULL;
+   rcu_preempt_tasks = kzalloc(nrealpreempthogs * sizeof(rcu_preempt_tasks[0]),
+   GFP_KERNEL);
+   if (rcu_preempt_tasks == NULL) {
+   VERBOSE_PRINTK_ERRSTRING("out of memory");
+   retval = -ENOMEM;
+   goto out;
}
+
+   for (i=0; i < nrealpreempthogs; i++) {
+   rcu_preempt_tasks[i] = kthread_run(rcu_torture_preempt, NULL,
+   "rcu_torture_preempt");
+   if (IS_ERR(rcu_preempt_tasks[i])) {
+   VERBOSE_PRINTK_ERRSTRING("Failed to create preempter");
+   retval = PTR_ERR(rcu_preempt_tasks[i]);
+   rcu_preempt_tasks[i] = NULL;
+   break;
+   }
+   }
+ out:
return retval;
 }
 
 static void rcu_preempt_end(void)
 {
-   if (rcu_preeempt_task != NULL) {
-   VERBOSE_PRINTK_STRING("Stopping rcu_preempt task");
-   kthread_stop(rcu_preeempt_task);
+   int i;
+   if (rcu_preempt_tasks) {
+   for (i=0; i < nrealpreempthogs; i++) {
+   if (rcu_preempt_tasks[i] != NULL) {
+   VERBOSE_PRINTK_STRING("Stopping rcu_preempt task");
+   kthread_stop(rcu_preempt_tasks[i]);
+   }
+   rcu_preempt_tasks[i] = NULL;
+   }
+   kfree(rcu_preempt_tasks);
}
-   rcu_preeempt_task = NULL;
 }
 
 static int rcu_preempt_stats(char *page)
@@ -613,10 +634,20 @@ rcu_torture_writer(void *arg)
 static int
 rcu_torture_fakewriter(void *arg)
 {
+   struct sched_param sp;
+   long id = (long) arg;
+   int err;
DEFINE_RCU_RANDOM(rand);
 
VERBOSE_PRINTK_STRING("rcu_torture_fakewriter task started");
-   set_user_nice(current, 19);
+   /*
+* Set up at a higher prio than the readers.
+*/
+   sp.sched_priority = 1 + id;
+   err = sched_setscheduler(current, SCHED_RR, &sp);
+   if (err != 0)
+   printk(KERN_ALERT "rcu_torture_writer() priority err: %d\n",
+  err);
 
do {
schedule_timeout_uninterruptible(1 + rcu_random()%10);
@@ -849,9 +880,11 @@ rcu_torture_print_module_parms(char *tag
 {
printk(KERN_ALERT "%s" TORTURE_FLAG

[PATCH 7/8] scsi: megaraid_sas - support for poll_mode_io (reduced interrupt)

2007-10-03 Thread bo yang

Added module parameter "poll_mode_io" to support "polling" (reduced
interrupt operation). In this mode, IO completion interrupts are delayed.
At the end of initiating IOs, the driver schedules cmd completion if
there are pending cmds. A timer-based interrupt has also been added to
prevent IO completion from being delayed indefinitely in the case that
no new IOs are initiated. The user is expected to tune the interrupt
throttling parameters using the MegaRAID utility and then set
poll_mode_io. Some formatting issues in the resume and suspend comment
blocks are also corrected.

Signed-off-by: Bo Yang <[EMAIL PROTECTED]>

---
 drivers/scsi/megaraid/megaraid_sas.c |  152 -
 drivers/scsi/megaraid/megaraid_sas.h |3
 2 files changed, 149 insertions(+), 6 deletions(-)

diff -uprN linux-2.6.22_orig/drivers/scsi/megaraid/megaraid_sas.c linux-2.6.22_new/drivers/scsi/megaraid/megaraid_sas.c
--- linux-2.6.22_orig/drivers/scsi/megaraid/megaraid_sas.c  2007-10-02 23:33:02.0 -0700
+++ linux-2.6.22_new/drivers/scsi/megaraid/megaraid_sas.c   2007-10-02 23:32:34.0 -0700
@@ -79,6 +79,14 @@ module_param_named(cmd_per_lun, cmd_per_
 MODULE_PARM_DESC(cmd_per_lun,
"Maximum number of commands per logical unit (default=128)");
 
+/*
+ * poll_mode_io:1- schedule complete completion from q cmd
+ */
+static unsigned int poll_mode_io;
+module_param_named(poll_mode_io, poll_mode_io, int, 0);
+MODULE_PARM_DESC(poll_mode_io,
+   "Complete cmds from IO path, (default=0)");
+
 MODULE_LICENSE("GPL");
 MODULE_VERSION(MEGASAS_VERSION);
 MODULE_AUTHOR("[EMAIL PROTECTED]");
@@ -892,6 +900,12 @@ megasas_queue_command(struct scsi_cmnd *
atomic_inc(&instance->fw_outstanding);
 
instance->instancet->fire_cmd(cmd->frame_phys_addr ,cmd->frame_count-1,instance->reg_set);
+   /*
+* Check if we have pend cmds to be completed
+*/
+   if (poll_mode_io && atomic_read(&instance->fw_outstanding))
+   tasklet_schedule(&instance->isr_tasklet);
+
 
return 0;
 
@@ -1981,6 +1995,47 @@ fail_fw_init:
 }
 
 /**
+ * megasas_start_timer - Initializes a timer object
+ * @instance:  Adapter soft state
+ * @timer: timer object to be initialized
+ * @fn:timer function
+ * @interval:  time interval between timer function call
+ */
+static inline void
+megasas_start_timer(struct megasas_instance *instance,
+   struct timer_list *timer,
+   void *fn, unsigned long interval)
+{
+   init_timer(timer);
+   timer->expires = jiffies + interval;
+   timer->data = (unsigned long)instance;
+   timer->function = fn;
+   add_timer(timer);
+}
+
+/**
+ * megasas_io_completion_timer - Timer fn
+ * @instance_addr: Address of adapter soft state
+ *
+ * Schedules tasklet for cmd completion
+ * if poll_mode_io is set
+ */
+static void
+megasas_io_completion_timer(unsigned long instance_addr)
+{
+   struct megasas_instance *instance =
+   (struct megasas_instance *)instance_addr;
+
+   if (atomic_read(&instance->fw_outstanding))
+   tasklet_schedule(&instance->isr_tasklet);
+
+   /* Restart timer */
+   if (poll_mode_io)
+   mod_timer(&instance->io_completion_timer,
+   jiffies + MEGASAS_COMPLETION_TIMER_INTERVAL);
+}
+
+/**
  * megasas_init_mfi -  Initializes the FW
  * @instance:  Adapter soft state
  *
@@ -2106,8 +2161,14 @@ static int megasas_init_mfi(struct megas
* Setup tasklet for cmd completion
*/
 
-tasklet_init(&instance->isr_tasklet, megasas_complete_cmd_dpc,
-(unsigned long)instance);
+   tasklet_init(&instance->isr_tasklet, megasas_complete_cmd_dpc,
+   (unsigned long)instance);
+
+   /* Initialize the cmd completion timer */
+   if (poll_mode_io)
+   megasas_start_timer(instance, &instance->io_completion_timer,
+   megasas_io_completion_timer,
+   MEGASAS_COMPLETION_TIMER_INTERVAL);
return 0;
 
   fail_fw_init:
@@ -2695,8 +2756,8 @@ static void megasas_shutdown_controller(
 }
 
 /**
- * megasas_suspend -driver suspend entry point
- * @pdev:   PCI device structure
+ * megasas_suspend -   driver suspend entry point
+ * @pdev:  PCI device structure
  * @state: PCI power state to suspend routine
  */
 static int __devinit
@@ -2708,6 +2769,9 @@ megasas_suspend(struct pci_dev *pdev, pm
instance = pci_get_drvdata(pdev);
host = instance->host;
 
+   if (poll_mode_io)
+   del_timer_sync(&instance->io_completion_timer);
+
megasas_flush_cache(instance);
megasas_shutdown_controller(instance, MR_DCMD_HIBERNATE_SHUTDOWN);
tasklet_kill(&instance->isr_tasklet);
@@ -2794,6 +2858,11 @@ megasas_resume(struct pci_dev *pdev)
if (megasas_start_aen(instance))

Re: [PATCH] task containersv11 add tasks file interface fix for cpusets

2007-10-03 Thread Paul Menage

On 10/3/07, Paul Jackson <[EMAIL PROTECTED]> wrote:
>
> But now (correct me if I'm wrong here) cgroups has a per-cgroup task
> list, and the above loop has cost linear in the number of tasks
> actually in the cgroup, plus (unfortunate but necessary and tolerable)
> the cost of taking a global css_set_lock, right?

Yes.

>
> And I take it the above code snippet is missing the cgroup_iter_start,
> correct?

Oops, yes.

Paul

Re: [PATCH] Version 4 (2.6.23-rc8-mm2) Smack: Simplified Mandatory Access Control Kernel

2007-10-03 Thread Al Viro

On Wed, Oct 03, 2007 at 12:51:08PM -0700, Casey Schaufler wrote:
> > > Because you throw "simple" out the window when you require userland
> > > assistance to perform this function.
> > 
> > Any more than having /tmp replaced with a symlink?
> 
> Yes. By the way, there's nothing that really requires that you
> use a /smack symlink if you don't want to. /tmp can still be a
> real directory, a mount point, a symlink to /var/tmp, or whatever
> else you want it to be if that suits your needs better. For the
> simplest scenarios /tmp -> /smack/tmp -> /moldy/ has every
> other scheme I've seen thoroughly beaten.

And your point is?  If you don't use it, you get exact same complexity
in both setups.

> > _What_ userland intervention?  Mounting stuff under /smack/tmp and not under
> > your /moldy?
> 
> Who said anything about mounting under /moldy? I never did.

Sigh...  So put the binding into fstab and be done with that.

> > Having /tmp replaced with symlink to /smack/tmp.link instead
> > of replacing it with a symlink to /smack/tmp?
> > 
> > Absolute paths in that kind of thing are _wrong_.  You know where the things
> > are on your fs.  You don't know if anything else will be visible, let alone
> > whether it will be at the same place in all chroots or namespaces.  And no,
> > you _can't_ make sure that fs is visible only in one place.  No fs can or
> > has any business even trying.
> 
> Is the objection that there is a default value coded in?

Right now the main objection is about your lack of ability to read.  Which
part of "it can be mounted in different chroots/namespaces, therefore
having absolute paths doesn't work" is too hard to understand?

No, it's not about having a default.  It's about keeping an absolute pathname
in virtual fs, having all instances autosoddingmatically sharing it _and_
having change attempt in any instance automatically affect all of them.
If you have that kind of sharing, don't pretend that your mechanism really
allows absolute pathnames.

Re: [PATCH] task containersv11 add tasks file interface fix for cpusets

2007-10-03 Thread Paul Jackson

Andrew - please kill this patch.

Looks like Paul Menage has a better solution
that I will be trying out.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401

Re: [PATCH] task containersv11 add tasks file interface fix for cpusets

2007-10-03 Thread Paul Jackson

> What was wrong with my suggestion from a couple of emails back? Adding
> the following in cpuset_attach():
> 
> struct cgroup_iter it;
> struct task_struct *p;
> while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
>     set_cpus_allowed(p, cs->cpus_allowed);
> }
> cgroup_iter_end(cs->css.cgroup, &it);

Hmmm ... that just might work.

And this brings to light the reason (justification, excuse, whatever
you call it) that I probably didn't do this earlier.

In the dark ages before cgroups (aka containers) we did not have an
efficient way to walk the tasks in a cpuset. One had to walk the entire
task list, comparing task struct cpuset pointers.  On big honking NUMA
iron, one should avoid task list walks as much as one can get away
with, even if it meant sneaking in a slightly racy API.  Since some
updates of a cpuset's 'cpus' mask don't need it (you happen to know that
all tasks in that cpuset are pause'd anyway) I might have made the
tradeoff to make this task list walk an explicitly invoked user action,
to be done only when needed.

But now (correct me if I'm wrong here) cgroups has a per-cgroup task
list, and the above loop has cost linear in the number of tasks
actually in the cgroup, plus (unfortunate but necessary and tolerable)
the cost of taking a global css_set_lock, right?

And I take it the above code snippet is missing the cgroup_iter_start,
correct?
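
For illustration, the complete start/next/end shape of such a loop can be sketched with a user-space mock. Everything named mock_* below is hypothetical and only imitates the pattern; it is not the cgroup API, and the locked flag merely stands in for whatever lock the real iterator takes.

```c
#include <stddef.h>

/* Hypothetical user-space mock of a start/next/end task iterator. */
struct mock_task { int cpus_allowed; };
struct mock_group { struct mock_task *tasks; size_t ntasks; int locked; };
struct mock_iter { size_t pos; };

static void mock_iter_start(struct mock_group *g, struct mock_iter *it)
{
	g->locked = 1;   /* stands in for taking the task-list lock */
	it->pos = 0;
}

static struct mock_task *mock_iter_next(struct mock_group *g,
					struct mock_iter *it)
{
	return it->pos < g->ntasks ? &g->tasks[it->pos++] : NULL;
}

static void mock_iter_end(struct mock_group *g, struct mock_iter *it)
{
	(void)it;
	g->locked = 0;   /* stands in for dropping the lock */
}

/* Apply a new CPU mask to every task in the group and return how many
 * tasks were updated; cost is linear in the group's own task count. */
static int mock_update_cpus(struct mock_group *g, int mask)
{
	struct mock_iter it;
	struct mock_task *p;
	int n = 0;

	mock_iter_start(g, &it);   /* the call missing from the quoted loop */
	while ((p = mock_iter_next(g, &it))) {
		p->cpus_allowed = mask;
		n++;
	}
	mock_iter_end(g, &it);
	return n;
}
```

Without the start call the iterator position is uninitialized and the lock is released without ever having been taken, which is why the quoted loop is incomplete as written.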

I'll ask Andrew to kill this patch of mine, and I will test out your
suggestion this evening.

This still leaves the other creepy crawlies involving cpusets and hot
plug that I glimpsed slithering by in my last message.  I guess I'll
start a separate discussion with Cliff Wickman and whomever else I think
might want to be involved on those issues.

Nice work - thanks.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
