new warning in ata_sg_clean

2007-03-19 Thread Meelis Roos
In today's 2.6.21-rc4+git the following new warning has appeared on my
ppc computer:

  CC [M]  drivers/ata/libata-core.o
drivers/ata/libata-core.c: In function 'ata_sg_clean':
drivers/ata/libata-core.c:3558: warning: unused variable 'dir'
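
For readers unfamiliar with this class of warning: one common way it arises (an assumption here, the report does not identify the cause) is a local variable whose only consumer is a helper macro that expands to nothing on some configurations. The sketch below uses made-up names (toy_unmap_sg, toy_sg_clean) rather than the libata code, and shows a no-op macro written so that it still references its arguments and stays warning-free.

```c
/*
 * Hypothetical no-op helper. A variant defined as
 *     #define toy_unmap_sg(nents, dir) do { } while (0)
 * would drop both arguments, and a caller whose local variable is only
 * passed to it would then trigger "unused variable". Casting each
 * argument to void counts as a use, so this variant does not warn.
 */
#define toy_unmap_sg(nents, dir) ((void)(nents), (void)(dir))

int toy_sg_clean(int n_elem, int dma_dir)
{
	int dir = dma_dir;	/* the only consumer is the macro below */

	toy_unmap_sg(n_elem, dir);
	return n_elem;
}
```

Whether the real warning comes from a macro like this, or from a use that was simply removed, would need to be checked against the tree in question.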

-- 
Meelis Roos ([EMAIL PROTECTED])
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: XFS internal error xfs_da_do_buf(2) at line 2087 of file fs/xfs/xfs_da_btree.c. Caller 0xc01b00bd

2007-03-19 Thread David Chinner
On Mon, Mar 19, 2007 at 11:32:27AM +0100, Marco Berizzi wrote:
> Marco Berizzi wrote:
> > David Chinner wrote:
> >
> >> Ok, so an ipsec change. And I see from the history below it
> >> really has nothing to do with this problem. it seems the problem
> >> has something to do with changes between 2.6.19.1 and 2.6.19.2.
> >
> > indeed. Yesterday at 13:00 I have switched from 2.6.19.1 to 2.6.19.2
> > (without the ipsec fix) and at about 17:30 linux has crashed again.
> > I have recompiled 2.6.19.2 with all kernel debugging options enabled
> > and rebooted. Now I'm waiting for the crash...
> 
> Linux has not crashed. However, here is the dmesg output
> with all debugging options enabled: (search for 'INFO:
> possible recursive locking detected'). Is that normal?

.
> =
> [ INFO: possible recursive locking detected ]
> 2.6.19.2 #1
> -
> rm/470 is trying to acquire lock:
>  (&(>i_lock)->mr_lock){}, at: [] xfs_ilock+0x5b/0xa1
> 
> but task is already holding lock:
>  (&(>i_lock)->mr_lock){}, at: [] xfs_ilock+0x5b/0xa1
> 
> other info that might help us debug this:
> 3 locks held by rm/470:
>  #0:  (>i_mutex/1){--..}, at: [] do_unlinkat+0x70/0x115
>  #1:  (>i_mutex){--..}, at: [] mutex_lock+0x1c/0x1f
>  #2:  (&(>i_lock)->mr_lock){}, at: []
> xfs_ilock+0x5b/0xa1
> 
> stack backtrace:
>  [] dump_trace+0x215/0x21a
>  [] show_trace_log_lvl+0x1a/0x30
>  [] show_trace+0x12/0x14
>  [] dump_stack+0x19/0x1b
>  [] print_deadlock_bug+0xc0/0xcf
>  [] check_deadlock+0x6a/0x79
>  [] __lock_acquire+0x350/0x970
>  [] lock_acquire+0x75/0x97
>  [] down_write+0x3a/0x54
>  [] xfs_ilock+0x5b/0xa1
>  [] xfs_lock_dir_and_entry+0x105/0x11b
>  [] xfs_remove+0x180/0x47f
>  [] xfs_vn_unlink+0x22/0x4f
>  [] vfs_unlink+0x9e/0xa2
>  [] do_unlinkat+0xa8/0x115
>  [] sys_unlink+0x10/0x12
>  [] syscall_call+0x7/0xb
>  [] 0xb7efaa7d
>  ===

That's no problem - lockdep just doesn't know that we can nest i_lock
(we've got to get the annotations for this sorted out).

> Here is the relevant results:
> 
> Phase 2 - found root inode chunk
> Phase 3 - ...
> agno = 0
> ...
> agno = 12
> LEAFN node level is 1 inode 1610612918 bno = 8388608

Hmmm - single bit error in the bno - that reminds me of this:

http://oss.sgi.com/projects/xfs/faq.html#dir2

So I'd definitely make sure that is repaired.
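
As a quick check of the arithmetic: 8388608 is 2^23, so the reported bno has exactly one bit set. A minimal C sketch of that test follows (the helper name single_bit_index is made up for illustration). It only shows that the value has a single bit set; it says nothing about what the correct bno should be.

```c
#include <stdint.h>

/*
 * Return the index of the only set bit in x, or -1 if x does not have
 * exactly one bit set.
 */
int single_bit_index(uint64_t x)
{
	int idx = 0;

	/* x & (x - 1) clears the lowest set bit; zero means only one was set */
	if (x == 0 || (x & (x - 1)) != 0)
		return -1;

	while ((x & 1) == 0) {
		x >>= 1;
		idx++;
	}
	return idx;
}
```

Calling single_bit_index(8388608) gives 23 for the value in the xfs_repair output above.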

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


PNPACPI probes serial twice, messes up serial console

2007-03-19 Thread Keith Owens
Dell SC1425 x86_64 running in i386 mode (the problem also occurs in
x86_64 mode).  Kernel 2.6.21-rc4, gcc 4.1.0.  Config extract at end.

Booting with 'console=tty console=ttyS0,9600'.  The serial console on
ttyS0 (0x3f8, irq 4) is probed twice, once from serial8250_init() and
again from serial_pnp_probe().  The serial console output is correct
until the second probe (from PNP) gets to these lines in
serial8250_config_port()

	if (flags & UART_CONFIG_TYPE)
		autoconfig(up, probeflags);

After the call to autoconfig(), the serial console starts printing NUL
characters instead of the console output.  The number of NUL characters
corresponds closely with the number of characters written to the VT
console, IOW it outputs each serial character as NUL instead of the
correct character.  When the kernel boots /sbin/init, the console
resets to printing normal characters.

AFAICT, the second probe of the UART is doing something nasty to the
hardware.  This is not a recent problem, I can reproduce the problem on
2.6.16.  Booting with pnpacpi=off removes the problem, but that
suppresses all the PNPACPI code, not just the second probe of the serial
devices.

Should pnpacpi probe and set up the serial devices even when they have
already been set up?  Or is this something strange about the UART in
this particular box?

FWIW, the serial console is plugged into a serial to USB converter
(pl2303), my laptop has no serial ports.  That should not make a
difference, but just in case it does ...

Config extract:

X86_32=y
GENERIC_TIME=y
CLOCKSOURCE_WATCHDOG=y
GENERIC_CLOCKEVENTS=y
GENERIC_CLOCKEVENTS_BROADCAST=y
LOCKDEP_SUPPORT=y
STACKTRACE_SUPPORT=y
SEMAPHORE_SLEEPERS=y
X86=y
MMU=y
ZONE_DMA=y
GENERIC_ISA_DMA=y
GENERIC_IOMAP=y
GENERIC_BUG=y
GENERIC_HWEIGHT=y
ARCH_MAY_HAVE_PC_FDC=y
DMI=y
DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
EXPERIMENTAL=y
LOCK_KERNEL=y
INIT_ENV_ARG_LIMIT=32
LOCALVERSION="-i386-kaos"
LOCALVERSION_AUTO=y
SWAP=y
SYSVIPC=y
SYSVIPC_SYSCTL=y
POSIX_MQUEUE=y
IKCONFIG=y
IKCONFIG_PROC=y
SYSFS_DEPRECATED=y
CC_OPTIMIZE_FOR_SIZE=y
SYSCTL=y
EMBEDDED=y
SYSCTL_SYSCALL=y
KALLSYMS=y
KALLSYMS_ALL=y
HOTPLUG=y
PRINTK=y
BUG=y
ELF_CORE=y
BASE_FULL=y
FUTEX=y
EPOLL=y
SHMEM=y
SLAB=y
VM_EVENT_COUNTERS=y
RT_MUTEXES=y
BASE_SMALL=0
MODULES=y
MODULE_UNLOAD=y
KMOD=y
STOP_MACHINE=y
BLOCK=y
LBD=y
LSF=y
IOSCHED_NOOP=y
IOSCHED_AS=y
IOSCHED_DEADLINE=y
IOSCHED_CFQ=y
DEFAULT_DEADLINE=y
DEFAULT_IOSCHED="deadline"
TICK_ONESHOT=y
HIGH_RES_TIMERS=y
SMP=y
X86_PC=y
MPENTIUM4=y
X86_CMPXCHG=y
X86_L1_CACHE_SHIFT=7
RWSEM_XCHGADD_ALGORITHM=y
GENERIC_CALIBRATE_DELAY=y
X86_WP_WORKS_OK=y
X86_INVLPG=y
X86_BSWAP=y
X86_POPAD_OK=y
X86_CMPXCHG64=y
X86_GOOD_APIC=y
X86_INTEL_USERCOPY=y
X86_USE_PPRO_CHECKSUM=y
X86_TSC=y
HPET_TIMER=y
HPET_EMULATE_RTC=y
NR_CPUS=8
SCHED_SMT=y
PREEMPT_NONE=y
X86_LOCAL_APIC=y
X86_IO_APIC=y
X86_MCE=y
X86_MCE_NONFATAL=y
X86_MCE_P4THERMAL=y
MICROCODE=m
MICROCODE_OLD_INTERFACE=y
X86_MSR=m
X86_CPUID=m
HIGHMEM4G=y
VMSPLIT_3G=y
PAGE_OFFSET=0xC000
HIGHMEM=y
ARCH_FLATMEM_ENABLE=y
ARCH_SPARSEMEM_ENABLE=y
ARCH_SELECT_MEMORY_MODEL=y
ARCH_POPULATES_NODE_MAP=y
SELECT_MEMORY_MODEL=y
FLATMEM_MANUAL=y
FLATMEM=y
FLAT_NODE_MEM_MAP=y
SPARSEMEM_STATIC=y
SPLIT_PTLOCK_CPUS=4
ZONE_DMA_FLAG=1
MTRR=y
IRQBALANCE=y
HZ_250=y
HZ=250
PHYSICAL_START=0x10
PHYSICAL_ALIGN=0x10
COMPAT_VDSO=y
ARCH_ENABLE_MEMORY_HOTPLUG=y
PM=y
ACPI=y
ACPI_PROCFS=y
ACPI_BUTTON=m
ACPI_FAN=m
ACPI_PROCESSOR=m
ACPI_BLACKLIST_YEAR=0
ACPI_EC=y
ACPI_POWER=y
ACPI_SYSTEM=y
PCI=y
PCI_GOANY=y
PCI_BIOS=y
PCI_DIRECT=y
PCI_MMCONFIG=y
PCIEPORTBUS=y
PCIEAER=y
PCI_MSI=y
HT_IRQ=y
ISA_DMA_API=y
BINFMT_ELF=y
BINFMT_MISC=m
NET=y
PACKET=y
PACKET_MMAP=y
UNIX=y
XFRM=y
INET=y
IP_MULTICAST=y
IP_ADVANCED_ROUTER=y
ASK_IP_FIB_HASH=y
IP_FIB_HASH=y
IP_ROUTE_MULTIPATH=y
IP_ROUTE_VERBOSE=y
SYN_COOKIES=y
INET_XFRM_MODE_BEET=y
INET_DIAG=y
INET_TCP_DIAG=y
TCP_CONG_CUBIC=y
DEFAULT_TCP_CONG="cubic"
NETFILTER=y
NETFILTER_NETLINK=m
NETFILTER_NETLINK_LOG=m
NETFILTER_XTABLES=y
NETFILTER_XT_TARGET_CLASSIFY=m
NETFILTER_XT_TARGET_MARK=m
NETFILTER_XT_MATCH_COMMENT=m
NETFILTER_XT_MATCH_DCCP=m
NETFILTER_XT_MATCH_ESP=m
NETFILTER_XT_MATCH_LENGTH=m
NETFILTER_XT_MATCH_LIMIT=m
NETFILTER_XT_MATCH_MAC=m
NETFILTER_XT_MATCH_MARK=m
NETFILTER_XT_MATCH_MULTIPORT=m
NETFILTER_XT_MATCH_PKTTYPE=m
NETFILTER_XT_MATCH_QUOTA=m
NETFILTER_XT_MATCH_REALM=m
NETFILTER_XT_MATCH_SCTP=m
NETFILTER_XT_MATCH_STATISTIC=m
NETFILTER_XT_MATCH_TCPMSS=m
IP_NF_IPTABLES=y
IP_NF_MATCH_IPRANGE=m
IP_NF_MATCH_TOS=m
IP_NF_MATCH_RECENT=m
IP_NF_MATCH_ECN=m
IP_NF_MATCH_AH=m
IP_NF_MATCH_TTL=m
IP_NF_MATCH_OWNER=m
IP_NF_MATCH_ADDRTYPE=m
IP_NF_FILTER=y
IP_NF_TARGET_REJECT=y
IP_NF_TARGET_ULOG=y
VLAN_8021Q=y
NET_CLS_ROUTE=y
STANDALONE=y
PREVENT_FIRMWARE_BUILD=y
FW_LOADER=m
CONNECTOR=m
PNP=y
PNP_DEBUG=y
PNPACPI=y
BLK_DEV_FD=m
BLK_DEV_LOOP=m
IDE=m
IDE_MAX_HWIFS=4
BLK_DEV_IDE=m
BLK_DEV_IDEDISK=m
IDEDISK_MULTI_MODE=y
BLK_DEV_IDECD=m
IDE_TASK_IOCTL=y
BLK_DEV_IDEPCI=y
IDEPCI_SHARE_IRQ=y
BLK_DEV_IDEDMA_PCI=y
IDEDMA_PCI_AUTO=y
BLK_DEV_PIIX=m
BLK_DEV_IDEDMA=y

Re: [PATCH] vt: fix a potential race in the VT_WAITACTIVE handler

2007-03-19 Thread Andrew Morton
On Thu, 15 Mar 2007 15:10:23 +0100 Michal Januszewski <[EMAIL PROTECTED]> wrote:

> On a multiprocessor machine the VT_WAITACTIVE ioctl call may return 0
> if fg_console has already been updated in redraw_screen(), but the 
> console switch itself hasn't been completed. Fix this by checking
> fg_console in vt_waitactive() with the console sem held.
> 
> Signed-off-by: Michal Januszewski <[EMAIL PROTECTED]>
> 
> ---
> diff --git a/drivers/char/vt_ioctl.c b/drivers/char/vt_ioctl.c
> index 3a5d301..00b5b34 100644
> --- a/drivers/char/vt_ioctl.c
> +++ b/drivers/char/vt_ioctl.c
> @@ -1041,8 +1041,12 @@ int vt_waitactive(int vt)
>  	for (;;) {
>  		set_current_state(TASK_INTERRUPTIBLE);
>  		retval = 0;
> -		if (vt == fg_console)
> +		acquire_console_sem();
> +		if (vt == fg_console) {
> +			release_console_sem();
>  			break;
> +		}
> +		release_console_sem();
>  		retval = -EINTR;
>  		if (signal_pending(current))
>  			break;
> 

OK.  I think.  It's hard to tell.  I assume that the acquire_console_sem()
in here is to synchronise against some other function which also takes
acquire_console_sem(), but it is not clear which.

So could you please redo this with a comment which tells the reader exactly
what's being protected against what, and why?

Also, I always feel a bit worried by:

set_current_state(TASK_INTERRUPTIBLE);
down(...);

because if it hits contention, the down() will undo the
set_current_state().  Now that's normally OK because we loop, and because
the semaphore won't normally be 100% contended all the time.  Unless
someone reimplements down() so it happens to return in state TASK_RUNNING
all the time, which they could legitimately do (although this would
probably break stuff such as the above).


But still, it is nicer to do

down(...);
set_current_state(TASK_INTERRUPTIBLE);

if possible, and I think it is possible here.
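
The ordering concern can be illustrated with a toy userspace model. This is not kernel code: toy_state, toy_set_current_state() and toy_down() are made-up stand-ins, and the only behaviour modelled is the assumption stated above, namely that a contended down() sleeps and returns with the task back in the running state.

```c
enum toy_state { TOY_RUNNING, TOY_INTERRUPTIBLE };

/* Stand-in for the current task's state field. */
static enum toy_state toy_current = TOY_RUNNING;

static void toy_set_current_state(enum toy_state s)
{
	toy_current = s;
}

/* Toy down(): under contention the caller sleeps and wakes up running. */
static void toy_down(int contended)
{
	if (contended)
		toy_current = TOY_RUNNING;
}

/* Ordering from the patch: set the state first, then take the semaphore. */
enum toy_state state_then_down(int contended)
{
	toy_set_current_state(TOY_INTERRUPTIBLE);
	toy_down(contended);
	return toy_current;
}

/* Suggested ordering: take the semaphore first, then set the state. */
enum toy_state down_then_state(int contended)
{
	toy_down(contended);
	toy_set_current_state(TOY_INTERRUPTIBLE);
	return toy_current;
}
```

In this model state_then_down(1) ends up in TOY_RUNNING, so the state set before the semaphore is lost, while down_then_state() ends in TOY_INTERRUPTIBLE whether or not there was contention.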

Thanks.


Re: [PATCH] Sanitize filesystem NLS handling

2007-03-19 Thread Alexander E. Patrakov

OGAWA Hirofumi wrote:
> "Alexander E. Patrakov" <[EMAIL PROTECTED]> writes:
>
>> But, anyway, this is a separate issue that my patch doesn't attempt to
>> correct. The conclusion so far is that we disagree, and that there are
>> situations where using utf8 iocharset is the least of all evils, so the
>> warning is not justified enough. Reproducible testcase:
>
> Again, I don't care about read at all. And why don't you use the "utf8"
> option instead of "iocharset=utf8"? "iocharset=utf8" is warned about until
> it is fixed. The "utf8" option also doesn't work correctly in some cases, though.


Would it be OK for you if I add the mount-time check for iocharset=utf8 to 
the fat filesystem and silently replace this with the "utf8" option, instead 
of overly actively warning the users? This way, the sysfs option and the 
nls_base.iocharset module parameter will still work as I want.



> I'm talking about two filesystems on a system here, not two encodings
> on one filesystem.

I am also talking about this. Mounting two filesystems with different
iocharsets is insane, because this will result in one of the following outcomes:


1) "ls" will show wrong characters in filenames on one of the filesystems
2) one of the two filesystems will contain wrong on-disk data for filenames, 
that, when misinterpreted by mounting with wrong iocharset, results in 
seemingly-correct output, but is misunderstood by the properly set up 
reference implementation (that's what is likely to happen with jfs in your 
example).


> Because you didn't change the locale. And it is your policy, right?


Yes. This is because I have some files with non-ASCII names in my home 
directory. Changing the locale would make these filenames look wrong until I 
change it back.


--
Alexander E. Patrakov


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 21:27 -0700, Greg KH wrote:
> On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
> > Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > > 
> > > well we can do the handshake to take ownership like we do much later in
> > > boot, but that requires PCI to be there and fully discovered, which we
> > > don't have this early.
> > 
> > That's not true - we do early pci discovery. Doing USB handoff
> > there would be quite possible.
> 
> What, we don't do USB "handoff" early enough in the boot process?  It's
> happening at PCI quirk time now, which I think should be early enough
> for everyone (and too early for some who rely on USB keyboards and
> initramfs shells...)

It happens way after the CPUs are brought up. At this point both the
delay loop calibration and the local APIC calibration are already done.

tglx




Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 20:05 -0500, Matt Mackall wrote:
> On Tue, Mar 20, 2007 at 01:42:46AM +0100, Thomas Gleixner wrote:
> > On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:
> > > This is exactly the same problem as booting on a desktop PC. But
> > > somehow LILO manages. My first Linux box had a hell of a lot less disk
> > > than the platform I bootstrapped (and wrote NAND drivers for) last
> > > month had in NAND.
> > 
> > No, it is not. You get the absolute sector address of your second stage
> > and this is a complete nobrainer. The translation is done in the DISK
> > device.
> 
> LILO and friends manage to boot systems that use software RAID and
> LVM. There are multiple methods. Some use block lists, some use tiny
> boot partitions, etc. All of them are applicable to controllerless NAND.

Yes, by using fixed addresses, which is not what I want.

> > You simply ignore the fact, that inside each disk, USB Stick, CF-CARD,
> > whatever - there is a more or less intelligent controller device, which
> > does the mapping to the physical storage location. There is _NO_ such
> > thing on a bare FLASH chip.
> 
> How many times do I have to tell you that I wrote a driver for
> controllerless NAND just last month?

Wow. I'm impressed because I'm pulling my opinion out of thin air.

> > How exactly does device mapper:
> > 
> > A) across device wear levelling ?
> 
> The same way UBI does, but encapsulated in a device mapper layer.

Does the device mapper do that ?

> > B) dynamic partitioning for FLASH aware file systems ?
>
> See above.

Does the device mapper do that ?

> > C) across device wear levelling for FLASH aware file systems ?
> 
> See above.

Look at your own drawing. 

> > D) background bit-flip corrections (copying affected blocks and recycling
> > the old one) ?
> 
> See above.

Repeating patterns do not impress me. Your drawing tells otherwise.

> > E) allow position independent placement of the second stage bootloader ?
> 
> See way above to my LILO response.

Neither LILO nor GRUB have search capabilities for randomly located
second stage loaders.

> > > > You need to implement a clever journalling block device
> > > > emulator in order to keep the data alive and the FLASH not worn out
> > > > within no time. You need the wear levelling, otherwise you can throw
> > > > away your FLASH in no time.
> > > 
> > > And that's why it's in my picture.
> > 
> > Yes, it is in your picture, but:
> > 
> > 1) it excludes FLASH aware file systems and UBI does not.
> > 2) your picture still does not explain how it achieves the above A),
> > B), C), D) and E)
> > 
> > Your extra path for partitioning(4) and JFFS2 is just a weird hack,
> > which makes your proposal completely absurd.
> 
> No, it's just there to show the flexibility of device mapper. But I have
> the sneaking suspicion you have no idea how device mapper works.

Sigh. Layering violation == flexibility.

> In brief: device mapper takes one or more devices, applies a mapping
> to them, and returns a new device. For example, take various spans of
> /dev/hda1 and /dev/sda3 and present them as new-device1. Take
> new-device1 and transform it with dm-crypt to get new-device2. The
> kernel doesn't decide how to do this, any more than it decides where
> to mount your filesystems. Userspace does.

I know how it works. But your blurb does not answer any of my questions.
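
For readers following along, the mapping step in the quoted paragraph (spans of underlying devices presented as one new device) can be sketched in userspace C. This is a toy table lookup with made-up names (toy_span, toy_map); it is not the device mapper API, and it does not address wear levelling or any of the other points under dispute in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* One span of an underlying device, placed at logical_start in the new device. */
struct toy_span {
	uint64_t logical_start;	/* first sector in the mapped device */
	uint64_t len;		/* number of sectors in this span */
	int dev;		/* hypothetical id of the underlying device */
	uint64_t dev_start;	/* first sector on the underlying device */
};

/*
 * Translate a sector of the mapped device into (device, sector) on the
 * underlying storage. Returns 0 on success, -1 if no span covers it.
 */
int toy_map(const struct toy_span *table, size_t n, uint64_t sector,
	    int *dev, uint64_t *dev_sector)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (sector >= table[i].logical_start &&
		    sector - table[i].logical_start < table[i].len) {
			*dev = table[i].dev;
			*dev_sector = table[i].dev_start +
				      (sector - table[i].logical_start);
			return 0;
		}
	}
	return -1;
}
```

With a table holding 100 sectors of device 1 starting at sector 500, followed by 50 sectors of device 2 starting at sector 0, logical sector 10 maps to device 1 sector 510 and logical sector 120 maps to device 2 sector 20.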

> > > > > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > > > >snapshot, etc.).
> > > > 
> > > > Why should we reimplement that ?
> > > 
> > > So that you can get encryption and snapshot, etc.?
> > 
> > 1. On top of a clever block device.
> > 
> > 2. UBI can do snapshots by design.
> 
> Oh, so you HAVE reimplemented it.

No, it already works.

> > 3. Encryption should be done on the VFS layer and not below the
> > filesystem layer. Doing it inside the block layer or the device mapper
> > is broken by design.
> 
> That's highly debatable and not a topic for this thread.

I see, you define, what has to be discussed.

tglx




Re: [RFC][PATCH] split file and anonymous page queues #2

2007-03-19 Thread Nick Piggin

Rik van Riel wrote:
> Split the anonymous and file backed pages out onto their own pageout
> queues.  This way we do not unnecessarily churn through lots of anonymous
> pages when we do not want to swap them out anyway.
>
> This should (with additional tuning) be a great step forward in
> scalability, allowing Linux to run well on very large systems where
> scanning through the anonymous memory (on our way to the page cache
> memory we do want to evict) is slowing systems down significantly.
>
> This patch has been stress tested and seems to work, but has not
> been fine tuned or benchmarked yet.  For now the swappiness parameter
> can be used to tweak swap aggressiveness up and down as desired, but
> in the long run we may want to simply measure IO cost of page cache
> and anonymous memory and auto-adjust.
>
> We apply pressure to each set of pageout queues based on:
> - the size of each queue
> - the fraction of recently referenced pages in each queue,
>    not counting used-once file pages
> - swappiness (file IO is more efficient than swap IO)
>
> Please take this patch for a spin and let me know what goes well
> and what goes wrong.


This ignores whether a file page is mapped, doesn't it?

Even so, it could be a good approach anyway.

There are a couple of little nice improvements you have there, such as
treating shmem pages in the same class as anon pages. We found that we
needed something similar, so some of those things should go upstream
on their own.

--
SUSE Labs, Novell Inc.


Re: RSDL v0.31

2007-03-19 Thread Willy Tarreau
On Mon, Mar 19, 2007 at 08:11:55PM -0700, Linus Torvalds wrote:
> Quite frankly, I was *planning* on merging RSDL very early after 2.6.21, 
> but there is one thing that has turned me completely off the whole thing:
> 
>  - the people involved seem to be totally unwilling to even admit there 
>might be a problem.
> 
> This is like alcoholism. If you cannot admit that you might have a 
> problem, you'll never get anywhere. And quite frankly, the RSDL proponents 
> seem to be in denial ("we're always better", "it's your problem if the old 
> scheduler works better", "just one report of old scheduler being better").
> 
> And the thing is, if people aren't even _willing_ to admit that there may 
> be issues, there's *no*way*in*hell* I will merge it even for testing. 
> Because the whole and only point of merging RSDL was to see if it could 
> replace the old scheduler, and the most important feature in that case is 
> not whether it is perfect, BUT WHETHER ANYBODY IS INTERESTED IN TRYING TO 
> FIX THE INEVITABLE PROBLEMS!

Linus, you're being unfair to Con. He initially held that position, but lately
he has worked with Mike, proposing changes to try to improve his X responsiveness.
But he's ill right now and cannot touch the keyboard, so only his supporters
speak for him, and as you know, speech is not code and does not fix problems.

Leave him a week or so to recover and let's see what he can propose. Hopefully
a week away from the keyboard will help him think with a more general approach.
Also, Mike has already modified the code a bit to get better experience.

Also, while I don't agree with starting to renice X to get something usable,
it seems real that there's something funny on Mike's system which makes it
behave particularly strangely when combined with RSDL, because other people
in comparable tests (including me) have found X perfectly smooth even with
loads in the tens or even hundreds. I really suspect that we will find a bug
in RSDL which triggers the problem and that this fix will help discover
another problem on Mike's hardware which was not triggered by mainline.

Regards,
Willy



Re: [patch 4/6] mm: merge populate and nopage into fault (fixes nonlinear)

2007-03-19 Thread Nick Piggin
On Mon, Mar 19, 2007 at 09:44:28PM +0100, Blaisorblade wrote:
> On Sunday 18 March 2007 03:50, Nick Piggin wrote:
> > > >
> > > > Yes, I believe that is the case, however I wonder if that is going to
> > > > be a problem for you to distinguish between write faults for clean
> > > > writable ptes, and write faults for readonly ptes?
> > >
> > > I wouldn't be able to distinguish them, but am I going to get write
> > > faults for clean ptes when vma_wants_writenotify() is false (as seems to
> > > be for tmpfs)? I guess not.
> > >
> > > For tmpfs pages, clean writable PTEs are mapped as writable so they won't
> > > give any problem, since vma_wants_writenotify() is false for tmpfs.
> > > Correct?
> >
> > Yes, that should be the case. So would this mean that nonlinear protections
> > don't work on regular files?
> 
> They still work in most cases (including for UML), but if the initial mmap() 
> specified PROT_WRITE, that is ignored, for pages which are not remapped via 
> remap_file_pages(). UML uses PROT_NONE for the initial mmap, so that's no 
> problem.

But how are you going to distinguish a write fault on a readonly pte for
dirty page accounting vs a read-only nonlinear protection?

You can't store any more data in a present pte AFAIK, so you'd have to
have some out of band data. At which point, you may as well just forget
about vma_wants_writenotify vmas, considering that everybody is using
shmem/ramfs.


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
> For VMI, the default clobber was "cc", and you need a way to allow at
> least that, because saving and restoring flags is too expensive on x86.

According to lore (Andi, I think), asm() always clobbers cc. 

> I still don't think this was a good trade.  The primary motivation for
> clobbering %eax was that Xen wanted a free register to use for
> computing the offset into the shared data in the case of SMP
> preemptible kernels.  Xen no longer needs such a register, they can
> use the PDA offset instead.  And it does hurt native performance by
> unconditionally stealing a register in the four most commonly invoked
> paravirt-ops code sequences.

Actually, it still does need a temp register.  The sequence for cli is:

mov %fs:xen_vcpu, %eax
movb $1,1(%eax)

At some point I hope to move the vcpu structure directly into the
pda/percpu variables, at which point it will need no temps.
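
In C terms, the two-instruction sequence above loads a per-vcpu pointer and stores 1 to the byte at offset 1 of the structure it points to. The sketch below is a userspace illustration with made-up names (toy_vcpu_info, toy_xen_vcpu, toy_cli); treating the byte at offset 1 as an event-delivery mask flag is an assumption about the structure layout, which the message itself does not spell out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the first two bytes of the per-vcpu structure. */
struct toy_vcpu_info {
	uint8_t upcall_pending;	/* offset 0 */
	uint8_t upcall_mask;	/* offset 1: nonzero means delivery is masked */
};

/* Stand-in for the pointer the real code reads via %fs:xen_vcpu. */
static struct toy_vcpu_info toy_vcpu_storage;
struct toy_vcpu_info *toy_xen_vcpu = &toy_vcpu_storage;

void toy_cli(void)
{
	/* mov %fs:xen_vcpu, %eax : the pointer load needs a temp register */
	struct toy_vcpu_info *v = toy_xen_vcpu;

	/* movb $1,1(%eax) : set the byte at offset 1 */
	v->upcall_mask = 1;
}
```

The temporary `v` plays the role of %eax. If the structure itself lived at a fixed per-cpu offset, the store could address it directly and the temporary would disappear, which is the change described above.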

J


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Eric St-Laurent
On Tue, 2007-20-03 at 01:04 -0400, Lee Revell wrote:

> I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
> not to mention people trying to spec out hardware for RT
> applications...

There is a SMI disabling module in RTAI, check the smi-module.c in this:

https://www.rtai.org/RTAI/rtai-3.5.tar.bz2

More info:

http://www.captain.at/rtai-smi-high-latency.php
http://www.captain.at/xenomai-smi-high-latency.php

It might make sense to merge this code, at least in the -rt tree.


- Eric




Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-03-19 Thread Nick Piggin

Christoph Hellwig wrote:
> On Mon, Mar 19, 2007 at 09:11:31PM +1100, Nick Piggin wrote:
>
>> I've got the patches in -mm now. I hope they will get merged when the
>> next window opens.
>>
>> I didn't submit the ->page_mkwrite conversion yet, because I didn't
>> have any callers to look at. It is slightly less trivial than for
>> nopage and nopfn, so having David's block_page_mkwrite is helpful.
>
> Yes.  I was just wondering whether it makes more sense to do this
> functionality directly on top of ->fault instead of converting it over
> real soon.


I would personally prefer that, but I don't want to block David's
patch from being merged if the ->fault patches do not get in next
cycle. If the fault patches do make it in first, then yes we should
do the page_mkwrite conversion before merging David's patch.

I'll keep an eye on it, and try to do the right thing.

Thanks,
Nick

--
SUSE Labs, Novell Inc.


Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded

2007-03-19 Thread Nick Piggin

Nick Piggin wrote:

Andrew Morton wrote:

On Tue, 20 Mar 2007 13:47:53 +1100 Nick Piggin 
<[EMAIL PROTECTED]> wrote:




Andrew Morton wrote:




Hang on a sec... I'll try fixing the thing before you next make a
release.




Too late.  hot-fixes/ awaits thee.



Awww... well thanks very much Michal for reporting the bug, I reproduced
it easily and it turns out to be a typo.

In my testing I never had a lot of writeout going on, so most of the pages
will have been truncated in the first loop...


Also, noticed another problem in the same general area. Andrew you were
indeed right to question the removal of that unmap_mapping_range call,
but I think even that alone wasn't enough...

--
SUSE Labs, Novell Inc.
The nopage vs invalidate race fix patch did not take care of truncating
private COW pages. Mind you, I'm pretty sure this was previously racy
even for regular truncate, not to mention vmtruncate_range.

Anyway, fix that omission.

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1905,7 +1905,18 @@ int vmtruncate(struct inode * inode, lof
if (IS_SWAPFILE(inode))
goto out_busy;
i_size_write(inode, offset);
+
+   /*
+* unmap_mapping_range is called twice, first simply for efficiency
+* so that truncate_inode_pages does fewer single-page unmaps. However
+* after this first call, and before truncate_inode_pages finishes,
+* it is possible for private pages to be COWed, which remain after
+* truncate_inode_pages finishes, hence the second unmap_mapping_range
+* call must be made for correctness.
+*/
+   unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
truncate_inode_pages(mapping, offset);
+   unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
goto out_truncate;
 
 do_expand:
@@ -1943,7 +1954,9 @@ int vmtruncate_range(struct inode *inode
 
mutex_lock(&inode->i_mutex);
down_write(&inode->i_alloc_sem);
+   unmap_mapping_range(mapping, offset, (end - offset), 1);
truncate_inode_pages_range(mapping, offset, end);
+   unmap_mapping_range(mapping, offset, (end - offset), 1);
inode->i_op->truncate_range(inode, offset, end);
up_write(&inode->i_alloc_sem);
mutex_unlock(&inode->i_mutex);


Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded

2007-03-19 Thread Nick Piggin

Andrew Morton wrote:

On Tue, 20 Mar 2007 13:47:53 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote:



Andrew Morton wrote:




Hang on a sec... I'll try fixing the thing before you next make a
release.




Too late.  hot-fixes/ awaits thee.


Awww... well thanks very much Michal for reporting the bug, I reproduced
it easily and it turns out to be a typo.

In my testing I never had a lot of writeout going on, so most of the pages
will have been truncated in the first loop...

--
SUSE Labs, Novell Inc.
Fix typo in do_no_page vs invalidate race fix patch.

Index: linux-2.6/mm/truncate.c
===
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -235,7 +235,7 @@ void truncate_inode_pages_range(struct a
wait_on_page_writeback(page);
if (page_mapped(page)) {
unmap_mapping_range(mapping,
- (loff_t)page_index next)


Re: [patch 00/31] 2.6.20-stable review

2007-03-19 Thread Gene Heskett
On Monday 19 March 2007, Greg KH wrote:
>This is the start of the stable review cycle for the 2.6.20.4 release.
>There are 31 patches in this series, all will be posted as a response
>to this one.  If anyone has any issues with these being applied, please
>let us know.  If anyone is a maintainer of the proper subsystem, and
>wants to add a Signed-off-by: line to the patch, please respond with it.
>
>These patches are sent out with a number of different people on the
>Cc: line.  If you wish to be a reviewer, please email [EMAIL PROTECTED]
>to add your name to the list.  If you want to be off the reviewer list,
>also email us.
>
>Responses should be made by Thursday March, 22, 15:00:00 UTC.
>Anything received after that time might be too late.

BINGO!  One of these 31 patches may be the guilty party that's playing
tricks with tar's mind.  I'm running 2.6.20.4-rc1 on an older athlon 
xp2800 with a gig of ram.

Amanda has gotten through the estimate phase and is now doing the backup.  
It will fail, out of tape.  Here is an amstatus output as its running 
right now.

coyote:/GenesAmandaHelper-0.5 3 planner: [dumps way too big, 350850 KB, must 
skip incremental dumps]
coyote:/GenesAmandaHelper-0.6 1 planner: [dumps way too big, 184977 KB, must 
skip incremental dumps]
coyote:/bin   1 planner: [dumps way too big, 1110 KB, must skip 
incremental dumps]
coyote:/boot  13m wait for dumping
coyote:/dev   1 planner: [dumps way too big, 290 KB, must skip 
incremental dumps]
coyote:/etc   1 planner: [dumps way too big, 18291 KB, must 
skip incremental dumps]
coyote:/home  0 1018m wait for dumping
coyote:/lib   3 planner: [dumps way too big, 11705 KB, must 
skip incremental dumps]
coyote:/opt   15m wait for dumping
coyote:/root  3 planner: [dumps way too big, 785963 KB, must 
skip incremental dumps]
coyote:/sbin  1 planner: [dumps way too big, 10 KB, must skip 
incremental dumps]
coyote:/tmp   4   32m wait for dumping
coyote:/usr/X11R6 12m wait for dumping
coyote:/usr/bin   1 planner: [dumps way too big, 339170 KB, must 
skip incremental dumps]
coyote:/usr/dlds  1 planner: [dumps way too big, 2140 KB, must skip 
incremental dumps]
coyote:/usr/dlds-misc 30m wait for dumping
coyote:/usr/dlds-rpms 1 planner: [dumps way too big, 3130 KB, must skip 
incremental dumps]
coyote:/usr/dlds-tgzs 1 planner: [dumps way too big, 10 KB, must skip 
incremental dumps]
coyote:/usr/games 00m wait for dumping
coyote:/usr/include   1 planner: [dumps way too big, 10557 KB, must 
skip incremental dumps]
coyote:/usr/kerberos  10m wait for dumping
coyote:/usr/lib   1 planner: [dumps way too big, 474409 KB, must 
skip incremental dumps]
coyote:/usr/libexec   2 planner: [dumps way too big, 11285 KB, must 
skip incremental dumps]
coyote:/usr/local 2  279m wait for dumping
coyote:/usr/man   10m wait for dumping
coyote:/usr/movies2 7271m dumping 5485m ( 75.44%) (0:12:47)
coyote:/usr/music 1 planner: [dumps way too big, 2448290 KB, must 
skip incremental dumps]
coyote:/usr/pix   2   17m wait for dumping
coyote:/usr/sbin  1 planner: [dumps way too big, 3254 KB, must skip 
incremental dumps]
coyote:/usr/share 3 planner: [dumps way too big, 40514 KB, must 
skip incremental dumps]
coyote:/usr/src   3 6822m wait for dumping
coyote:/var   1  366m wait for dumping

SUMMARY          part      real  estimated
                           size       size
partition       :  32
estimated       :  32                31973m
flush           :   0         0m
failed          :  18                16155m   ( 50.53%)
wait for dumping:  13                 8547m   ( 26.73%)
dumping to tape :   0                    0m   (  0.00%)
dumping         :   1      5485m      7271m ( 75.44%) ( 17.16%)
dumped          :   0         0m         0m (  0.00%) (  0.00%)
wait for writing:   0         0m         0m (  0.00%) (  0.00%)
wait to flush   :   0         0m         0m (100.00%) (  0.00%)
writing to tape :   0         0m         0m (  0.00%) (  0.00%)
failed to tape  :   0         0m         0m (  0.00%) (  0.00%)
taped           :   0         0m         0m (  0.00%) (  0.00%)
  tape 1        :   0         0m         0m (  0.00%) Dailys-19
8 dumpers idle  : not-idle
taper idle
network free kps:      6800
holding space   :    71118m (100.00%)
 dumper0 busy   :  0:00:00  (  0.00%)
 0 dumpers busy :  0:00:00  (  0.00%)
 1 dumper busy  :  0:00:00  (  0.00%)

The directory shown on line one of this report actually has:
[EMAIL PROTECTED] /]# du -h /GenesAmandaHelper-0.5/
1.6G

Re: [PATCH] powerpc minor pagefault optimization with kprobes enabled

2007-03-19 Thread Anton Blanchard

> I've attached a patch below that optimizes this code path for powerpc,
> but the scheme applies to all architectures as well.  It just rips out all
> the callchain madness, and does as good as it gets in the pagefault
> handler:

NAK, patch on the way to get rid of all the debugger() crap by using
this very hook.

Anton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Lee Revell

On 3/16/07, Thomas Gleixner <[EMAIL PROTECTED]> wrote:
> Yes, this is probably caused by SMM code trying to emulate a PS/2
> keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
> have no way to disable this BIOS misfeature in the early boot process.


https://mail.rtai.org/pipermail/rtai/2003-March/002949.html

http://www.embeddedrelated.com/usenet/embedded/show/50333-1.php

I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
not to mention people trying to spec out hardware for RT
applications...

Lee


2.6.21-rc4-mm1

2007-03-19 Thread Andrew Morton

Temporarily at

  http://userweb.kernel.org/~akpm/2.6.21-rc4-mm1/

Will appear later at

  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc4/2.6.21-rc4-mm1/



- Restored the RSDL CPU scheduler (a new version thereof)



Boilerplate:

- See the `hot-fixes' directory for any important updates to this patchset.

- To fetch an -mm tree using git, use (for example)

  git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git tag v2.6.16-rc2-mm1
  git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1

- -mm kernel commit activity can be reviewed by subscribing to the
  mm-commits mailing list.

echo "subscribe mm-commits" | mail [EMAIL PROTECTED]

- If you hit a bug in -mm and it is not obvious which patch caused it, it is
  most valuable if you can perform a bisection search to identify which patch
  introduced the bug.  Instructions for this process are at

http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt

  But beware that this process takes some time (around ten rebuilds and
  reboots), so consider reporting the bug first and if we cannot immediately
  identify the faulty patch, then perform the bisection search.

- When reporting bugs, please try to Cc: the relevant maintainer and mailing
  list on any email.

- When reporting bugs in this kernel via email, please also rewrite the
  email Subject: in some manner to reflect the nature of the bug.  Some
  developers filter by Subject: when looking for messages to read.

- Occasional snapshots of the -mm lineup are uploaded to
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/ and are announced on
  the mm-commits list.


Changes since 2.6.21-rc3-mm1:


 origin.patch
 git-acpi.patch
 git-alsa.patch
 git-arm-master.patch
 git-arm.patch
 git-avr32.patch
 git-cifs.patch
 git-cpufreq.patch
 git-powerpc.patch
 git-drm.patch
 git-dvb.patch
 git-gfs2-nmw.patch
 git-hid.patch
 git-ia64.patch
 git-ieee1394.patch
 git-infiniband.patch
 git-input.patch
 git-kbuild.patch
 git-kvm.patch
 git-leds.patch
 git-libata-all.patch
 git-md-accel.patch
 git-mmc.patch
 git-mtd.patch
 git-ubi.patch
 git-netdev-all.patch
 git-ioat.patch
 git-ocfs2.patch
 git-parisc.patch
 git-selinux.patch
 git-pciseg.patch
 git-s390.patch
 git-sh.patch
 git-scsi-misc.patch
 git-scsi-rc-fixes.patch
 git-unionfs.patch
 git-wireless.patch
 git-ipwireless_cs.patch
 git-gccbug.patch

 git trees

-uml-hostfs-fix-double-free.patch
-uml-hostfs-make-hostfs=-option-work-as-a-jail-as-intended.patch
-uml-fix-a-memory-leak-in-the-multicast-driver.patch
-uml-remove-dead-code-about-os_usr1_signal-and-os_usr1_process.patch
-uml-mark-both-consoles-as-con_anytime.patch
-uml-fix-confusion-irq-early-reenabling.patch
-uml-activate_fd-return-enomem-only-when-appropriate.patch
-uml-fix-errno-usage.patch
-x86_64-fix-2618-regression-ptrace_oldsetoptions-should-be-accepted.patch
-bluetooth-fix-socket-locking-in-hci_sock_dev_event.patch
-add-epoll-compat_-code-to-fs-compatc.patch
-check_partition-fix-error-check.patch
-uml-arch_prctl-should-set-thread-fs.patch
-connector-bugfix-for-cn_call_callback.patch
-26-altix-console-fix-for-config_debug_shirq-usage.patch
-ecryptfs-nested-locking-annotation.patch
-swsusp-disable-nonboot-cpus-before-entering-platform-suspend.patch
-paravirt-build-fixes.patch
-acpi-disabled-due-to-dmi-failure-or-blacklisted-year-should-be-noted-as-is-done-with-other-acpi-blacklisting.patch
-git-alsa-oops-fix.patch
-avr32-dma-mappingh.patch
-gregkh-driver-device-symlink.patch
-gregkh-driver-platform-reorder-platform_device_del.patch
-gregkh-driver-remove-devfs-from-maintainers.patch
-gregkh-driver-driver-core-export-device_rename.patch
-gregkh-driver-uio-irq.patch
-scheduled-removal-of-sa_xxx-interrupt-flags-fixups-4.patch
-make-drivers-char-drm-drm_vmcdrm_io_prot-static.patch
-fix-saa7146_clipping_mem-size.patch
-drivers-media-video-cpia_ppc-dont-use-_work_nar.patch
-dvb-core-fix-several-locking-related-problems.patch
-saa7134-fix-modules=n-compilation.patch
-ivtv-warning-fix.patch
-jdelvare-i2c-i2c-03-use-i2c_adapterdevparent-for-messages.patch
-jdelvare-i2c-i2c-i801-restore-initial-state.patch
-jdelvare-i2c-ds1374-check-for-workqueue-creation.patch
-crash-on-evdev-disconnect.patch
-expose-set_mode-method-so-it-can-be-wrapped.patch
-ata_piix-remove-ugly-layering-violation.patch
-pata_cmd640-multiple-updates.patch
-ide-cmd64x-fix-recovery-time-calculation-take2.patch
-mtd-maps-ck804xromc-pci_module_init-to-pci_register_driver.patch
-mtd-chips-oops-in-cfi_amdstd_sync.patch
-mtd-esb2-check-for-closed-rom-window.patch
-dilnetpc-fix-warning.patch
-mtd-correct-misspelled-preprocessor-variable.patch
-git-netdev-all-ipw2200-fix.patch
-mv643xx-ethernet-driver-irq-registration-fix.patch
-via-rhine-set-avoid_d3-for-broken-bioses.patch
-netxen-fix-warnings.patch
-e1000-fix-be-ready-for-incoming-irq-at-pci_request_irq.patch
-e1000-fix-firmware-handover-bits.patch
-e1000-fix-stop-raw-interrupts-disabled-nag-from-rt.patch

Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded

2007-03-19 Thread Andrew Morton
On Tue, 20 Mar 2007 13:47:53 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Mon, 19 Mar 2007 17:58:52 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:
> > 
> > 
> >>The kernel without Nick's patchset but with the assert runs OK too.  Under
> >>the principle of mm-has-been-too-flakey-lately, I'll drop the patches:
> >>
> >>mm-debug-check-for-the-fault-vs-invalidate-race.patch
> >>mm-simplify-filemap_nopage.patch
> >>mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> >>mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> >>mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-tidy.patch
> >>mm-merge-nopfn-into-fault.patch
> >>mm-merge-nopfn-into-fault-fix.patch
> >>mm-remove-legacy-cruft.patch
> > 
> > 
> > ug, too many rejects.  I'll leave them in, minus
> > mm-debug-check-for-the-fault-vs-invalidate-race.patch
> > 
> 
> Hang on a sec... I'll try fixing the thing before you next make a
> release.
> 

Too late.  hot-fixes/ awaits thee.


Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Paul Mackerras
Christoph Lameter writes:

> +static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *))
> +{

...

> + p = (void *)__get_free_page(flags | __GFP_ZERO);

This will cause problems on 64-bit powerpc, at least with 4k pages,
since the pmd and pgd levels only use 1/4 of a page.

Paul.


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Greg KH
On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
> Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > 
> > well we can do the handshake to take ownership like we do much later in
> > boot, but that requires PCI to be there and fully discovered, which we
> > don't have this early.
> 
> That's not true - we do early pci discovery. Doing USB handsoff
> there would be quite possible.

What, we don't do USB "handoff" early enough in the boot process?  It's
happening at PCI quirk time now, which I think should be early enough
for everyone (and too early for some who rely on USB keyboards and
initramfs shells...)

thanks,

greg k-h


Re: [PATCH] sysctl: vfs_cache_divisor

2007-03-19 Thread H. Peter Anvin

Randy Dunlap wrote:
>> Then we duplicate all the relevant /proc knobs:
>>
>> cat /proc/sys/vm/dirty_ratio
>> 30
>> cat /proc/sys/vm/hires-dirty_ratio/
>> 30
>>
>> Or we do something else ;)
>
> Sounds better.  I wasn't very keen on the userspace interface that this
> exposed.  Will look at those.

Okay... maybe I could throw a spanner in the machinery, and suggest
another option: perhaps we should add a way to do sysctl which can
handle fractional (fixed-point) values... more coherent/detailed message
tomorrow.
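
For readers unfamiliar with the term, here is a rough sketch of what
accepting a fractional value as fixed-point could look like. It is only an
illustration: the message above does not say how such a sysctl would be
implemented, and the names parse_fixed, FIXED_DIGITS and FIXED_SCALE, as
well as the choice of three decimal places, are assumptions made for this
example.

```c
#include <stdint.h>

#define FIXED_DIGITS 3      /* hypothetical: keep three decimal places */
#define FIXED_SCALE  1000u  /* 10^FIXED_DIGITS */

/*
 * Parse a non-negative decimal string such as "30", "0.5" or "12.345"
 * into an integer number of thousandths.  Fraction digits beyond
 * FIXED_DIGITS are truncated.  One trailing newline is accepted, as a
 * write to a /proc file would usually carry one.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
int parse_fixed(const char *s, uint32_t *out)
{
    uint64_t whole = 0, frac = 0, value;
    int frac_digits = 0, seen_digit = 0;

    /* Integer part. */
    for (; *s >= '0' && *s <= '9'; s++) {
        whole = whole * 10 + (uint64_t)(*s - '0');
        if (whole > UINT32_MAX)
            return -1;
        seen_digit = 1;
    }

    /* Optional fractional part. */
    if (*s == '.') {
        for (s++; *s >= '0' && *s <= '9'; s++) {
            if (frac_digits < FIXED_DIGITS) {
                frac = frac * 10 + (uint64_t)(*s - '0');
                frac_digits++;
            }
            seen_digit = 1;
        }
    }

    if (*s == '\n')
        s++;
    if (!seen_digit || *s != '\0')
        return -1;

    /* Pad a short fraction: "0.5" means 500 thousandths, not 5. */
    for (; frac_digits < FIXED_DIGITS; frac_digits++)
        frac *= 10;

    value = whole * FIXED_SCALE + frac;
    if (value > UINT32_MAX)
        return -1;

    *out = (uint32_t)value;
    return 0;
}
```

With this representation a knob that reads "30" today would be stored as
30000, and "0.5" as 500, so finer-grained values need no second knob.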


-hpa


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Rusty Russell
On Mon, 2007-03-19 at 18:00 -0800, Zachary Amsden wrote:
> Rusty Russell wrote:
> > *This* was the reason that the current hand-coded calls only clobber
> > %eax.  It was a compromise between native (no clobbers) and others (might
> > need a reg).
> 
> I still don't think this was a good trade.
...
> Xen no longer needs such a register

Hmm, well, if VMI is happy, Xen is happy, and lguest is happy, then
perhaps we're better off with a cc-only clobber rule?  Certainly makes
life simpler.

> > Now, since we decided to allow paravirt_ops operations to be normal C
> > (ie. the patching is optional and done late), we actually push and pop
> > %ecx and %edx.  This makes the call site 10 bytes long, which is a nice
> > size for patching anyway (enough for a movl $0, , a-la lguest's
> > cli, or movw $0, %gs: if we supported SMP).
> 
> You can do it in 11 bytes with no clobbers and normal C semantics by 
> linking to a direct address instead of calling to an indirect, but then 
> you need some gross fixup technology in paravirt_patch:
> 
> if (call_addr == (void*)native_sti) {
>   ...
> }

Well, I don't think we need such hacks: since we have to use handcoded
asm and mark the callsites anyway, marking what they're calling is
trivial.

The other idea from "btfixup" is that we can do the patching *much*
earlier, so we don't need the initial code to be valid at all if we
wanted to: we just need room to patch in a call insn.  We could then
generate trampolines which do the necessary pushes & pops automatically
for backends which want to use C calling conventions.

Perhaps it's time for code and benchmarks?

Rusty.



Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Eric W. Biederman
David Miller <[EMAIL PROTECTED]> writes:

> From: Linus Torvalds <[EMAIL PROTECTED]>
> Date: Mon, 19 Mar 2007 20:18:14 -0700 (PDT)
>
>> > > Please don't subject us to another couple months of hair-pulling only
>> > > to have Linus yank the thing out again, there are certainly more
>> > > useful things to spend time on :-)
>> 
>> Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it 
>> out, I simply won't merge it. It was more than just totally buggy code, it 
>> was an inability of the people to understand that even bugfree code 
>> isn't enough - you have to be able to also handle buggy data.
>
> Thank you.

Hmm..

I know the feeling; I have had a similar rant about the kexec on panic
code path.   The code is still nowhere near as paranoid about normal
kernel things not working as it could be, but by ranting about it
periodically the people doing the work are gradually making it better.

I'm conflicted about the dwarf unwinder.  I was off doing other things
at the time so I missed the pain, but I do have a distinct recollection of
the back traces on x86_64 being distinctly worse than on i386.  Lately
I haven't seen that so it may be I was misinterpreting what I was
seeing, and the compiler optimizations were what gave me such weird
back traces.

But if the quality of our backtraces has gone down and a dwarf unwinder
could give us better back traces it is likely worth pursuing.  Of
course it would need to start with the assumption that its tables
may be borked (the kernel is busted after all) and be much more
careful than Andi's last attempt.

Eric


Re: [stable] [patch 29/31] Input: i8042 - fix AUX IRQ delivery check

2007-03-19 Thread Greg KH
On Mon, Mar 19, 2007 at 05:48:55PM -0400, Dmitry Torokhov wrote:
> On 3/19/07, Greg KH <[EMAIL PROTECTED]> wrote:
> > -stable review patch.  If anyone has any objections, please let us know.
> >
> > --
> >
> > From: Dmitry Torokhov <[EMAIL PROTECTED]>
> >
> > Input: i8042 - fix AUX IRQ delivery check
> >
> > On boxes that do not implement AUX LOOP command we can not
> > verify AUX IRQ delivery and must assume that it is wired
> > properly.
> >
> 
> Greg,
> 
> There is another piece missing in AUX delivery test, commit
> 
> 3ca5de6dd4ec5a139b2b8f00dce3e4726ca91af1
> 
> Unfortunately I can't send you a patch at the moment but if you could
> get it from the mainline that would be great.

Thanks for letting me know, I've added it to the queue now.

greg k-h


Re: [patch 03/31] Fix user copy length in ipv6_sockglue.c

2007-03-19 Thread Greg KH
On Mon, Mar 19, 2007 at 03:01:25PM -0700, Chris Wright wrote:
> * Greg KH ([EMAIL PROTECTED]) wrote:
> > From: Chris Wright <[EMAIL PROTECTED]>
> > 
> > [IPV6] fix ipv6_getsockopt_sticky copy_to_user leak
> > 
> > User supplied len < 0 can cause leak of kernel memory.
> > Use unsigned compare instead.
> 
> You can drop this one.  It's dependent on a patch
> that is not in 2.6.20.

Ok, thanks for letting me know, it is now dropped.

greg k-h


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread David Miller
From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Mon, 19 Mar 2007 20:18:14 -0700 (PDT)

> > > Please don't subject us to another couple months of hair-pulling only
> > > to have Linus yank the thing out again, there are certainly more
> > > useful things to spend time on :-)
> 
> Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it 
> out, I simply won't merge it. It was more than just totally buggy code, it 
> was an inability of the people to understand that even bugfree code 
> isn't enough - you have to be able to also handle buggy data.

Thank you.


Re: [PATCH 2.6.22 3/3] Add LED trigger to libata core

2007-03-19 Thread Tejun Heo

Tony Vroon wrote:
> The first user of ata_qc_issue_prot_with_ledtrigger, the ServerWorks Frodo/
> Apple K2 driver. Used by the IDE LED trigger on G5 towers.
> Respin of an earlier patch, based on comments by Tejun Heo & Alan Cox.


Just two comments.

1. IMHO, ata_qc_issue_prot_ledtrigger() without 'with' is good enough. 
This is just my personal preference.  Feel free to ignore it.


2. Patch #1 and #2 should be merged.  They're one logical change of 
adding ata_qc_issue_prot_with_ledtrigger().  Patch #3 is a logically 
separate change of using it, but unless it's a wide conversion, 
implementing something and using something can be merged.  So, please 
merge #1 and #2 and possibly #3.


Thanks.

--
tejun


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Linus Torvalds


On Mon, 19 Mar 2007, Andi Kleen wrote:
> 
> Initially we had some bugs that accounted for near all failures, but they 
> were all fixed in the latest version.

No. The real bugs were that the people involved wouldn't even accept that 
unwinding information was inevitably buggy and/or incomplete.

That much more fundamental bug never got fixed, as far as I know. 

I'm not going to merge anything that depends on unwind tables as things 
stand. The pain just isn't worth it.

> > Please don't subject us to another couple months of hair-pulling only
> > to have Linus yank the thing out again, there are certainly more
> > useful things to spend time on :-)

Good call. Dwarf2 unwinding simply isn't worth doing. But I won't yank it 
out, I simply won't merge it. It was more than just totally buggy code, it 
was an inability of the people to understand that even bugfree code 
isn't enough - you have to be able to also handle buggy data.

Linus


Re: RSDL v0.31

2007-03-19 Thread Linus Torvalds


On Mon, 19 Mar 2007, Xavier Bestel wrote:
> > >> Stock scheduler wins easily, no contest.
> > > 
> > > What happens when you renice X ?
> > 
> > Dunno -- not necessary with the stock scheduler.
> 
> Could you try something like renice -10 $(pidof Xorg) ?

Could you try something as simple as accepting that maybe this is a
problem?

Quite frankly, I was *planning* on merging RSDL very early after 2.6.21, 
but there is one thing that has turned me completely off the whole thing:

 - the people involved seem to be totally unwilling to even admit there 
   might be a problem.

This is like alcoholism. If you cannot admit that you might have a 
problem, you'll never get anywhere. And quite frankly, the RSDL proponents 
seem to be in denial ("we're always better", "it's your problem if the old 
scheduler works better", "just one report of old scheduler being better").

And the thing is, if people aren't even _willing_ to admit that there may 
be issues, there's *no*way*in*hell* I will merge it even for testing. 
Because the whole and only point of merging RSDL was to see if it could 
replace the old scheduler, and the most important feature in that case is 
not whether it is perfect, BUT WHETHER ANYBODY IS INTERESTED IN TRYING TO 
FIX THE INEVITABLE PROBLEMS!

See?

Can you people not see that the way you're doing that "RSDL is perfect" 
chorus in the face of people who report problems, you're just making it 
totally unrealistic that it will *ever* get merged.

So unless somebody steps up to the plate and actually *talks* about the 
problem reports, and admits that maybe RSDL will need some tweaking, I'm 
not going to merge it.

Because there is just _one_ thing that is more important than code - and 
that is the willingness to fix the code...

Linus


Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded

2007-03-19 Thread Nick Piggin

Andrew Morton wrote:
> On Mon, 19 Mar 2007 17:58:52 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:
>
>> The kernel without Nick's patchset but with the assert runs OK too.  Under
>> the principle of mm-has-been-too-flakey-lately, I'll drop the patches:
>>
>> mm-debug-check-for-the-fault-vs-invalidate-race.patch
>> mm-simplify-filemap_nopage.patch
>> mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
>> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
>> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-tidy.patch
>> mm-merge-nopfn-into-fault.patch
>> mm-merge-nopfn-into-fault-fix.patch
>> mm-remove-legacy-cruft.patch
>
> ug, too many rejects.  I'll leave them in, minus
> mm-debug-check-for-the-fault-vs-invalidate-race.patch

Hang on a sec... I'll try fixing the thing before you next make a
release.

--
SUSE Labs, Novell Inc.


ignore this posting

2007-03-19 Thread David Miller

Just trying to generate an example bounce so Intel can fix
their attachment email filters, ignore me.

#!/bin/sh
#
# Usage: git suck path-to-tree
#
# Pull all patches relative to 'origin' from the tree specified
# and apply them to the current directory tree, keeping all changelog
# and authorship information identical.  It will update the dates
# of the changes of course.
(cd "$1" && git format-patch --suffix=.txt origin) || exit 1
for i in "$1"/*.txt
do
   sed 's/\[PATCH\] //' <"$i" >tmp.patch
   git-applymbox -k tmp.patch || exit 1
done
rm -f tmp.patch


RE: UDP packets scheduling

2007-03-19 Thread David Schwartz

> can anyone suggest me a proper way how to schedule UDP packets to
> transmit at
> some given rate?
>
> E.g., I have two boxes both having 10 GE interfaces. One box is able to
> transmit at 9.9Gbps, the other one is able to receive only at
> about 5.5Gbps.
> Flow control must be turned off for some other reason.

UDP is not a very good choice of protocol for this purpose. UDP pushes the
transmit timing job into user-space, where it cannot be done particularly
well.

> How can I put delay between subsequent msg sends to achieve desired
> packet rate without loses, e.g., 3.5Gbps without bursts? Even nanosleep()
> with the lowest possible delay seems to be too much delay. Busy loop with
> clock_gettime(3) works OK on SMP boxes, but on UP it causes problems.

Why do you want to avoid bursts? You're going to be bursting between 10Gb/s
and 0 anyway.

It sounds like you're deliberately putting impossible requirements on
yourself choosing the worst possible protocol and demanding the pacing be
perfect. I don't think the technology to do that is here yet, but why would
you possibly need it?

10GE cards tend to have large buffers precisely because it's not possible to
get the timing even. Why is burstiness a problem?
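
To put rough numbers on why per-packet sleeping cannot work here, the
following is a minimal sketch of the pacing arithmetic. The 1500-byte packet
size and the 1 ms timer tick used in the note below are assumptions chosen
for illustration, and the function names are made up for this example;
headers and framing overhead are ignored.

```c
#include <stdint.h>

/*
 * Nanoseconds between the start of consecutive packets needed to hold a
 * target rate.  The result is truncated toward zero.  rate_bps must be
 * non-zero; packet_bytes is expected to be an ordinary datagram size
 * (well under 64 KiB), so the intermediate product cannot overflow.
 */
uint64_t pacing_interval_ns(uint32_t packet_bytes, uint64_t rate_bps)
{
    return (uint64_t)packet_bytes * 8u * 1000000000ull / rate_bps;
}

/*
 * If the sender can only wake up once per timer tick, this is how many
 * packets it has to release per wakeup to average the target rate.
 * The result is truncated toward zero.
 */
uint64_t packets_per_tick(uint32_t packet_bytes, uint64_t rate_bps,
                          uint64_t tick_ns)
{
    return rate_bps * tick_ns /
           ((uint64_t)packet_bytes * 8u * 1000000000ull);
}
```

At 3.5Gbps with 1500-byte packets the gap between packets is about 3.4
microseconds, while a sender that wakes once per millisecond must emit
roughly 291 packets per wakeup. Under those assumptions the traffic is
bursty by construction, which is the point being made above.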

DS




Re: [6/6] 2.6.21-rc4: known regressions

2007-03-19 Thread David Miller
From: Adrian Bunk <[EMAIL PROTECTED]>
Date: Sun, 18 Mar 2007 19:49:38 +0100

> Subject: ipv6 crash
> References : http://lkml.org/lkml/2007/3/10/2
> Submitter  : Len Brown <[EMAIL PROTECTED]>
> Status : unknown

This is caused by some problem in the router round-robin code in
net/ipv6/route.c:rt6_select()

Somehow it NULLs out fn->leaf, and then fib6_add_1() crashes
dereferencing that NULL pointer as is seen in the report.

Deleting the router round-robin list mangling code in rt6_select()
makes the crash go away, but such a change causes regressions in the
ipv6 conformance tests.

Thomas Graf discovered this bug some time ago, but we still
haven't come up with a fix suitable for upstream :-/

This bug has been there for a very long time and is not a regression
of 2.6.21

I'll see if I can come up with something to fix this properly.


Re: SMP performance degradation with sysbench

2007-03-19 Thread Zhang, Yanmin
On Wed, 2007-03-14 at 16:33 -0700, Siddha, Suresh B wrote:
> On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> > I would agree that it points to MySQL scalability issues, however the
> > fact that such large gains come from tcmalloc is still interesting.
> 
> What glibc version are you, Anton and others using?
> 
> Does that version have this fix included?
> 
> Dynamically size mmap threshold if the program frees mmaped blocks.
> 
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc
> 
Last week, I reproduced it on RHEL4U3 with glibc 2.3.4-2.19. Today, I
installed RHEL5GA and reproduced it again. RHEL5GA uses glibc 2.5-12 which
already includes the dynamically sized mmap threshold patch, so this patch
doesn’t resolve the issue.

The problem really is related to multi-threaded malloc/free in glibc.

My paxville has 16 logical CPUs (dual core+HT). I disabled HT by hot
removing the last 8 logical processors.

I captured the schedule status. When sysbench thread=8 (best performance),
there are about 3.4% context switches caused by __down_read/__down_write_nested.
When sysbench thread=10 (best performance), the percentage becomes 11.83%.

I captured the thread status by gdb. When sysbench thread=10, usually 2 threads
are calling mprotect/mmap. When sysbench thread=8, there are no threads calling
mprotect/mmap. Such captures are random samples, but I tried many times.

I think the increased percentage of context switches related to
__down_read/__down_write_nested is caused by mprotect/mmap. mprotect/mmap
take the vm semaphore, so there is contention on the semaphore which
degrades performance.

The strace shows mysqld often calls mprotect/mmap with the same data length
61440. That’s further evidence. Gdb showed such mprotect is called by
init_io_malloc=>my_malloc=>malloc=>_int_malloc=>mprotect. Mmap is caused by
_int_free=>mmap. I checked the source code of glibc and found the real call
chains are malloc=>_int_malloc=>grow_heap=>mprotect and
_int_free=>heap_trim=>mmap.

I guess the transaction processing of mysql/sysbench is: mysql accepts a
connection and initiates a block for the connection. After processing a couple
of transactions, sysbench closes the connection. Then the procedure restarts.

So why are there so many mprotect/mmap calls?

Glibc uses arenas to speed up malloc/free in a multi-threaded environment.
mp_.trim_threshold only controls main_arena. In function _int_free,
FASTBIN_CONSOLIDATION_THRESHOLD might be helpful, but it’s a fixed value.

The *ROOT CAUSE* is that the dynamic thresholds don’t apply to non-main arenas.

To verify my idea, I created a small patch. When freeing a block, always
check mp_.trim_threshold even though it might not be in main arena. The
patch is just to verify my idea rather than the final solution.

--- glibc-2.5-20061008T1257_bak/malloc/malloc.c 2006-09-08 00:06:02.0 +0800
+++ glibc-2.5-20061008T1257/malloc/malloc.c 2007-03-20 07:41:03.0 +0800
@@ -4607,10 +4607,13 @@ _int_free(mstate av, Void_t* mem)
   } else {
/* Always try heap_trim(), even if the top chunk is not
   large, because the corresponding heap might go away.  */
+   if ((unsigned long)(chunksize(av->top)) >=
+   (unsigned long)(mp_.trim_threshold)) {
heap_info *heap = heap_for_ptr(top(av));
 
assert(heap->ar_ptr == av);
heap_trim(heap, mp_.top_pad);
+   }
   }
 }
 

With the patch, I recompiled glibc and reran sysbench/mysql. The result is good.
When the thread number is larger than 8, the tps and average response time stay
smooth and don't drop severely.
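
The effect of the verification patch can be modelled outside glibc. The sketch below is only an illustration of the decision being changed, not glibc code: `struct arena_model` and both helper functions are hypothetical names, and the 128 KiB threshold in the usage note is just an example value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an arena: only the size of the top chunk
 * and whether this is the main arena matter for the sketch. */
struct arena_model {
    size_t top_size;   /* bytes currently in the top chunk */
    int    is_main;    /* 1 for the main arena, 0 for a thread arena */
};

/* Behaviour described above, before the patch: the main arena compares
 * the top chunk against the trim threshold, a non-main arena always
 * attempts heap_trim(). */
static int tries_trim_before_patch(const struct arena_model *a,
                                   size_t trim_threshold)
{
    if (a->is_main)
        return a->top_size >= trim_threshold;
    return 1;
}

/* Behaviour with the verification patch: a non-main arena is held to
 * the same threshold test before heap_trim() is attempted. */
static int tries_trim_after_patch(const struct arena_model *a,
                                  size_t trim_threshold)
{
    assert(a != NULL);
    return a->top_size >= trim_threshold;
}
```

With a 61440-byte top chunk in a thread arena and a 128 KiB threshold, the first function reports a trim attempt on every such free and the second reports none, which matches the drop in mprotect/mmap traffic described above.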

Is anyone able to test it on an AMD machine?

Yanmin


RE: MPT Fusion LSI22320 , Domain validation loops .

2007-03-19 Thread Moore, Eric
On Saturday, March 17, 2007 2:33 PM,  James W. Laferriere wrote:
>   Hello All ,  I have been having this problem since I purchased the
> controller and after changing out the disks I thought were the problem .
>   I am still getting the continuous :
> 
> mptscsih: ioc1: attempting task abort! (sc=f7a64500)
> scsi 3:0:4:0:
>  command: Inquiry: 12 00 00 00 60 00
> mptbase: Initiating ioc1 recovery
> mptscsih: ioc1: task abort: SUCCESS (sc=f7a64500)
>   target3:0:4: Domain Validation detected failure, dropping back
>   target3:0:4: Domain Validation skipping write tests
>   target3:0:4: Ending Domain Validation
>   target3:0:4: asynchronous
>   target3:0:5: Beginning Domain Validation
> mptscsih: ioc0: attempting target reset! (sc=f7a64380)
> 
>   The actual device id's change and the driver continuously resets the
> busses & starts all over .
> 
>   The disks are in a HP DS-SL13R-BA 4354R 14drive ultra3 rackmount cabinet
> w/ dualbus & dualps ,  Which seems to present a ID6 ,  That does not show up in
> any of the bus scans .
> 
>   Now I have previously had the same cabinet with 18gb disks which had the
> same problem with this controller .  BUT I also have a LSI Logic / Symbios
> Logic 53c1010 66MHz Ultra3 dual SCSI bus Adapter which works flawlessly with the
> 18gb disks in this very same cabinet .
>   The cables for connecting the adapter(s) to the cabinet are less than 24
> inches in length .
> 
>   Would anyone please shed some light on what it is I am doing wrong or
> need to do or ?  To have this controller recognise these disk drives in
> this cabinet .

There is a separate mailing list for SCSI-related issues, e.g.
[EMAIL PROTECTED]   I've posted a patch to address your issue several times,
however it seems it has not been picked up by the scsi subsystem
maintainer.   The last time it was posted was here:
http://marc.info/?l=linux-scsi&m=117089244809072&w=2   An alternative is
to obtain our latest drivers from the LSI download site, where
the drivers should include this patch:
http://www.lsilogic.com/cm/DownloadSearch.do

Eric




Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded

2007-03-19 Thread Andrew Morton
On Mon, 19 Mar 2007 17:58:52 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:

> The kernel without Nick's patchset but with the assert runs OK too.  Under
> the principle of mm-has-been-too-flakey-lately, I'll drop the patches:
> 
> mm-debug-check-for-the-fault-vs-invalidate-race.patch
> mm-simplify-filemap_nopage.patch
> mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-tidy.patch
> mm-merge-nopfn-into-fault.patch
> mm-merge-nopfn-into-fault-fix.patch
> mm-remove-legacy-cruft.patch

ug, too many rejects.  I'll leave them in, minus
mm-debug-check-for-the-fault-vs-invalidate-race.patch


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Zachary Amsden

Rusty Russell wrote:
> On Mon, 2007-03-19 at 11:38 -0700, Linus Torvalds wrote:
>> On Mon, 19 Mar 2007, Eric W. Biederman wrote:
>>> True.  You can use all of the call clobbered registers.
>>
>> Quite often, the biggest single win of inlining is not so much the code
>> size (although if done right, that will be smaller too), but the fact that
>> inlining DOES NOT CLOBBER AS MANY REGISTERS!



For VMI, the default clobber was "cc", and you need a way to allow at 
least that, because saving and restoring flags is too expensive on x86.



> Thanks Linus.
>
> *This* was the reason that the current hand-coded calls only clobber
> %eax.  It was a compromise between native (no clobbers) and others (might
> need a reg).

I still don't think this was a good trade.  The primary motivation for 
clobbering %eax was that Xen wanted a free register to use for computing 
the offset into the shared data in the case of SMP preemptible kernels.  
Xen no longer needs such a register, they can use the PDA offset 
instead.  And it does hurt native performance by unconditionally 
stealing a register in the four most commonly invoked paravirt-ops code 
sequences.



> Now, since we decided to allow paravirt_ops operations to be normal C
> (ie. the patching is optional and done late), we actually push and pop
> %ecx and %edx.  This makes the call site 10 bytes long, which is a nice
> size for patching anyway (enough for a movl $0, , a-la lguest's
> cli, or movw $0, %gs: if we supported SMP).

You can do it in 11 bytes with no clobbers and normal C semantics by 
linking to a direct address instead of calling to an indirect, but then 
you need some gross fixup technology in paravirt_patch:


if (call_addr == (void*)native_sti) {
 ...
}

I think we should probably try to do it in 12 bytes.  Freeing eax to the
inline caller is likely to make up for the 2 extra bytes of space we have to nop.


One thing I always tried to get in VMI was to encapsulate the actual 
code which went through the business of computing arguments that were 
not even used in the native case.  Unfortunately, that seems impossible 
in the current design, but I don't think it is an issue because I don't 
think there is actually a way to express:


SWITCHABLE_CODE_BLOCK_BEGIN {
  /* arbitrary C code for native */
} SWITCHABLE_CODE_BLOCK_ALTERNATIVE {
  /* arbitrary C code for something else */
}

Dave's linker suggestion is probably the best for things like that.

Zach


Re: mm snapshot broken-out-2007-03-18-02-44.tar.gz uploaded

2007-03-19 Thread Andrew Morton
On Mon, 19 Mar 2007 22:37:46 +0100 "Michal Piotrowski" <[EMAIL PROTECTED]> wrote:

> On 19/03/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > On Mon, 19 Mar 2007 20:23:40 +0100
> > Michal Piotrowski <[EMAIL PROTECTED]> wrote:
> >
> > > [EMAIL PROTECTED] wrote:
> > > > The mm snapshot broken-out-2007-03-18-02-44.tar.gz has been uploaded to
> > > >
> > > >
> > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-03-18-02-44.tar.gz
> > > >
> > > > It contains the following patches against 2.6.21-rc4:
> > > >
> > >
> > > [ cut here ]
> > > kernel BUG at mm/filemap.c:123!
> > > invalid opcode:  [#1]
> > > PREEMPT SMP
> > > last sysfs file: devices/platform/w83627hf.656/temp2_input
> > > Modules linked in: ipt_MASQUERADE iptable_nat nf_nat nfsd exportfs lockd 
> > > nfs_acl autofs4 sunrpc af_packet nf_conntrack_netbios_ns ipt_REJECT 
> > > nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink iptable_filter 
> > > ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 
> > > binfmt_misc thermal processor fan container nvram snd_intel8x0 
> > > snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event 
> > > snd_seq snd_seq_device snd_pcm_oss evdev snd_mixer_oss snd_pcm intel_agp 
> > > agpgart snd_timer snd soundcore i2c_i801 snd_page_alloc ide_cd cdrom rtc 
> > > unix
> > > CPU:0
> > > EIP:0060:[]Not tainted VLI
> > > EFLAGS: 00010002   (2.6.21-rc4-mm1 #13)
> > > EIP is at __remove_from_page_cache+0x42/0x4a
> > > eax: 0001   ebx: ca263a58   ecx: c043c968   edx: 0001
> > > esi: c6ad3480   edi:    ebp: c968dde8   esp: c968dde0
> > > ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
> > > Process bash-shared-map (pid: 12273, ti=c968c000 task=c78bc030 
> > > task.ti=c968c000)
> > > Stack: ca263a68 c6ad3480 c968ddf8 c016161b c6ad3480 00da c968de04 
> > > c016824d
> > >c6ad3480 c968de88 c0168525 1000   d17dc000 
> > > 0005a91a
> > > ca263a58 005b  091a 0110 c54eb5e0 
> > > 0004
> > > Call Trace:
> > >  [] show_trace_log_lvl+0x1a/0x2f
> > >  [] show_stack_log_lvl+0x9d/0xac
> > >  [] show_registers+0x1ed/0x34c
> > >  [] die+0x11d/0x234
> > >  [] do_trap+0x8a/0xa3
> > >  [] do_invalid_op+0x97/0xa1
> > >  [] error_code+0x7c/0x84
> > >  [] remove_from_page_cache+0x35/0x40
> > >  [] truncate_complete_page+0x38/0x42
> > >  [] truncate_inode_pages_range+0x2ce/0x2fe
> > >  [] truncate_inode_pages+0x1a/0x1c
> > >  [] vmtruncate+0x40/0xbb
> > >  [] inode_setattr+0x5c/0x137
> > >  [] ext3_setattr+0x19c/0x1f8
> > >  [] notify_change+0x139/0x2ec
> > >  [] do_truncate+0x53/0x6c
> > >  [] do_sys_ftruncate+0x135/0x150
> > >  [] sys_ftruncate64+0x1b/0x1d
> > >  [] syscall_call+0x7/0xb
> >
> > Ugly - it's hard to determine which patch might have caused that, but I
> > bet it was Nick ;)
> >
> > How hard is it to reproduce?
> 
> I think that it's very easy - run bash_shared_mapping from AutoTest
> for a few seconds.
> 

Yeah, a simple `bash-shared-mapping foo 1' goes splat after a few
seconds.

Which indicates that the patchset just isn't working as intended, I think. 
Nick, did you ever run bash-shared-mapping on it?  You should - it's kinda
evil.

I could just drop the BUG_ON, or I could drop the whole patch series.



The kernel with Nick's patchset but without the assert seems to run OK. 
But presumably it's anonymising mapped pages, which is bad.

The kernel without Nick's patchset but with the assert runs OK too.  Under
the principle of mm-has-been-too-flakey-lately, I'll drop the patches:

mm-debug-check-for-the-fault-vs-invalidate-race.patch
mm-simplify-filemap_nopage.patch
mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-tidy.patch
mm-merge-nopfn-into-fault.patch
mm-merge-nopfn-into-fault-fix.patch
mm-remove-legacy-cruft.patch

A rollup against rc4 which includes the above patches and which is suitable
for raising fixups against is at http://userweb.kernel.org/~akpm/np.gz




Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
> Jeremy Fitzhardinge wrote:
>>  If we then work out in each direction and see matched push/pops,
>> then we know what registers can be trashed in the call.  This also
>> allows us to determine the callsite size, and therefore how much space
>> we need for inlining.
>>   
>
> No, that is a very dangerous suggestion.  You absolutely *cannot* do
> this safely without explicitly marking the start EIP of this code. 
> You *must* use metadata to do that.  It is never safe to disassemble
> backwards or "rewind" EIP for x86 code. 

What do you mean the instruction before is "mov $0x52515000,%eax"?

Yeah, you're right.  Oh well.
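
The mov example can be made concrete. On x86, `mov $0x52515000,%eax` assembles to `b8 00 50 51 52`, and the last three immediate bytes are also the one-byte opcodes for `push %eax`, `push %ecx` and `push %edx`. The sketch below is a hypothetical userspace illustration, not code from any patch in this thread: `count_pushes_before` is a made-up name for the naive "scan backwards from the call" heuristic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive heuristic: starting at the byte just before the call opcode,
 * walk backwards while the bytes look like one-byte "push reg32"
 * opcodes (0x50..0x57) and count them. */
static size_t count_pushes_before(const uint8_t *code, size_t call_off)
{
    size_t n = 0;

    assert(code != NULL);
    while (n < call_off) {
        uint8_t b = code[call_off - 1 - n];

        if (b < 0x50 || b > 0x57)
            break;
        n++;
    }
    return n;
}

/* push %eax; push %ecx; push %edx; call rel32 */
static const uint8_t genuine[] = {
    0x50, 0x51, 0x52, 0xe8, 0x00, 0x00, 0x00, 0x00
};

/* mov $0x52515000,%eax; call rel32
 * The imm32 is stored little-endian as 00 50 51 52. */
static const uint8_t misleading[] = {
    0xb8, 0x00, 0x50, 0x51, 0x52, 0xe8, 0x00, 0x00, 0x00, 0x00
};
```

Scanning backwards from the `0xe8` byte reports three pushes for both buffers, although the second contains none. Only a recorded start address for the call site can tell the two apart.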

J


Re: [BUG] no boot with 2.6.21-rc3 and later

2007-03-19 Thread Bob Tracy
Jiri wrote:
> Looks like it's related to some change in drivers/ide. As there have been 
> only 13 patches in this area between rc2 and rc3, it should take only 3 or 
> 4 reboots to figure the offending patch using git-bisect - could you 
> please give it a try?

I applied all of the 2.6.21-rc2-rc3 incremental patch except for the
portion applicable to "drivers/ide" files.  The problem seems to be
elsewhere: 2.6.21-rc3 minus the drivers/ide changes still hangs at the
same spot during the boot process.

Any ideas where to look next?  Thanks!

-- 
---
Bob Tracy   WTO + WIPO = DMCA? http://www.anti-dmca.org
[EMAIL PROTECTED]
---


Re: [patch 1/13] signal/timer/event fds v7 - anonymous inode source ...

2007-03-19 Thread Davide Libenzi
On Tue, 20 Mar 2007, Thomas Gleixner wrote:

> > +   error = -ENFILE;
> > +   file = get_empty_filp();
> > +   if (!file)
> > +   goto eexit_1;
> 
> make this "return -ENFILE;" please

Done


> > +   inode = aino_getinode();
> > +   if (IS_ERR(inode)) {
> > +   error = PTR_ERR(inode);
> > +   goto eexit_2;
> 
> Can you please use a bit more descriptive labels ?
> 
> e.g:
>   goto out_filp;

Done
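
The two review points so far (return directly when there is nothing to undo, and name each label after what it releases) can be sketched in plain userspace C. This is an illustration only, not the aino code: `struct pair`, `pair_create` and the `fail_at` test hook are hypothetical, and `malloc` stands in for the kernel allocators.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical object holding two resources acquired in order. */
struct pair {
    void *filp;
    void *inode;
};

/* fail_at is a test hook: 0 = no injected failure, 1 = first
 * allocation fails, 2 = second allocation fails. */
static int pair_create(struct pair *p, int fail_at)
{
    int error;

    assert(p != NULL);
    p->inode = NULL;

    p->filp = (fail_at == 1) ? NULL : malloc(16);
    if (!p->filp)
        return -ENFILE;        /* nothing to undo yet: return directly */

    p->inode = (fail_at == 2) ? NULL : malloc(16);
    if (!p->inode) {
        error = -ENOMEM;
        goto out_filp;         /* label says what is released */
    }

    return 0;

out_filp:
    free(p->filp);
    p->filp = NULL;
    return error;
}
```

A caller sees either 0 with both pointers set, or a negative errno with nothing left allocated.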


> > +static int ainofs_delete_dentry(struct dentry *dentry)
> > +{
> > +   /*
> > +* We faked vfs to believe the dentry was hashed when we created it.
> > +* Now we restore the flag so that dput() will work correctly.
> > +*/
> > +   dentry->d_flags |= DCACHE_UNHASHED;
> > +   return 1;
> > +}
> 
> Please put either "struct ainofs_dentry_operations ..." below the next
> function or move ainofs_delete_dentry() above "struct
> ainofs_dentry_operations ..."
> 
> It's annoying to look up the prototypes and implementation back and forth.

I prefer to have all data declarations at the beginning, but if you can
manage to have that requirement in the Coding Style, I'll change it ;)



> > +static struct inode *aino_getinode(void)
> > +{
> > +   return igrab(aino_inode);
> > +}
> 
> Please use "igrab(aino_inode);" directly in this one single place above.
> That saves us a prototype and a useless static function with no value.

Done



> > +/*
> > + * A single inode exist for all aino files. On the contrary of pipes,
> > + * aino inodes has no per-instance data associated, so we can avoid
> > + * the allocation of multiple of them.
> > + */
> > +static struct inode *aino_mkinode(void)
> > +{
> > +   int error = -ENOMEM;
> > +   struct inode *inode = new_inode(aino_mnt->mnt_sb);
> > +
> > +   if (!inode)
> > +   goto eexit_1;
> 
>   return ERR_PTR(-ENOMEM);

Done


> > +   aino_mnt = kern_mount(_fs_type);
> > +   if (IS_ERR(aino_mnt))
> > +   goto epanic;
> > +
> > +   aino_inode = aino_mkinode();
> > +   if (IS_ERR(aino_inode))
> > +   goto epanic;
> > +
> > +   return 0;
> > +
> > +epanic:
> > +   panic("aino_init() failed\n");
> 
> Panic ? It's not life critical - is it ? 
> 
> A printk(KERN_ERR...) and a return -Exx would be sufficient.

Done.




- Davide




Re: [patch 6/13] signal/timer/event fds v7 - timerfd core ...

2007-03-19 Thread Davide Libenzi
On Tue, 20 Mar 2007, Eric Dumazet wrote:

> Davide Libenzi a écrit :
> 
> > +struct timerfd_ctx {
> > +   struct hrtimer tmr;
> > +   ktime_t tintv;
> > +   spinlock_t lock;
> > +   wait_queue_head_t wqh;
> > +   unsigned long ticks;
> > +};
> 
> > +static struct kmem_cache *timerfd_ctx_cachep;
> 
> > +   timerfd_ctx_cachep = kmem_cache_create("timerfd_ctx_cache",
> > +   sizeof(struct timerfd_ctx),
> > +   0, SLAB_PANIC, NULL, NULL);
> 
> 
> Do we really expect thousands of active timerfd_ctx ?
> 
> If not, using kmalloc()/kfree() would be fine, because sizeof(struct
> timerfd_ctx) is so small.
> 
> on SMP / NUMA platforms, each new kmem_cache is rather expensive. (memory
> allocated at kmem_cache_create(), but also memory used when cache is not
> empty, with slabs in freelist for each cpu/node)
> 
> Using a general cache might be cheaper : No memory overhead for yet another
> kmem_cache.
> 
> I know individual caches are good to spot memory leaks, but in timerfd case,
> you don't have mem leaks, do you ? :)

Silly you, of course not :)
Yes, I guess I can use kmalloc/kfree for those fds ...



- Davide



Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Andrew Morton
On Mon, 19 Mar 2007 18:03:54 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Mon, 19 Mar 2007, Andrew Morton wrote:
> 
> > > See the patch. We are only touching 2 cachelines instead of 32. So even 
> > > without considering the page allocator overhead and the slab allocator 
> > > overhead (which will make the situation even better) its superior.
> > 
> > That's not proof, it is handwaving.  I could wave right back at you and
> > claim that the benefit from returning a cache-hot pte page back to the page
> > allocator for reuse exceeds the benefit which you waved at me above.
> 
> No you cannot make that claim. That would mean that you have to touch 
> 32 pages which is inferior.

For pte pages (which are far more common), more than a single cacheline
will be in cache.

Yes, a common quicklist implementation is good.  But no quicklist
implementation at all is better.  You say that will be slower, and you may
well be right, but I say let's demonstrate that (please) rather than
speculating.

Then we can look at the difference and decide whether it is worth the
additional complexity of this special-purpose private allocator.

> > You may well be right, but nothing is proven, afaict.
> 
> Nothing can be proven except within a rigorously defined mathematical 
> system but even there we are limited by such things as Russell's paradox.
> 
> It's obvious that this is right. And there has been significant work 
> invested into retaining page table pages on i386, sparc64 and ia64 for 
> exactly the specified.

I believe that work predated per-cpu-pages.

> This patch does not change that at all for these 3 
> arches. There is no doubt about the correctness of the approach here.
> 
> > > You do not think that our current way of handling ptes is okay? If we do 
> > > not zero the ptes then we need to separate munmap from process shutdown.
> > 
> > Yep.  It's possible that process shutdown is a sufficiently common and
> > costly special-case for it to be worth special-casing.
> 
> Ok great idea but what does this have to do with this patch? This patch 
> simply generalizes something that has been there for ages.

It has a lot to do with this patch.

If we decide that it is useful to optimise the full-mm teardown case then
we will need to zero these pages when we start to use them so we might as
well get them straight from the page allocator.  Hence this patch goes into
the bitbucket.

> > > The advantage of the quicklists is that it does not require a rework of 
> > > the pte serialization.
> > 
> > No, these are unrelated.  We can get pte pages from the page allocator and
> > zero them without touching the munmap handling.
> > 
> > But it's possible that if we _were_ to optimise the munmap handling as
> > suggested, the end result would be superior.
> 
> Andrew, this is utter crap and unrelated to this work. The main thing here 
> is to generalize something that various arches already do and to avoid the 
> page struct handling collisions. You use pie-in-the-sky to argue against 
> consolidating code and fixing up usage conflicts of the slab with arch 
> code?

It is not pie-in-the-sky to ask "is this code still useful?".



Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Zachary Amsden

Jeremy Fitzhardinge wrote:
> For example, say we wanted to put a general call for sti into entry.S,
> where it's expected it won't touch any registers.  In that case, we'd
> have a sequence like:
>
> push %eax
> push %ecx
> push %edx
> call paravirt_cli
> pop %edx
> pop %ecx
> pop %eax
>
> If we parse the relocs, then we'd find the reference to paravirt_cli.
> If we look at the byte before and see 0xe8, then we can see if it's a
> call.  If we then work out in each direction and see matched push/pops,
> then we know what registers can be trashed in the call.  This also
> allows us to determine the callsite size, and therefore how much space
> we need for inlining.

No, that is a very dangerous suggestion.  You absolutely *cannot* do 
this safely without explicitly marking the start EIP of this code.  You 
*must* use metadata to do that.  It is never safe to disassemble 
backwards or "rewind" EIP for x86 code.


Zach


Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Matt Mackall
On Tue, Mar 20, 2007 at 01:42:46AM +0100, Thomas Gleixner wrote:
> On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:
> > > > If a static volume is simply a non-dynamic volume, then device mapper
> > > > can do that too. And countless other things. Which is not an aside.
> > > > UBI growing to do all the things that device mapper does is exactly
> > > > the thing we should be seeking to avoid.
> > > 
> > > No it can't and device mapper sits on top of block devices. FLASH is no
> > > block device. Period.
> > 
> > Which of the following two properties does it lack?
> > 
> > - discrete blocks
> > - non-sequential access to blocks
> > 
> > When you do the obvious s/blocks/eraseblocks/, this appears to be
> > true.
> 
> It appears to be, but it is not. You enforce semantics on a device,
> which it does not have.
> 
> > Saying "but I can't do I/O smaller than the blocksize" doesn't change
> > this any more than it would for disks.
> 
> There is a huge difference. Disk block size is 512 byte and FLASH block
> size is min 16KiB and up to 256KiB.
> 
> Just do the math:
> 
> Write sampling data streams in 2KiB chunks to your uber devicemapper on
> a 1GiB device with 64KiB erase block size:
> 
> Fine grained FLASH aware writes allow 32 chunks in a block without
> erasing the block.
> 
> Your method erases the block 32 times to write the same amount of data.

Sigh. That's the current /dev/mtdblock method, not my method. You're too
fixated on what you think I'm saying to hear what I'm saying.
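
For reference, the arithmetic in the quoted example can be written out. This is a hedged sketch of the two write strategies being compared, with hypothetical function names; it assumes the flash-aware path needs one erase per eraseblock it fills, and that the whole-block rewrite path erases once per chunk written.

```c
#include <assert.h>

/* Erases needed when every chunk write rewrites its whole eraseblock. */
static unsigned long erases_block_rewrite(unsigned long total_bytes,
                                          unsigned long chunk_bytes)
{
    assert(chunk_bytes > 0);
    /* one erase per chunk, rounding a partial chunk up */
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Erases needed when chunks are appended to successive pages of an
 * already erased block (assumption: one erase per eraseblock filled). */
static unsigned long erases_flash_aware(unsigned long total_bytes,
                                        unsigned long eraseblock_bytes)
{
    assert(eraseblock_bytes > 0);
    return (total_bytes + eraseblock_bytes - 1) / eraseblock_bytes;
}
```

Filling one 64KiB eraseblock with 2KiB chunks gives 32 erases for the first strategy and 1 for the second, which is the factor of 32 quoted above.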

> > Saying "but I can do smaller I/O efficiently in some circumstances"
> > also doesn't change it.
> 
> We can do it under _any_ circumstances and that _does_ change it.
> Implementing a clever block device layer on top of UBI is simple and
> would provide FLASH page sized I/O, i.e. 2Kib in the above example.

Yes. I know. I've written a complete (non-Linux) FTL. I know what's
entailed.
 
> > In historical UNIX, some tapes were block devices too. Because they
> > supported seek().
> 
> I'm impressed. How exactly are "some tapes" comparable to FLASH chips ?
> 
> Your next proposal is to throw away MTD-utils and use "mt" instead ?

Don't be an ass. I'm pointing out that not all block devices are disks.
 
> > > Device mapper can not provide a simple easy to decode scheme for boot
> > > loaders. We need to be able to boot out of 512 - 2048 byte of NAND FLASH
> > > and be able to find the kernel or second stage boot loader in this
> > > unordered device.
> > > 
> > > And no, fixed addresses do not work. Do you want to implement device
> > > mapper into your Initial Bootloader stage ?
> > 
> > This is exactly the same problem as booting on a desktop PC. But
> > somehow LILO manages. My first Linux box had a hell of a lot less disk
> > than the platform I bootstrapped (and wrote NAND drivers for) last
> > month had in NAND.
> 
> No, it is not. You get the absolute sector address of your second stage
> and this is a complete nobrainer. The translation is done in the DISK
> device.

LILO and friends manage to boot systems that use software RAID and
LVM. There are multiple methods. Some use block lists, some use tiny
boot partitions, etc. All of them are applicable to controllerless NAND.

> You simply ignore the fact, that inside each disk, USB Stick, CF-CARD,
> whatever - there is a more or less intelligent controller device, which
> does the mapping to the physical storage location. There is _NO_ such
> thing on a bare FLASH chip.

How many times do I have to tell you that I wrote a driver for
controllerless NAND just last month?

> How exactly does device mapper:
> 
> A) across device wear levelling ?

The same way UBI does, but encapsulated in a device mapper layer.

> B) dynamic partitioning for FLASH aware file systems ?

See above.

> C) across device wear levelling for FLASH aware file systems ?

See above.

> D) background bit-flip corrections (copying affected blocks and recycling
> the old one) ?

See above.

> E) allow position independent placement of the second stage bootloader ?

See way above to my LILO response.

> > > You need to implement a clever journalling block device
> > > emulator in order to keep the data alive and the FLASH not worn out
> > > within no time. You need the wear levelling, otherwise you can throw
> > > away your FLASH in no time.
> > 
> > And that's why it's in my picture.
> 
> Yes, it is in your picture, but:
> 
> 1) it excludes FLASH aware file systems and UBI does not.
> 2) your picture still does not explain how it achieves the above A),
> B), C), D) and E)
> 
> Your extra path for partitioning(4) and JFFS2 is just a weird hack,
> which makes your proposal completely absurd.

No, it's just there to show the flexibility of device mapper. But I have
the sneaking suspicion you have no idea how device mapper works.

In brief: device mapper takes one or more devices, applies a mapping
to them, and returns a new device. For example, take various spans of
/dev/hda1 and /dev/sda3 and 

Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 22:51 +0100, Stefan Prechtel wrote:
> 2007/3/19, Thomas Gleixner <[EMAIL PROTECTED]>:
> > On Mon, 2007-03-19 at 21:35 +0100, Stefan Prechtel wrote:
> > >CPU0   CPU1
> > >  0:  28289  0  local-APIC-edge-fasteio   timer
> > > ...
> > > LOC:  28237  28236
> > >
> > > after a read: (I hope that is this what you want :-)
> > >CPU0   CPU1
> > >   0:  30344  0  local-APIC-edge-fasteio   timer
> > > ...
> > > LOC:  30292  30291
> >
> > Is this with AC plugged in ? If yes, please provide the same numbers for
> > battery mode.
> 
> Yes. And here is the output for battery mode (2.6.20):
>CPU0   CPU1
>   0: 292153  0  local-APIC-edge-fasteio   timer
> LOC: 292114 292113
> 
>CPU0   CPU1
>   0: 293263  0  local-APIC-edge-fasteio   timer
> LOC: 293224 293223

Hmm. Can you please apply the following patch on top of 2.6.20 and
check, if the WARN_ON_ONCE triggers when you boot w/o AC plugged ?

Thanks,

tglx

Index: linux-2.6.20/arch/i386/kernel/apic.c
===
--- linux-2.6.20.orig/arch/i386/kernel/apic.c
+++ linux-2.6.20/arch/i386/kernel/apic.c
@@ -1174,6 +1174,8 @@ void switch_APIC_timer_to_ipi(void *cpum
cpumask_t mask = *(cpumask_t *)cpumask;
int cpu = smp_processor_id();
 
+   WARN_ON_ONCE(1);
+
if (cpu_isset(cpu, mask) &&
!cpu_isset(cpu, timer_bcast_ipi)) {
disable_APIC_timer();




Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Christoph Lameter
On Mon, 19 Mar 2007, Andrew Morton wrote:

> > +
> > +#ifdef CONFIG_QUICKLIST
> > +
> > +#ifndef CONFIG_NR_QUICK
> > +#define CONFIG_NR_QUICK 1
> > +#endif
> 
> No, please don't define config items like this.  Do it in Kconfig.

They can be set up in the arch specific Kconfig. Ok. I moved the 
#ifndef .. #endif into mm/Kconfig.

> These guys seem to have multiple callsites for ia64 at least and probably
> would benefit from being uninlined.

Then they would no longer be optimizable. Right now one can compile out 
the constructor / destructor support and provide a constant list number 
as well as constant gfp masks. This can be very small and benefit 
tremendously from inlining.

Many arches do not need some features and there are only a few call 
sites.

> > +void quicklist_check(int nr, void (*dtor)(void *));
> > +unsigned long quicklist_total_size(void);
> > +
> > +#else
> > +void quicklist_check(int nr, void (*dtor)(void *))
> > +{
> > +}
> > +
> > +unsigned long quicklist_total_size(void)
> > +{
> > +   return 0;
> > +}
> > +#endif
> 
> That obviously won't link and wasn't tested.  Making these static inline
> will help.

Hmmm... We could drop these completely. If an arch does not use 
quicklists then they should not be calling these.

> > +#include 
> > +#include 
> > +
> > +DEFINE_PER_CPU(struct quicklist, quicklist)[CONFIG_NR_QUICK];
> 
> If we uninline those big inlines, this can perhaps be made static.

Yeah but we want the inlines.

> 
> > +#define MIN_PAGES  25
> > +#define MAX_FREES_PER_PASS 16
> > +#define FRACTION_OF_NODE_MEM   16
> 
> Are these constants optimal for all architectures?

I added them as parameters to quicklist_trim so that an arch 
can specify their own settings.

> > +   return min(pages_to_free, (long)MAX_FREES_PER_PASS);
> > +}
> 
> min_t and max_t are the standard way of avoiding that warning.  Or stick a
> UL on the constants (which is probably better).

We do not need those since the constants are now parameters.
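
For readers unfamiliar with the idiom discussed above: min_t compares both operands after casting them to one explicitly named type, which is what avoids the mixed-type warning. The macro below is a simplified userspace approximation written for this note, not the kernel's definition (it evaluates its arguments more than once), and `cap_frees` is a hypothetical helper.

```c
#include <assert.h>

/* Simplified stand-in for min_t(): cast both sides to one type, then
 * compare.  Arguments may be evaluated twice, so avoid side effects. */
#define min_t(type, x, y) \
    ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

#define MAX_FREES_PER_PASS 16

/* Cap the number of pages freed in one pass. */
static long cap_frees(long pages_to_free)
{
    assert(MAX_FREES_PER_PASS > 0);
    return min_t(long, pages_to_free, MAX_FREES_PER_PASS);
}
```

Here the int constant and the long variable are both compared as long, so no cast is needed at the call site.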

> 
> > +void quicklist_check(int nr, void (*dtor)(void *))
> > +{
> > +   long pages_to_free;
> > +   struct quicklist *q;
> > +
> > +   q = &get_cpu_var(quicklist)[nr];
> > +   if (q->nr_pages > MIN_PAGES) {
> > +   pages_to_free = min_pages_to_free(q);
> > +
> > +   while (pages_to_free > 0) {
> > +   void *p = quicklist_alloc(nr, 0, NULL);
> > +
> > +   if (dtor)
> > +   dtor(p);
> > +   free_page((unsigned long)p);
> > +   pages_to_free--;
> > +   }
> > +   }
> > +   put_cpu_var(quicklist);
> > +}
> 
> The use of a literal 0 as a gfp_t is a bit ugly.  I assume that we don't
> care because we should never actually call into the page allocator for this
> caller.  But it's not terribly clear because there is no commentary
> describing what this function is supposed to do.

Right. Will add comments.
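
As a rough illustration of what the trimming function is meant to do, here is a userspace model. It is a sketch under assumptions, not the patch itself: `struct quicklist_model`, `ql_push`, `ql_pop` and `ql_trim` are hypothetical names, `malloc`/`free` stand in for the page allocator, and the way the free count is derived (pages above `min_pages`, capped at `max_free`) is inferred from the quoted code and the parameter names in the fixup patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Model of a quicklist: free pages are chained through their first
 * word, with the list head and a page count kept alongside. */
struct quicklist_model {
    void *page;
    long nr_pages;
};

static void ql_push(struct quicklist_model *q, void *p)
{
    *(void **)p = q->page;      /* link the page in front of the list */
    q->page = p;
    q->nr_pages++;
}

static void *ql_pop(struct quicklist_model *q)
{
    void *p = q->page;

    if (p) {
        q->page = *(void **)p;  /* unlink the head */
        q->nr_pages--;
    }
    return p;
}

/* Give pages back once the list holds more than min_pages, freeing at
 * most max_free pages per call.  Returns the number of pages freed. */
static long ql_trim(struct quicklist_model *q, void (*dtor)(void *),
                    long min_pages, long max_free)
{
    long pages_to_free = q->nr_pages - min_pages;
    long freed = 0;

    assert(min_pages >= 0 && max_free >= 0);
    if (pages_to_free > max_free)
        pages_to_free = max_free;

    while (pages_to_free-- > 0) {
        void *p = ql_pop(q);

        if (!p)
            break;
        if (dtor)
            dtor(p);
        free(p);
        freed++;
    }
    return freed;
}
```

Because this model pops straight from the list, it never needs an allocation flag for the trimming path, which sidesteps the literal 0 gfp_t question raised above.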

> The name foo_check() is unfortunate: it implies that the function checks
> something (ie: has no side-effects).  But this function _does_ change
> things and perhaps should be called quicklist_trim() or something like
> that.

Tradition. Dave initially named it check_pgt_cache it seems.
 
> This function lacks any commentary, but I was able to work it out.  I
> think.  Some nice comments would be, umm, nice.

ok. Here is a fixup patch:

Index: linux-2.6.21-rc3-mm2/include/linux/quicklist.h
===
--- linux-2.6.21-rc3-mm2.orig/include/linux/quicklist.h 2007-03-19 17:41:42.0 -0700
+++ linux-2.6.21-rc3-mm2/include/linux/quicklist.h  2007-03-19 17:47:34.0 -0700
@@ -13,10 +13,6 @@
 
 #ifdef CONFIG_QUICKLIST
 
-#ifndef CONFIG_NR_QUICK
-#define CONFIG_NR_QUICK 1
-#endif
-
 struct quicklist {
void *page;
int nr_pages;
@@ -77,18 +73,11 @@ static inline void quicklist_free(int nr
put_cpu_var(quicklist);
 }
 
-void quicklist_check(int nr, void (*dtor)(void *));
-unsigned long quicklist_total_size(void);
+void quicklist_trim(int nr, void (*dtor)(void *),
+   unsigned long min_pages, unsigned long max_free);
 
-#else
-void quicklist_check(int nr, void (*dtor)(void *))
-{
-}
+unsigned long quicklist_total_size(void);
 
-unsigned long quicklist_total_size(void)
-{
-   return 0;
-}
 #endif
 
 #endif /* LINUX_QUICKLIST_H */
Index: linux-2.6.21-rc3-mm2/mm/Kconfig
===
--- linux-2.6.21-rc3-mm2.orig/mm/Kconfig 2007-03-19 17:41:42.0 -0700
+++ linux-2.6.21-rc3-mm2/mm/Kconfig 2007-03-19 17:42:49.0 -0700
@@ -220,3 +220,7 @@ config DEBUG_READAHEAD
 
  Say N for production servers.
 
+config NR_QUICK
+   depends on QUICKLIST
+   default 1
+
Index: linux-2.6.21-rc3-mm2/mm/quicklist.c
===
--- linux-2.6.21-rc3-mm2.orig/mm/quicklist.c 2007-03-19 17:41:42.0 -0700
+++ 

Re: [RFC][PATCH] split file and anonymous page queues #2

2007-03-19 Thread Rik van Riel

Rik van Riel wrote:

Split the anonymous and file backed pages out onto their own pageout
queues.  This way we do not unnecessarily churn through lots of anonymous
pages when we do not want to swap them out anyway.



Please take this patch for a spin and let me know what goes well
and what goes wrong.


In order to make testing easier, I have put some kernel RPMs
up on http://people.redhat.com/riel/vmsplit/

Any benchmark results are welcome, especially bad ones.
I want to make sure this thing runs as well as the current
VM in every situation, while also fixing the problems described
in my previous mail.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Christoph Lameter
On Mon, 19 Mar 2007, Andrew Morton wrote:

> > See the patch. We are only touching 2 cachelines instead of 32. So even 
> > without considering the page allocator overhead and the slab allocator 
> > overhead (which will make the situation even better) it's superior.
> 
> That's not proof, it is handwaving.  I could wave right back at you and
> claim that the benefit from returning a cache-hot pte page back to the page
> allocator for reuse exceeds the benefit which you waved at me above.

No you cannot make that claim. That would mean that you have to touch 
32 pages which is inferior.
  
> You may well be right, but nothing is proven, afaict.

Nothing can be proven except within a rigorously defined mathematical 
system but even there we are limited by such things as Russell's paradox.

It's obvious that this is right. And there has been significant work 
invested into retaining page table pages on i386, sparc64 and ia64 for 
exactly the reason specified. This patch does not change that at all for these 3 
arches. There is no doubt about the correctness of the approach here.

> > You do not think that our current way of handling ptes is okay? If we do 
> > not zero the ptes then we need to separate munmap from process shutdown.
> 
> Yep.  It's possible that process shutdown is a sufficiently common and
> costly special-case for it to be worth special-casing.

Ok great idea but what does this have to do with this patch? This patch 
simply generalizes something that has been there for ages.

> > The advantage of the quicklists is that it does not require a rework of 
> > the pte serialization.
> 
> No, these are unrelated.  We can get pte pages from the page allocator and
> zero them without touching the munmap handling.
> 
> But it's possible that if we _were_ to optimise the munmap handling as
> suggested, the end result would be superior.

Andrew, this is utter crap and unrelated to this work. The main thing here 
is to generalize something that various arches already do and to avoid the 
page struct handling collisions. You use pie-in-the-sky to argue against 
consolidating code and fixing up usage conflicts of the slab with arch 
code?


Re: chrdev_open lifetime question

2007-03-19 Thread Pete Zaitcev
On Wed, 7 Mar 2007 17:23:05 -0500, "Dmitry Torokhov" <[EMAIL PROTECTED]> wrote:

> It seems that if a process keeps a character device open then other
> processes will also be able to get into filp->f_op->open(inode,filp)
> in chrdev_open() even after a driver called cdev_del() as part of its
> unwind procedure. Is this correct or am I missing something?

I see no replies in the archives. Have you got any private ones?

Also, what's the context?

-- Pete


Re: [patch 1/13] signal/timer/event fds v7 - anonymous inode source ...

2007-03-19 Thread Thomas Gleixner
Davide,

On Mon, 2007-03-19 at 16:47 -0700, Davide Libenzi wrote:
> This patch adds an anonymous inode source, to be used for files that need
> an inode only in order to create a file*. We do not care about having an
> inode for each file, and we do not even care about having different names in
> the associated dentries (dentry names will be the same for classes of file*).
> This allows code reuse, and will be used by epoll, signalfd and timerfd
> (and whatever else there'll be).
>
> +int aino_getfd(int *pfd, struct inode **pinode, struct file **pfile,
> +char const *name, const struct file_operations *fops, void *priv)
> +{
> + struct qstr this;
> + struct dentry *dentry;
> + struct inode *inode;
> + struct file *file;
> + int error, fd;
> +
> + error = -ENFILE;
> + file = get_empty_filp();
> + if (!file)
> + goto eexit_1;

make this "return -ENFILE;" please

> + inode = aino_getinode();
> + if (IS_ERR(inode)) {
> + error = PTR_ERR(inode);
> + goto eexit_2;

Can you please use a bit more descriptive labels ?

e.g:
goto out_filp;

> + }
> +
> + error = get_unused_fd();
> + if (error < 0)
> + goto eexit_3;

e.g:
goto out_inode;

> + fd = error;
> +
> + /*
> +  * Link the inode to a directory entry by creating a unique name
> +  * using the inode sequence number.
> +  */
> + error = -ENOMEM;
> + this.name = name;
> + this.len = strlen(name);
> + this.hash = 0;
> + dentry = d_alloc(aino_mnt->mnt_sb->s_root, &this);
> + if (!dentry)
> + goto eexit_4;

e.g:

goto out_fd;
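
A small userspace sketch of the labelling style suggested here, under the assumption that three resources are acquired in order and released in reverse on failure. The struct, the function name and the fail_at switch are hypothetical and exist only so that each error path can be exercised; malloc() and free() stand in for the kernel allocation calls.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical bundle standing in for the file / inode / dentry trio. */
struct triple {
    void *filp;
    void *inode;
    void *dentry;
};

/*
 * Acquire three resources in order. fail_at selects which step fails
 * (1..3); 0 means no forced failure. Each label names the resource
 * that still has to be released when control reaches it.
 */
static int triple_get(struct triple *t, int fail_at)
{
    int error;

    t->filp = (fail_at == 1) ? NULL : malloc(16);
    if (!t->filp)
        return -ENFILE;         /* nothing to undo yet */

    t->inode = (fail_at == 2) ? NULL : malloc(16);
    if (!t->inode) {
        error = -ENOMEM;
        goto out_filp;
    }

    t->dentry = (fail_at == 3) ? NULL : malloc(16);
    if (!t->dentry) {
        error = -ENOMEM;
        goto out_inode;
    }
    return 0;

out_inode:
    free(t->inode);
    t->inode = NULL;
out_filp:
    free(t->filp);
    t->filp = NULL;
    return error;
}
```

The first failure returns directly because there is nothing to clean up, which matches the "return -ENFILE;" request above; later failures fall through the labels in reverse order of acquisition.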


> +static int ainofs_delete_dentry(struct dentry *dentry)
> +{
> + /*
> +  * We faked vfs to believe the dentry was hashed when we created it.
> +  * Now we restore the flag so that dput() will work correctly.
> +  */
> + dentry->d_flags |= DCACHE_UNHASHED;
> + return 1;
> +}

Please put either "struct ainofs_dentry_operations ..." below the next
function or move ainofs_delete_dentry() above "struct
ainofs_dentry_operations ..."

It's annoying to look up the prototypes and implementation back and forth.

> +static struct inode *aino_getinode(void)
> +{
> + return igrab(aino_inode);
> +}

Please use "igrab(aino_inode);" directly in this one single place above.
That saves us a prototype and a useless static function with no value.

> +/*
> + * A single inode exists for all aino files. Unlike pipes, aino inodes
> + * have no per-instance data associated, so we can avoid allocating
> + * multiple of them.
> + */
> +static struct inode *aino_mkinode(void)
> +{
> + int error = -ENOMEM;
> + struct inode *inode = new_inode(aino_mnt->mnt_sb);
> +
> + if (!inode)
> + goto eexit_1;

return ERR_PTR(-ENOMEM);

> + inode->i_fop = _fops;
> +}
> +
> +static int ainofs_get_sb(struct file_system_type *fs_type, int flags,
> +  const char *dev_name, void *data, struct vfsmount *mnt)
> +{
> + return get_sb_pseudo(fs_type, "aino:", NULL, AINOFS_MAGIC, mnt);
> +}

Please put either "struct file_system_type aino_fs_typ ..." below this
function or move ainofs_get_sb() above "struct file_system_type
aino_fs_typ ..."

> +static int __init aino_init(void)
> +{
> +
> + if (register_filesystem(&aino_fs_type))
> + goto epanic;
> +
> + aino_mnt = kern_mount(&aino_fs_type);
> + if (IS_ERR(aino_mnt))
> + goto epanic;
> +
> + aino_inode = aino_mkinode();
> + if (IS_ERR(aino_inode))
> + goto epanic;
> +
> + return 0;
> +
> +epanic:
> + panic("aino_init() failed\n");

Panic ? It's not life critical - is it ? 

A printk(KERN_ERR...) and a return -Exx would be sufficient.

tglx





Re: Suspend to RAM doesn't work anymore in 2.6.21

2007-03-19 Thread Tobias Doerffel
On Monday 19 March 2007 22:43:20 you wrote:
> Hi,
>
> On Monday, 19 March 2007 13:50, Tobias Doerffel wrote:
> > Hi,
> >
> > Suspend to RAM used to work fine on my computer (Intel Core Duo, 1 GB
> > RAM, Intel 82801G (ICH7-chipset) mainboard, NVIDIA-gfx-card,
> > tg3-ethernet) up to 2.6.20.3. But no matter which rc of 2.6.21 I use,
> > suspend to RAM doesn't work anymore. Up to rc3 even suspending stopped at
> > "suspending console" which apparently seems to be fixed in rc4. I tried
> > rc4-git4 with minimal config (no dynticks, no HRT, no MSI, no sound, no
> > bluetooth, no PCMCIA, no WLAN, no USB, no cpufreq) but still I can't
> > resume properly. Caps works and I can login through SSH. Back to a more
> > complete config (sound, MMC, WLAN, PCMCIA - still no dynticks or HRT -
> > see attachment "config") I get exactly the same behaviour.
> >
> > When logged in through SSH after resume I saved output of dmesg (which
> > includes full power management debug messages), see
> > attachment "dmesg-resume". The system basically seems to be back but lot
> > of things do not work such as loading/unloading e.g. my WLAN-driver
> > (ipw3945), running "top" or "dstat" etc.   "uptime" always returns 0 min,
> > even with power management debug disabled.
> >
> > Kernel:
> > Linux version 2.6.21-rc4 (gcc version 4.1.2 20061115 (prerelease) (Debian
> > 4.1.1-21)) #23 SMP PREEMPT Mon Mar 19 12:27:56 CET 2007

I investigated this issue further. A complete bisect between 2.6.20 and
2.6.21-rc4-git4 stops at a commit
(a4bbb810dedaecf74d54b16b6dd3c33e95e1024c) where I am no longer able to compile
the kernel because of ACPI-related compile errors in arch/i386/kernel/setup.c.
I stepped back a few revisions until it compiled again, and resume did not
work there either.

So I started all over again, bisecting only arch/i386, and ended up at
ceb6c46839021d5c7c338d48deac616944660124 as the bad commit. But this commit
seems to be some kind of finalization of a series of patches ("ACPICA: Remove
duplicate table manager"), so I guess it's hard to debug this thing...

> Can you please do
>
> # echo test > /sys/power/disk
> # echo disk > /sys/power/state
>
> (the system should freeze tasks, suspend devices, disable nonboot CPUs,
> wait for 5 seconds, enable nonboot CPUs, resume devices, thaw tasks and
> return to your command prompt) and see if you can reproduce the problem?

Same problem here. It works fine in 2.6.20 as well as in revisions before
ceb6c46839021d5c7c338d48deac616944660124. It doesn't work on the recent
2.6.21-rc4-git4.

Any more information I can give?

Tobias




Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Andrew Morton
On Mon, 19 Mar 2007 17:44:28 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Mon, 19 Mar 2007, Andrew Morton wrote:
> 
> > Please provide proof that quicklists are superior to simply going direct to
> > the page allocator for these pages.
> 
> See the patch. We are only touching 2 cachelines instead of 32. So even 
> without considering the page allocator overhead and the slab allocator 
> overhead (which will make the situation even better) it's superior.

That's not proof, it is handwaving.  I could wave right back at you and
claim that the benefit from returning a cache-hot pte page back to the page
allocator for reuse exceeds the benefit which you waved at me above.

You may well be right, but nothing is proven, afaict.

> > > I doubt it. The zeroing is a by product of our way of serializing pte 
> > > handling. It's going to be difficult to change that.
> > 
> > Nick didn't think so, and I don't see the problem either.
> 
> You do not think that our current way of handling ptes is okay? If we do 
> not zero the ptes then we need to separate munmap from process shutdown.

Yep.  It's possible that process shutdown is a sufficiently common and
costly special-case for it to be worth special-casing.

> > We'll save on some bus traffic by avoiding the writeback, but how much
> > effect that will have we don't know.  Presumably little.
> 
> The advantage of the quicklists is that it does not require a rework of 
> the pte serialization.

No, these are unrelated.  We can get pte pages from the page allocator and
zero them without touching the munmap handling.

But it's possible that if we _were_ to optimise the munmap handling as
suggested, the end result would be superior.



Re: 2.6.21-rc3-mm2

2007-03-19 Thread Randy Dunlap
On Mon, 19 Mar 2007 17:39:15 -0700 Andrew Morton wrote:

> On Mon, 19 Mar 2007 17:27:11 -0700
> Randy Dunlap <[EMAIL PROTECTED]> wrote:
> 
> > On Wed, 7 Mar 2007 20:19:15 -0800 Andrew Morton wrote:
> > 
> > > 
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc3/2.6.21-rc3-mm2/
> > > 
> > > - This is the same as 2.6.21-rc3-mm1, except Con's CPU scheduler changes
> > >   were dropped.
> > > 
> > >   This is for A/B comparison purposes, and because those changes crashed on
> > >   one test setup.
> > 
> > I don't quite see why this error is happening.  Looks like all
> > the nested #includes should handle it...
> > 
> > CONFIG_KEXEC=y
> > CONFIG_CRASH_DUMP=y
> > CONFIG_UTRACE=y
> > # PTRACE=n
> > # PROC_FS=n
> > 
> > In file included from arch/x86_64/kernel/crash.c:19:
> > include/linux/elfcore.h: In function 'elf_core_copy_regs':
> > include/linux/elfcore.h:103: error: dereferencing pointer to incomplete type
> > include/linux/elfcore.h:103: error: dereferencing pointer to incomplete type
> > make[1]: *** [arch/x86_64/kernel/crash.o] Error 1
> > make: *** [arch/x86_64/kernel] Error 2
> 
> Perhaps it's complaining about undefined pt_regs.  But it's there in asm/ptrace.h
> which is included by linux/ptrace.h.  Perhaps there's an include snafu which is
> causing that inclusion to not work.
> 
> Dunno.  Please send full .config to Roland ;)

attached.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***



config-ptrace-elfcore
Description: Binary data


[RFC][PATCH] split file and anonymous page queues #2

2007-03-19 Thread Rik van Riel

Split the anonymous and file backed pages out onto their own pageout
queues.  This way we do not unnecessarily churn through lots of anonymous
pages when we do not want to swap them out anyway.

This should (with additional tuning) be a great step forward in
scalability, allowing Linux to run well on very large systems where
scanning through the anonymous memory (on our way to the page cache
memory we do want to evict) is slowing systems down significantly.

This patch has been stress tested and seems to work, but has not
been fine tuned or benchmarked yet.  For now the swappiness parameter
can be used to tweak swap aggressiveness up and down as desired, but
in the long run we may want to simply measure IO cost of page cache
and anonymous memory and auto-adjust.

We apply pressure to each set of pageout queues based on:
- the size of each queue
- the fraction of recently referenced pages in each queue,
   not counting used-once file pages
- swappiness (file IO is more efficient than swap IO)
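
To make the three factors above concrete, here is a purely illustrative sketch of how they could be folded into a single split of scanning effort between the anon and file queues. This is not the patch's get_scan_ratio(): the struct, the field names, the "200 - swappiness" weighting for file pages and the percentage arithmetic are all assumptions made for the example, and the special handling of used-once file pages is left out.

```c
#include <stdint.h>

/* Hypothetical per-queue statistics; names are illustrative only. */
struct lru_stats {
    uint64_t size;        /* pages on the queue */
    uint64_t scanned;     /* pages recently scanned */
    uint64_t referenced;  /* of those, pages found recently referenced */
};

/*
 * Weight of one queue: larger queues and queues with fewer recently
 * referenced pages attract more pressure; prio scales the result.
 */
static uint64_t queue_weight(const struct lru_stats *s, uint64_t prio)
{
    /* Percentage of scanned pages that were NOT referenced; the +1
     * keeps the division defined when nothing has been scanned yet. */
    uint64_t unref_pct = 100 - (100 * s->referenced) / (s->scanned + 1);

    return s->size * unref_pct * prio;
}

/*
 * Share of scanning (0..100) to direct at the anon queues.
 * Assumed convention: swappiness in 0..200, where 100 treats anon and
 * file pages alike and 0 never scans anon pages.
 */
static unsigned int anon_scan_percent(const struct lru_stats *anon,
                                      const struct lru_stats *file,
                                      unsigned int swappiness)
{
    uint64_t aw = queue_weight(anon, swappiness);
    uint64_t fw = queue_weight(file, 200 - swappiness);

    if (aw + fw == 0)
        return 0;
    return (unsigned int)(100 * aw / (aw + fw));
}
```

In this sketch, equal queues at swappiness 100 split the scanning evenly, and a heavily referenced anon queue drags its share down even when swappiness is neutral.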

Please take this patch for a spin and let me know what goes well
and what goes wrong.

More info on the patch can be found on:

http://linux-mm.org/PageReplacementDesign

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>

Changelog:
- Fix page_anon() to put all the file pages really on the
  file list.
- Fix get_scan_ratio() to return more stable numbers, by
  properly keeping track of the scanned anon and file pages.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
--- linux-2.6.20.x86_64/fs/proc/proc_misc.c.vmsplit	2007-03-19 12:00:11.0 -0400
+++ linux-2.6.20.x86_64/fs/proc/proc_misc.c	2007-03-19 12:00:23.0 -0400
@@ -147,43 +147,47 @@ static int meminfo_read_proc(char *page,
 	 * Tagged format, for easy grepping and expansion.
 	 */
 	len = sprintf(page,
-		"MemTotal:     %8lu kB\n"
-		"MemFree:      %8lu kB\n"
-		"Buffers:      %8lu kB\n"
-		"Cached:       %8lu kB\n"
-		"SwapCached:   %8lu kB\n"
-		"Active:       %8lu kB\n"
-		"Inactive:     %8lu kB\n"
+		"MemTotal:       %8lu kB\n"
+		"MemFree:        %8lu kB\n"
+		"Buffers:        %8lu kB\n"
+		"Cached:         %8lu kB\n"
+		"SwapCached:     %8lu kB\n"
+		"Active(anon):   %8lu kB\n"
+		"Inactive(anon): %8lu kB\n"
+		"Active(file):   %8lu kB\n"
+		"Inactive(file): %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
-		"HighTotal:    %8lu kB\n"
-		"HighFree:     %8lu kB\n"
-		"LowTotal:     %8lu kB\n"
-		"LowFree:      %8lu kB\n"
-#endif
-		"SwapTotal:    %8lu kB\n"
-		"SwapFree:     %8lu kB\n"
-		"Dirty:        %8lu kB\n"
-		"Writeback:    %8lu kB\n"
-		"AnonPages:    %8lu kB\n"
-		"Mapped:       %8lu kB\n"
-		"Slab:         %8lu kB\n"
-		"SReclaimable: %8lu kB\n"
-		"SUnreclaim:   %8lu kB\n"
-		"PageTables:   %8lu kB\n"
-		"NFS_Unstable: %8lu kB\n"
-		"Bounce:       %8lu kB\n"
-		"CommitLimit:  %8lu kB\n"
-		"Committed_AS: %8lu kB\n"
-		"VmallocTotal: %8lu kB\n"
-		"VmallocUsed:  %8lu kB\n"
-		"VmallocChunk: %8lu kB\n",
+		"HighTotal:      %8lu kB\n"
+		"HighFree:       %8lu kB\n"
+		"LowTotal:       %8lu kB\n"
+		"LowFree:        %8lu kB\n"
+#endif
+		"SwapTotal:      %8lu kB\n"
+		"SwapFree:       %8lu kB\n"
+		"Dirty:          %8lu kB\n"
+		"Writeback:      %8lu kB\n"
+		"AnonPages:      %8lu kB\n"
+		"Mapped:         %8lu kB\n"
+		"Slab:           %8lu kB\n"
+		"SReclaimable:   %8lu kB\n"
+		"SUnreclaim:     %8lu kB\n"
+		"PageTables:     %8lu kB\n"
+		"NFS_Unstable:   %8lu kB\n"
+		"Bounce:         %8lu kB\n"
+		"CommitLimit:    %8lu kB\n"
+		"Committed_AS:   %8lu kB\n"
+		"VmallocTotal:   %8lu kB\n"
+		"VmallocUsed:    %8lu kB\n"
+		"VmallocChunk:   %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
 		K(cached),
 		K(total_swapcache_pages),
-		K(global_page_state(NR_ACTIVE)),
-		K(global_page_state(NR_INACTIVE)),
+		K(global_page_state(NR_ACTIVE_ANON)),
+		K(global_page_state(NR_INACTIVE_ANON)),
+		K(global_page_state(NR_ACTIVE_FILE)),
+		K(global_page_state(NR_INACTIVE_FILE)),
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
 		K(i.freehigh),
--- linux-2.6.20.x86_64/fs/mpage.c.vmsplit	2007-02-04 13:44:54.0 -0500
+++ linux-2.6.20.x86_64/fs/mpage.c	2007-03-19 12:00:23.0 -0400
@@ -408,12 +408,12 @@ mpage_readpages(struct address_space *ma
 					&first_logical_block,
 					get_block);
 			if (!pagevec_add(&lru_pvec, page))
-				__pagevec_lru_add(&lru_pvec);
+				__pagevec_lru_add_file(&lru_pvec);
 		} else {
 			page_cache_release(page);
 		}
 	}
-	pagevec_lru_add(&lru_pvec);
+	pagevec_lru_add_file(&lru_pvec);
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		mpage_bio_submit(READ, bio);
--- linux-2.6.20.x86_64/fs/cifs/file.c.vmsplit	2007-03-19 12:00:10.0 -0400
+++ linux-2.6.20.x86_64/fs/cifs/file.c	2007-03-19 12:00:23.0 -0400
@@ -1746,7 +1746,7 @@ static void cifs_copy_cache_pages(struct
 		SetPageUptodate(page);
 		unlock_page(page);
 		if (!pagevec_add(plru_pvec, page))
-			__pagevec_lru_add(plru_pvec);
+			

UDP packets scheduling

2007-03-19 Thread Lukas Hejtmanek
Hello,

can anyone suggest a proper way to schedule UDP packets for transmission at
some given rate?

E.g., I have two boxes both having 10 GE interfaces. One box is able to
transmit at 9.9Gbps, the other one is able to receive only at about 5.5Gbps.
Flow control must be turned off for some other reason.

How can I put a delay between subsequent message sends to achieve the desired
packet rate without losses, e.g., 3.5Gbps without bursts? Even nanosleep()
with the lowest possible delay seems to be too much delay. A busy loop with
clock_gettime(3) works OK on SMP boxes, but on UP it causes problems.
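
To put numbers on the question, here is a small sketch of the arithmetic behind pacing: the gap between packet starts for a given payload size and target rate, and an absolute deadline that is advanced by that gap so timing errors do not accumulate. The function names are hypothetical, the 1500-byte payload in the usage note is an assumed example value, and the busy-wait mirrors the clock_gettime() loop described above rather than offering a fix for the UP case.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdint.h>
#include <time.h>

/* Nanoseconds between packet starts for a payload size and bit rate. */
static uint64_t pacing_gap_ns(uint64_t packet_bytes, uint64_t rate_bps)
{
    return packet_bytes * 8u * 1000000000ull / rate_bps;
}

/* Advance an absolute deadline by gap_ns, keeping tv_nsec normalised. */
static void deadline_advance(struct timespec *ts, uint64_t gap_ns)
{
    uint64_t nsec = (uint64_t)ts->tv_nsec + gap_ns;

    ts->tv_sec += (time_t)(nsec / 1000000000ull);
    ts->tv_nsec = (long)(nsec % 1000000000ull);
}

/* Spin on the monotonic clock until the absolute deadline has passed. */
static void pace_busy_wait(const struct timespec *deadline)
{
    struct timespec now;

    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while (now.tv_sec < deadline->tv_sec ||
             (now.tv_sec == deadline->tv_sec &&
              now.tv_nsec < deadline->tv_nsec));
}
```

For 1500-byte payloads at 3.5Gbps the gap works out to roughly 3.4 microseconds, which gives a sense of how fine-grained the delay between sends has to be.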

-- 
Lukáš Hejtmánek


Re: [PATCH RESEND 1/1] crypto API: RSA algorithm patch (kernel version 2.6.20.1)

2007-03-19 Thread Francois Romieu
Tasos Parisinos <[EMAIL PROTECTED]> :
[...]

RSA is slow. syscalls are fast.

Which part of the kernel is supposed to benefit from this code ?

-- 
Ueimor


Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Christoph Lameter
On Mon, 19 Mar 2007, Andrew Morton wrote:

> Please provide proof that quicklists are superior to simply going direct to
> the page allocator for these pages.

See the patch. We are only touching 2 cachelines instead of 32. So even 
without considering the page allocator overhead and the slab allocator 
overhead (which will make the situation even better) it's superior.

> > I doubt it. The zeroing is a by product of our way of serializing pte 
> > handling. It's going to be difficult to change that.
> 
> Nick didn't think so, and I don't see the problem either.

You do not think that our current way of handling ptes is okay? If we do 
not zero the ptes then we need to separate munmap from process shutdown.

> We'll save on some bus traffic by avoiding the writeback, but how much
> effect that will have we don't know.  Presumably little.

The advantage of the quicklists is that it does not require a rework of 
the pte serialization.



Re: 2.6.21-rc3-mm2

2007-03-19 Thread Andrew Morton
On Mon, 19 Mar 2007 17:27:11 -0700
Randy Dunlap <[EMAIL PROTECTED]> wrote:

> On Wed, 7 Mar 2007 20:19:15 -0800 Andrew Morton wrote:
> 
> > 
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc3/2.6.21-rc3-mm2/
> > 
> > - This is the same as 2.6.21-rc3-mm1, except Con's CPU scheduler changes
> >   were dropped.
> > 
> >   This is for A/B comparison purposes, and because those changes crashed on
> >   one test setup.
> 
> I don't quite see why this error is happening.  Looks like all
> the nested #includes should handle it...
> 
> CONFIG_KEXEC=y
> CONFIG_CRASH_DUMP=y
> CONFIG_UTRACE=y
> # PTRACE=n
> # PROC_FS=n
> 
> In file included from arch/x86_64/kernel/crash.c:19:
> include/linux/elfcore.h: In function 'elf_core_copy_regs':
> include/linux/elfcore.h:103: error: dereferencing pointer to incomplete type
> include/linux/elfcore.h:103: error: dereferencing pointer to incomplete type
> make[1]: *** [arch/x86_64/kernel/crash.o] Error 1
> make: *** [arch/x86_64/kernel] Error 2

Perhaps it's complaining about undefined pt_regs.  But it's there in asm/ptrace.h
which is included by linux/ptrace.h.  Perhaps there's an include snafu which is
causing that inclusion to not work.

Dunno.  Please send full .config to Roland ;)


Re: [patch 6/13] signal/timer/event fds v7 - timerfd core ...

2007-03-19 Thread Eric Dumazet

Davide Libenzi wrote:


+struct timerfd_ctx {
+   struct hrtimer tmr;
+   ktime_t tintv;
+   spinlock_t lock;
+   wait_queue_head_t wqh;
+   unsigned long ticks;
+};



+static struct kmem_cache *timerfd_ctx_cachep;



+   timerfd_ctx_cachep = kmem_cache_create("timerfd_ctx_cache",
+   sizeof(struct timerfd_ctx),
+   0, SLAB_PANIC, NULL, NULL);



Do we really expect thousands of active timerfd_ctx ?

If not, using kmalloc()/kfree() would be fine, because sizeof(struct 
timerfd_ctx) is so small.


On SMP/NUMA platforms, each new kmem_cache is rather expensive (memory
allocated at kmem_cache_create(), but also memory used when the cache is not
empty, with slabs in the freelist for each cpu/node).


Using a general cache might be cheaper: no memory overhead for yet another
kmem_cache.


I know individual caches are good for spotting memory leaks, but in the timerfd
case, you don't have memory leaks, do you? :)






Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 16:36 -0500, Matt Mackall wrote:
> On Mon, Mar 19, 2007 at 11:06:33PM +0200, Artem Bityutskiy wrote:
> > On Mon, 2007-03-19 at 14:54 -0500, Matt Mackall wrote:
> > > The issue is 14000 lines of patch to make a parallel subsystem.
> > 
> > Parallel systems have existed for a very long time. One is
> > flash->SW_or_HW_FTL->all_blkdev_stuff. The other is MTD->JFFS2. Think
> > about _why_ there are 2 of them. Hint - reliability, performance. Your
> > ranting basically says that only the first one makes sense. This is not
> > true.
> 
> A better way would be for MTD to deliver a block dev with a rich
> enough interface for JFFS2 to use efficiently in the first place. Yes,
> I know that can't be done with the current block dev layer. But that's
> what the source is for.

Why the hell would JFFS2 need a block device interface ? 

What's the gain ?

> > We enhance the second branch, not the first, please, realize this. Both
> > branches have their user base, and have always had.
> > 
> > >  iSCSI/nbd(6)
> > >   |
> > > filesystem {swap  |  ext3ext3 jffs2
> > >   \   |   ||   /
> > >/   \  | dm-crypt->snapshot(5) /
> > > device mapper -|\ \   |  /
> > >| partitioning   /
> > >|  |  partitioning(4)
> > >|wear leveling(3)  /
> > >|  |  /
> > >|  block concatenation
> > >|   ||| |
> > >\  bad block remapping(2)   
> > >||| |
> > > MTD raw block { raw block devices with no smarts(1)
> > >   / | \  \
> > > hardware { NANDNAND   NAND   NAND
> > 
> > Matt, as I pointed in the first mail, flash != block device. 
> 
> And as I pointed out, you're wrong. It is both block oriented
> (eraseBLOCK??) and random access. That's what a block device is. The
> fact that it doesn't look like the other things that Linux currently
> calls a block device and supports well is another matter.

It very much matters, as it is not a block device. It is a FLASH device
and you can make as many comparisons with eraseBLOCK as you want; that does
not turn FLASH into a DISK.

Again: Disks (including CF-Cards and USB-Sticks) have intelligent
controllers, which abstract the hardware oddities away and present you a
block device.

> > In your picture I see NAND->MTD raw block. So am I right that you
> > assume that we already have a decent FTL? The fact is that we do
> > not.
> 
> No. Look at the picture for more than two seconds, please. 
> 
> I can tell you didn't do this because you didn't manage to find (1)
> which explicitly says "with no smarts". And you also cut out the footnote
> where I explained what I meant by "with no smarts".
>
> Find the spots marked (2) and (3). These are your FTL. 

And where please are (2) and (3) inside of device mapper ?

> > Please, bear in mind that decent FTL is difficult and an FS on top of
> > FTL is slow, FTL hits performance considerably.
> 
> ...and if you'd actually looked at the picture, you'd have seen JFFS2
> bypassing it. Along with another footnote explaining it.

The (4) partitioning and JFFS2 on top is a step back from the current
UBI functionality. Now we can have resizable partitioning even for JFFS2
and JFFS2 can utilize the UBI wear levelling, which is way better than
the crude heuristics of JFFS2.

You want to force FLASH into device mapper for some strange and
non-obvious reason. Just the coincidence of "eraseBLOCK" and "BLOCKdevice"
is not really convincing.

You impose the usage of eraseblock size on FLASH, which is simply wrong:

DISK has a 1:1 relationship of "eraseblock" and minimal I/O. FLASH has
not. I did the math in a different mail and I'm not buying your factor
32 FLASH life time reduction for the price of having a bunch of lines of
code less in the kernel.

If you are really considering running ext3, xfs or whatever on top of FLASH,
please go and do the homework on CF-Cards and USB-Sticks. Run them into
the fast wearout death. And device mapper does nothing to
avoid that. Running ext3 on top of FLASH with a minimal I/O size of
erase block size is simply braindead.

tglx




Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:
> > > If a static volume is simply a non-dynamic volume, then device mapper
> > > can do that too. And countless other things. Which is not an aside.
> > > UBI growing to do all the things that device mapper does is exactly
> > > the thing we should be seeking to avoid.
> > 
> > No it can't and device mapper sits on top of block devices. FLASH is no
> > block device. Period.
> 
> Which of the following two properties does it lack?
> 
> - discrete blocks
> - non-sequential access to blocks
> 
> When you do the obvious s/blocks/eraseblocks/, this appears to be
> true.

It appears to be, but it is not. You enforce semantics on a device,
which it does not have.

> Saying "but I can't do I/O smaller than the blocksize" doesn't change
> this any more than it would for disks.

There is a huge difference. Disk block size is 512 bytes and FLASH block
size is at least 16KiB and up to 256KiB.

Just do the math:

Write sampling data streams in 2KiB chunks to your uber devicemapper on
a 1GiB device with 64KiB erase block size:

Fine grained FLASH aware writes allow 32 chunks in a block without
erasing the block.

Your method erases the block 32 times to write the same amount of data.

Result: You wear out the flash 32 times faster. Cool feature.
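
The arithmetic above can be written out as a small sketch. The helper name
and the simplified cost model (one erase per full block for FLASH aware
writes, one erase per chunk when the minimal I/O size is the erase block)
are assumptions for illustration only, not code from UBI or MTD:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: erase cycles needed to write total_bytes in
 * chunk_bytes pieces onto blocks of erase_block_bytes.
 */
static uint64_t erase_cycles(uint64_t total_bytes, uint64_t chunk_bytes,
                             uint64_t erase_block_bytes, int flash_aware)
{
    assert(chunk_bytes > 0 && erase_block_bytes >= chunk_bytes);

    if (flash_aware) {
        /* Chunks fill a block page by page: one erase per full block. */
        return (total_bytes + erase_block_bytes - 1) / erase_block_bytes;
    }
    /* Minimal I/O size == erase block: every chunk costs one erase. */
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

With 2KiB chunks and 64KiB erase blocks, filling one block costs 1 erase in
the FLASH aware case and 32 erases otherwise, which is the factor quoted
above.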

> Saying "but I can do smaller I/O efficiently in some circumstances"
> also doesn't change it.

We can do it under _any_ circumstances and that _does_ change it.
Implementing a clever block device layer on top of UBI is simple and
would provide FLASH page sized I/O, i.e. 2KiB in the above example.

> In historical UNIX, some tapes were block devices too. Because they
> supported seek().

I'm impressed. How exactly are "some tapes" comparable to FLASH chips?

Is your next proposal to throw away MTD-utils and use "mt" instead?

> > Device mapper can not provide a simple easy to decode scheme for boot
> > loaders. We need to be able to boot out of 512 - 2048 bytes of NAND FLASH
> > and be able to find the kernel or second stage boot loader in this
> > unordered device.
> > 
> > And no, fixed addresses do not work. Do you want to implement device
> > mapper into your Initial Bootloader stage ?
> 
> This is exactly the same problem as booting on a desktop PC. But
> somehow LILO manages. My first Linux box had a hell of a lot less disk
> than the platform I bootstrapped (and wrote NAND drivers for) last
> month had in NAND.

No, it is not. You get the absolute sector address of your second stage
and this is a complete no-brainer. The translation is done in the DISK
device.

You simply ignore the fact that inside each disk, USB stick, CF card,
whatever - there is a more or less intelligent controller device, which
does the mapping to the physical storage location. There is _NO_ such
thing on a bare FLASH chip.

It does not matter whether your embedded device had more NAND space
than my old CP/M machine's floppy. It simply matters that even the old
CP/M floppy device had some rudimentary intelligence on board.

Furthermore, I want to be able to get the bitflip correction on my second
stage loader / kernel in the same safe way as we do it for everything
else and still be able to bootstrap that from an extremely small
bootloader.

> > > If the right way is instead to extend the block layer and device
> > > mapper to encompass the quirks of NAND in a sensible fashion, then UBI
> > > should not go in.
> > 
> > No, block layer on top of FLASH needs 80% of the functionality of UBI in
> > the first place.
> 
> Incorrect. A block-based filesystem on top of flash needs this
> functionality. But a block device suitable to device mapper layering
> (which then provides the functionality) does not.

How exactly does device mapper provide:

A) across-device wear levelling?
B) dynamic partitioning for FLASH aware file systems?
C) across-device wear levelling for FLASH aware file systems?
D) background bit-flip corrections (copying affected blocks and recycling
the old ones)?
E) position independent placement of the second stage bootloader?

> > You need to implement a clever journalling block device
> > emulator in order to keep the data alive and the FLASH not worn out
> > within no time. You need the wear levelling, otherwise you can throw
> > away your FLASH in no time.
> 
> And that's why it's in my picture.

Yes, it is in your picture, but:

1) it excludes FLASH aware file systems and UBI does not.
2) your picture still does not explain how it achieves the above A),
B), C), D) and E)

Your extra path for partitioning(4) and JFFS2 is just a weird hack,
which makes your proposal completely absurd.

> > > Let me draw a picture so we have something to argue about:
> > > 
> > >  iSCSI/nbd(6)
> > >   |
> > > filesystem {swap  |  ext3ext3 jffs2
> > >   \   |   ||   /
> > >/   \  | dm-crypt->snapshot(5) /
> > > device mapper -|  

Re: [PATCH 2/3] swsusp: Do not use page flags

2007-03-19 Thread Andrew Morton
On Mon, 12 Mar 2007 22:19:20 +0100
"Rafael J. Wysocki" <[EMAIL PROTECTED]> wrote:

> Make swsusp use memory bitmaps instead of page flags for marking 'nosave' and
> free pages.  This allows us to 'recycle' two page flags that can be used for
> other purposes.  Also, the memory needed to store the bitmaps is allocated when
> necessary (ie. before the suspend) and freed after the resume which is more
> reasonable.
> 
> The patch is designed to minimize the amount of changes and there are some
> nice simplifications and optimizations possible on top of it.  I am going to
> implement them separately in the future.

Blows up with ia64 allmodconfig due to CONFIG_PM=y, CONFIG_SOFTWARE_SUSPEND=n:

kernel/power/main.c:223: error: redefinition of 'software_suspend'
include/linux/suspend.h:46: error: previous definition of 'software_suspend' was here

I had a look at fixing it, but it's not obvious why we're compiling most of
kernel/power/main.c when CONFIG_SOFTWARE_SUSPEND=n, so I'll send this series
back for repair please.


Re: 2.6.21-rc3-mm2

2007-03-19 Thread Randy Dunlap
On Wed, 7 Mar 2007 20:19:15 -0800 Andrew Morton wrote:

> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc3/2.6.21-rc3-mm2/
> 
> - This is the same as 2.6.21-rc3-mm1, except Con's CPU scheduler changes
>   were dropped.
> 
>   This is for A/B comparison purposes, and because those changes crashed on
>   one test setup.

I don't quite see why this error is happening.  Looks like all
the nested #includes should handle it...

CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_UTRACE=y
# PTRACE=n
# PROC_FS=n

In file included from arch/x86_64/kernel/crash.c:19:
include/linux/elfcore.h: In function 'elf_core_copy_regs':
include/linux/elfcore.h:103: error: dereferencing pointer to incomplete type
include/linux/elfcore.h:103: error: dereferencing pointer to incomplete type
make[1]: *** [arch/x86_64/kernel/crash.o] Error 1
make: *** [arch/x86_64/kernel] Error 2

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Re: [patch 2/13] signal/timer/event fds v7 - signalfd core ...

2007-03-19 Thread Davide Libenzi
On Tue, 20 Mar 2007, Oleg Nesterov wrote:

> On 03/19, Davide Libenzi wrote:
> >
> > +static void signalfd_unlock(struct signalfd_ctx *ctx,
> > +   struct signalfd_lockctx *lk)
> > +{
> > +   unlock_task_sighand(lk->tsk, &lk->flags);
> > +}
> 
> Again, this is a matter of taste. But I can't understand why signalfd_unlock()
> needs "signalfd_ctx *ctx" parameter. If we have "struct signalfd_lockctx *lk",
> signalfd_lock() can set up lk->ctx if it is ever needed.

With the new API, I agree. Removed.



- Davide




Re: [patch 2/13] signal/timer/event fds v7 - signalfd core ...

2007-03-19 Thread Oleg Nesterov
On 03/19, Davide Libenzi wrote:
>
> +static void signalfd_unlock(struct signalfd_ctx *ctx,
> + struct signalfd_lockctx *lk)
> +{
> + unlock_task_sighand(lk->tsk, &lk->flags);
> +}

Again, this is a matter of taste. But I can't understand why signalfd_unlock()
needs "signalfd_ctx *ctx" parameter. If we have "struct signalfd_lockctx *lk",
signalfd_lock() can set up lk->ctx if it is ever needed.

Oleg.



Re: [patch 2/13] signal/timer/event fds v7 - signalfd core ...

2007-03-19 Thread Davide Libenzi
On Tue, 20 Mar 2007, Oleg Nesterov wrote:

> On 03/19, Davide Libenzi wrote:
> >
> > +struct signalfd_lockctx {
> > +   struct task_struct *tsk;
> > +   struct sighand_struct *sighand;
> > +   unsigned long flags;
> > +};
> 
> signalfd_lockctx is "private" to signalfd_lock/signalfd_unlock. But
> lk->sighand is used only by signalfd_lock(). I'd suggest removing it.

Ack



> > +void signalfd_deliver(struct task_struct *tsk, int sig)
> > +{
> > +	struct sighand_struct *sighand = tsk->sighand;
> > +	struct signalfd_ctx *ctx, *tmp;
> > +
> > +	list_for_each_entry_safe(ctx, tmp, &sighand->sfdlist, lnk) {
> > +		/*
> > +		 * We use a negative signal value as a way to broadcast that the
> > +		 * sighand has been orphaned, so that we can notify all the
> > +		 * listeners about this. Remember the ctx->sigmask is inverted,
> > +		 * so if the user is interested in a signal, that corresponding
> > +		 * bit will be zero.
> > +		 */
> > +		if (sig < 0) {
> > +			if (ctx->tsk == tsk) {
> > +				ctx->tsk = NULL;
> > +				list_del_init(&ctx->lnk);
> > +				wake_up(&ctx->wqh);
> > +			}
> > +		} else if (sig > 0) {
> > +			if (!sigismember(&ctx->sigmask, sig))
> > +				wake_up(&ctx->wqh);
> > +		}
> > +	}
> > +}
> 
> I tried to avoid this comment, but can't help myself :)

Added BUG_ON() and using "else".



- Davide




Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Andrew Morton
On Mon, 19 Mar 2007 15:37:16 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> ...
>
> --- /dev/null 1970-01-01 00:00:00.0 +
> +++ linux-2.6.21-rc3-mm2/include/linux/quicklist.h	2007-03-16 02:19:15.0 -0700
> @@ -0,0 +1,95 @@
> +#ifndef LINUX_QUICKLIST_H
> +#define LINUX_QUICKLIST_H
> +/*
> + * Fast allocations and disposal of pages. Pages must be in the condition
> + * as needed after allocation when they are freed. Per cpu lists of pages
> + * are kept that only contain node local pages.
> + *
> + * (C) 2007, SGI. Christoph Lameter <[EMAIL PROTECTED]>
> + */
> +#include 
> +#include 
> +#include 
> +
> +#ifdef CONFIG_QUICKLIST
> +
> +#ifndef CONFIG_NR_QUICK
> +#define CONFIG_NR_QUICK 1
> +#endif

No, please don't define config items like this.  Do it in Kconfig.

> +static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *))
> +{
> +	struct quicklist *q;
> +	void **p = NULL;
> +
> +	q = &get_cpu_var(quicklist)[nr];
> +	p = q->page;
> +	if (likely(p)) {
> +		q->page = p[0];
> +		p[0] = NULL;
> +		q->nr_pages--;
> +	}
> +	put_cpu_var(quicklist);
> +	if (likely(p))
> +		return p;
> +
> +	p = (void *)__get_free_page(flags | __GFP_ZERO);
> +	if (ctor && p)
> +		ctor(p);
> +	return p;
> +}
> +
> +static inline void quicklist_free(int nr, void (*dtor)(void *), void *pp)
> +{
> +	struct quicklist *q;
> +	void **p = pp;
> +	struct page *page = virt_to_page(p);
> +	int nid = page_to_nid(page);
> +
> +	if (unlikely(nid != numa_node_id())) {
> +		if (dtor)
> +			dtor(p);
> +		free_page((unsigned long)p);
> +		return;
> +	}
> +
> +	q = &get_cpu_var(quicklist)[nr];
> +	p[0] = q->page;
> +	q->page = p;
> +	q->nr_pages++;
> +	put_cpu_var(quicklist);
> +}

These guys seem to have multiple callsites for ia64 at least and probably
would benefit from being uninlined.
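
For readers following the data structure: the quoted functions keep a LIFO
list threaded through the first word of each cached page. A minimal
userspace sketch of that idea follows; the names, the calloc() fallback and
the 4096-byte page size are assumptions for illustration, and the per-cpu
and NUMA handling of the real patch is left out:

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed page size, illustration only */

struct quicklist_sketch {
    void *page;      /* head of the LIFO list of cached pages */
    long nr_pages;   /* number of pages currently cached */
};

/* Pop a cached page if there is one, else get a fresh zeroed page. */
static void *ql_alloc(struct quicklist_sketch *q)
{
    void **p = q->page;

    if (p) {
        q->page = p[0];   /* next page was stored in the first word */
        p[0] = NULL;      /* restore the "all zero" state */
        q->nr_pages--;
        return p;
    }
    return calloc(1, SKETCH_PAGE_SIZE);
}

/* Push a page back; the caller must hand it back zeroed. */
static void ql_free(struct quicklist_sketch *q, void *pp)
{
    void **p = pp;

    assert(p != NULL);
    p[0] = q->page;       /* link to the previous head */
    q->page = p;
    q->nr_pages++;
}
```

A page freed with ql_free() is the next one handed out by ql_alloc(), so a
recently used (and probably cache-hot) page is reused first.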

> +void quicklist_check(int nr, void (*dtor)(void *));
> +unsigned long quicklist_total_size(void);
> +
> +#else
> +void quicklist_check(int nr, void (*dtor)(void *))
> +{
> +}
> +
> +unsigned long quicklist_total_size(void)
> +{
> +	return 0;
> +}
> +#endif

That obviously won't link and wasn't tested.  Making these static inline
will help.

> +/*
> + * Quicklist support.
> + *
> + * Quicklists are light weight lists of pages that have a defined state
> + * on alloc and free. Pages must be in the quicklist specific defined state
> + * (zero by default) when the page is freed. It seems that the initial idea
> + * for such lists first came from Dave Miller and then various other people
> + * improved on it.
> + *
> + * Copyright (C) 2007 SGI,
> + *   Christoph Lameter <[EMAIL PROTECTED]>
> + *   Generalized, added support for multiple lists and
> + *   constructors / destructors.
> + */
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +DEFINE_PER_CPU(struct quicklist, quicklist)[CONFIG_NR_QUICK];

If we uninline those big inlines, this can perhaps be made static.

> +#define MIN_PAGES25
> +#define MAX_FREES_PER_PASS   16
> +#define FRACTION_OF_NODE_MEM 16

Are these constants optimal for all architectures?

> +static unsigned long max_pages(void)
> +{
> +	unsigned long node_free_pages, max;
> +
> +	node_free_pages = node_page_state(numa_node_id(),
> +					  NR_FREE_PAGES);
> +	max = node_free_pages / FRACTION_OF_NODE_MEM;
> +	return max(max, (unsigned long)MIN_PAGES);
> +}
> +
> +static long min_pages_to_free(struct quicklist *q)
> +{
> +	long pages_to_free;
> +
> +	pages_to_free = q->nr_pages - max_pages();
> +
> +	return min(pages_to_free, (long)MAX_FREES_PER_PASS);
> +}

min_t and max_t are the standard way of avoiding that warning.  Or stick a
UL on the constants (which is probably better).
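
As an illustration of the second option, here is a userspace sketch of the
same arithmetic with suffixed constants, so that both operands of each
comparison already have the same type and no casts are needed. The function
names and the node_free_pages parameter are assumptions; plain conditionals
stand in for the kernel's min()/max():

```c
#include <assert.h>

#define MIN_PAGES            25UL
#define MAX_FREES_PER_PASS   16L
#define FRACTION_OF_NODE_MEM 16UL

/* Upper bound on cached pages: a fraction of free node memory, at least MIN_PAGES. */
static unsigned long max_pages_sketch(unsigned long node_free_pages)
{
    unsigned long max = node_free_pages / FRACTION_OF_NODE_MEM;

    return max > MIN_PAGES ? max : MIN_PAGES;
}

/* Pages to trim in one pass, capped at MAX_FREES_PER_PASS; <= 0 means none. */
static long pages_to_free_sketch(long nr_pages, unsigned long node_free_pages)
{
    long excess;

    assert(nr_pages >= 0);
    excess = nr_pages - (long)max_pages_sketch(node_free_pages);
    return excess < MAX_FREES_PER_PASS ? excess : MAX_FREES_PER_PASS;
}
```

With 1600 free pages on the node the limit is 100 cached pages, so a list
holding 130 pages is trimmed by 16 in one pass and one holding 105 by 5.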

> +void quicklist_check(int nr, void (*dtor)(void *))
> +{
> +	long pages_to_free;
> +	struct quicklist *q;
> +
> +	q = &get_cpu_var(quicklist)[nr];
> +	if (q->nr_pages > MIN_PAGES) {
> +		pages_to_free = min_pages_to_free(q);
> +
> +		while (pages_to_free > 0) {
> +			void *p = quicklist_alloc(nr, 0, NULL);
> +
> +			if (dtor)
> +				dtor(p);
> +			free_page((unsigned long)p);
> +			pages_to_free--;
> +		}
> +	}
> +	put_cpu_var(quicklist);
> +}

The use of a literal 0 as a gfp_t is a bit ugly.  I assume that we don't
care because we should never actually call into the page allocator for this
caller.  But it's not terribly clear because there is no commentary
describing what this function is supposed to do.

The name foo_check() is unfortunate: it implies that the function checks
something (ie: has no side-effects).  But this function _does_ change
things and perhaps should be called 

Re: [patch 2/13] signal/timer/event fds v7 - signalfd core ...

2007-03-19 Thread Oleg Nesterov
On 03/19, Davide Libenzi wrote:
>
> +struct signalfd_lockctx {
> + struct task_struct *tsk;
> + struct sighand_struct *sighand;
> + unsigned long flags;
> +};

signalfd_lockctx is "private" to signalfd_lock/signalfd_unlock. But lk->sighand
is used only by signalfd_lock(). I'd suggest removing it.

> +void signalfd_deliver(struct task_struct *tsk, int sig)
> +{
> +	struct sighand_struct *sighand = tsk->sighand;
> +	struct signalfd_ctx *ctx, *tmp;
> +
> +	list_for_each_entry_safe(ctx, tmp, &sighand->sfdlist, lnk) {
> +		/*
> +		 * We use a negative signal value as a way to broadcast that the
> +		 * sighand has been orphaned, so that we can notify all the
> +		 * listeners about this. Remember the ctx->sigmask is inverted,
> +		 * so if the user is interested in a signal, that corresponding
> +		 * bit will be zero.
> +		 */
> +		if (sig < 0) {
> +			if (ctx->tsk == tsk) {
> +				ctx->tsk = NULL;
> +				list_del_init(&ctx->lnk);
> +				wake_up(&ctx->wqh);
> +			}
> +		} else if (sig > 0) {
> +			if (!sigismember(&ctx->sigmask, sig))
> +				wake_up(&ctx->wqh);
> +		}
> +	}
> +}

I tried to avoid this comment, but can't help myself :)

This is a matter of taste, of course, but imho this is a classical "hide the
problem" example.

Why "else if (sig > 0)" ? sig can't be == 0. In my opinion, it is better to
add BUG_ON(!sig), but use just "else".
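
A userspace sketch of the suggested shape, for illustration only: assert()
stands in for BUG_ON(), a plain bitmask stands in for the inverted
ctx->sigmask, and the enum and function names are hypothetical:

```c
#include <assert.h>

enum sfd_action { SFD_NONE, SFD_DETACH_AND_WAKE, SFD_WAKE };

/*
 * Decide what to do for one listener. inverted_mask has a bit set for
 * every signal the listener is NOT interested in (bit sig - 1).
 */
static enum sfd_action sfd_deliver_action(int sig, int ctx_owns_task,
                                          unsigned long inverted_mask)
{
    assert(sig != 0);   /* userspace stand-in for BUG_ON(!sig) */
    assert(sig <= (int)(sizeof(unsigned long) * 8));

    if (sig < 0)        /* sighand orphaned: detach matching listeners */
        return ctx_owns_task ? SFD_DETACH_AND_WAKE : SFD_NONE;

    /* Plain "else": sig > 0 is the only case left. */
    return (inverted_mask & (1UL << (sig - 1))) ? SFD_NONE : SFD_WAKE;
}
```

The impossible sig == 0 case now fails loudly instead of silently falling
through both branches.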

Oleg.



Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Andrew Morton
On Mon, 19 Mar 2007 16:57:55 -0700 (PDT)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Mon, 19 Mar 2007, Andrew Morton wrote:
> 
> > Has it been proven that quicklists are superior to simply going direct to
> > the page allocator for these pages?
> 
> Yes.

Sigh.

Please provide proof that quicklists are superior to simply going direct to
the page allocator for these pages.

> > Would it provide a superior solution if we were to a) stop zeroing out the
> > pte's when doing a fullmm==1 teardown and b) go direct to the page allocator
> > for these pages?
> 
> I doubt it. The zeroing is a by-product of our way of serializing pte
> handling. It's going to be difficult to change that.

Nick didn't think so, and I don't see the problem either.

We'll save on some bus traffic by avoiding the writeback, but how much
effect that will have we don't know.  Presumably little.



unify encoding of files in Documentation

2007-03-19 Thread Till Maas
Hiyas,

at the moment, some files in Documentation are utf-8 encoded and some are
latin1 encoded. Therefore I propose to change the default encoding to utf-8,
because this is the encoding that many current Linux distributions use.

I can send a patch, if required. If you want to change the encoding of a file 
from latin1 to utf-8 you can use recode:

recode latin1..utf-8 file.txt

This changes the encoding in place.

Regards,
Till







Re: [PATCH 0/2] wistron_btns: More keymaps

2007-03-19 Thread Éric Piel

19.03.2007 22:28, Dmitry Torokhov wrote:
> On 3/15/07, Éric Piel <[EMAIL PROTECTED]> wrote:
>>
>> Ok, so let me summarize:
>>
>> There are two kinds of keys on those laptops (for which we are not sure
>> about the keycode that it should generate):
>> * Laptop screen on/off
>> * Display output selection (for instance: laptop/external/both)
>>
>> The possible keycodes that we could assign to them:
>> KEY_SCREEN
>> KEY_MEDIA
>> KEY_MODE
>> KEY_VIDEO
>> KEY_SWITCHVIDEOMODE
>> KEY_COMPUTER
>> KEY_PC
>>
>> From the discussion, I had the feeling this association would be the
>> least incorrect:
>> Screen on/off : KEY_SCREEN
>
> It looks like DVB folks chose to use KEY_SCREEN and KEY_WINDOW to
> switch applications between full screen and windowed modes

Just for info, I couldn't find any reference to KEY_WINDOW. Anyway,
indeed, KEY_SCREEN is already used for "full screen" (although sometimes
it's KEY_ZOOM :-/) so it is better not to use it if something else is
possible.

> so we'll
> have to invent our own keycode. KEY_DISPLAYTOGGLE anyone?

What about KEY_DISPLAYONOFF ? :-)
What should be its value? Would 239 be fine?

>> Display selection : KEY_SWITCHVIDEOMODE
>
> I agree here.

BTW, I'm thinking of implementing led support. However, there are two 
mechanisms for leds in the kernel: the "input layer" leds and the "full 
feature" leds. The laptops may have up to three leds: mail, wifi, 
bluetooth. The input layer has LED_MAIL but no wifi nor bluetooth. The 
led subsystem has the advantage of the very extensible "trigger" 
mechanism. Which of the subsystems would you recommend me to use?


See you,
Eric


Re: [2.6 patch] x25_forward_call(): fix NULL dereferences

2007-03-19 Thread David Miller
From: Adrian Bunk <[EMAIL PROTECTED]>
Date: Mon, 19 Mar 2007 10:24:03 +0100

> This patch fixes two NULL dereferences spotted by the Coverity checker.
> 
> For a better understanding, the "diff -uwp" output (that ignores the 
> indentation changes) is:

I'll apply this, thanks Adrian.


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Rusty Russell
On Mon, 2007-03-19 at 11:38 -0700, Linus Torvalds wrote:
> 
> On Mon, 19 Mar 2007, Eric W. Biederman wrote:
> > 
> > True.  You can use all of the call clobbered registers.
> 
> Quite often, the biggest single win of inlining is not so much the code 
> size (although if done right, that will be smaller too), but the fact that 
> inlining DOES NOT CLOBBER AS MANY REGISTERS!

Thanks Linus.

*This* was the reason that the current hand-coded calls only clobber
%eax.  It was a compromise between native (no clobbers) and others (might
need a reg).

Now, since we decided to allow paravirt_ops operations to be normal C
(ie. the patching is optional and done late), we actually push and pop
%ecx and %edx.  This makes the call site 10 bytes long, which is a nice
size for patching anyway (enough for a movl $0, , a-la lguest's
cli, or movw $0, %gs: if we supported SMP).

The current 6 paravirt ops which are patched cover the vast majority of
calls (until the Xen patches, then we need ~4 more?).  Jeremy chose to
expand patching to cover *all* paravirt ops, rather than just the new
hot ones, and that's where we tipped over the ugliness threshold.

Cheers,
Rusty.



Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Christoph Lameter
On Mon, 19 Mar 2007, Andrew Morton wrote:

> Has it been proven that quicklists are superior to simply going direct to the
> page allocator for these pages?

Yes.
 
> Would it provide a superior solution if we were to a) stop zeroing out the
> pte's when doing a fullmm==1 teardown and b) go direct to the page allocator
> for these pages?

I doubt it. The zeroing is a by-product of our way of serializing pte
handling. It's going to be difficult to change that.




Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread David Miller
From: Andrew Morton <[EMAIL PROTECTED]>
Date: Mon, 19 Mar 2007 16:53:29 -0700

> Would it provide a superior solution if we were to a) stop zeroing out the
> pte's when doing a fullmm==1 teardown and b) go direct to the page allocator
> for these pages?

While you could avoid zero'ing them out, you certainly can't avoid
reading them into the cpu caches.

And for the PGDs you have to initialize these things partially to
non-zero values on x86{,_64} on every new PGD you allocate, which is a
complete waste of cpu cache dirtying.  Avoiding this overhead alone
justifies the quicklists I think.  It's not just a "zero" thing,
so GFP_ZERO cannot help you here.

The more I think about it the more I like the quicklists.



Re: [QUICKLIST 1/5] Quicklists for page table pages V3

2007-03-19 Thread Andrew Morton
On Mon, 19 Mar 2007 15:37:16 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> This patchset introduces an arch independent framework to handle lists
> of recently used page table pages to replace the existing (ab)use of the
> slab for that purpose.
> 
> 1. Proven code from the IA64 arch.

Has it been proven that quicklists are superior to simply going direct to the
page allocator for these pages?

Would it provide a superior solution if we were to a) stop zeroing out the
pte's when doing a fullmm==1 teardown and b) go direct to the page allocator
for these pages?


[patch 4/13] signal/timer/event fds v7 - signalfd wire up x86_64 arch ...

2007-03-19 Thread Davide Libenzi
This patch wires the signalfd system call to the x86_64 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-x86_64/unistd.h	2007-03-19 16:03:26.0 -0700
+++ linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h	2007-03-19 16:41:30.0 -0700
@@ -619,8 +619,10 @@
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_signalfd  280
+__SYSCALL(__NR_signalfd, sys_signalfd)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_signalfd
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/x86_64/ia32/ia32entry.S	2007-03-19 16:03:26.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S	2007-03-19 16:41:30.0 -0700
@@ -714,9 +714,10 @@
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
-   .quad sys_tee
+   .quad sys_tee   /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
.quad sys_epoll_pwait
+   .quad sys_signalfd  /* 320 */
 ia32_syscall_end:  



[patch 8/13] signal/timer/event fds v7 - timerfd wire up x86_64 arch ...

2007-03-19 Thread Davide Libenzi
This patch wires the timerfd system call to the x86_64 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/x86_64/ia32/ia32entry.S	2007-03-19 16:41:30.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S	2007-03-19 16:41:37.0 -0700
@@ -720,4 +720,5 @@
.quad sys_getcpu
.quad sys_epoll_pwait
.quad sys_signalfd  /* 320 */
+   .quad sys_timerfd
 ia32_syscall_end:  
Index: linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-x86_64/unistd.h	2007-03-19 16:41:30.0 -0700
+++ linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h	2007-03-19 16:41:37.0 -0700
@@ -621,8 +621,10 @@
 __SYSCALL(__NR_move_pages, sys_move_pages)
 #define __NR_signalfd  280
 __SYSCALL(__NR_signalfd, sys_signalfd)
+#define __NR_timerfd   281
+__SYSCALL(__NR_timerfd, sys_timerfd)
 
-#define __NR_syscall_max __NR_signalfd
+#define __NR_syscall_max __NR_timerfd
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR



[patch 5/13] signal/timer/event fds v7 - signalfd compat code ...

2007-03-19 Thread Davide Libenzi
This patch implements the necessary compat code for the signalfd system call.


Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/fs/compat.c
===
--- linux-2.6.21-rc3.quilt.orig/fs/compat.c	2007-03-19 16:03:26.0 -0700
+++ linux-2.6.21-rc3.quilt/fs/compat.c	2007-03-19 16:41:32.0 -0700
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -2235,3 +2236,24 @@
return sys_ni_syscall();
 }
 #endif
+
+asmlinkage long compat_sys_signalfd(int ufd,
+				    const compat_sigset_t __user *sigmask,
+				    compat_size_t sigsetsize)
+{
+	compat_sigset_t ss32;
+	sigset_t tmp;
+	sigset_t __user *ksigmask;
+
+	if (sigsetsize != sizeof(compat_sigset_t))
+		return -EINVAL;
+	if (copy_from_user(&ss32, sigmask, sizeof(ss32)))
+		return -EFAULT;
+	sigset_from_compat(&tmp, &ss32);
+	ksigmask = compat_alloc_user_space(sizeof(sigset_t));
+	if (copy_to_user(ksigmask, &tmp, sizeof(sigset_t)))
+		return -EFAULT;
+
+	return sys_signalfd(ufd, ksigmask, sizeof(sigset_t));
+}
+



[patch 10/13] signal/timer/event fds v7 - eventfd core ...

2007-03-19 Thread Davide Libenzi
This is a very simple and light file descriptor that can be used as an
event wait/dispatch mechanism by userspace (both wait and dispatch) and by
the kernel (dispatch only). It can be used instead of pipe(2) in all cases
where a pipe would simply be used to signal events. Its kernel overhead
is much lower than that of a pipe, and it does not consume two fds. When
used in the kernel, it can offer an fd-bridge to enable, for example,
functionality like KAIO or syslets/threadlets to signal the completion of
certain operations to an fd. More generally, an eventfd can be used by the
kernel to signal readiness, in a POSIX poll/select way, of interfaces that
would otherwise be incompatible with it. The API is:

int eventfd(unsigned int count);

The eventfd API accepts an initial "count" parameter, and returns an
eventfd fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and
write(2).
The POLLIN flag is raised when the internal counter is greater than zero.
The POLLOUT flag is raised when at least a value of "1" can be written to
the internal counter.
The POLLERR flag is raised when an overflow in the counter value is detected.
The write(2) operation can never overflow the counter, since it blocks
(unless O_NONBLOCK is set, in which case -EAGAIN is returned).
The eventfd_signal() function can overflow it, however, since it is not
supposed to sleep during its operation.
The read(2) function reads the __u64 counter value and resets the internal
value to zero. If the value read is equal to (__u64) -1, an overflow
happened on the internal counter (due to 2^64 eventfd_signal() posts
that have never been retired - unlikely, but possible).
The write(2) call writes a __u64 count value and adds it
to the current counter. The eventfd fd also supports O_NONBLOCK.
On the kernel side, we have:

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, unsigned int n);

The eventfd_fget() should be called to get a struct file* from an eventfd
fd (this is an fget() + check of f_op being an eventfd fops pointer).
The kernel can then call eventfd_signal() every time it wants to post
an event to userspace. The eventfd_signal() function can be called from any
context.
An eventfd() simple test and bench is available here:

http://www.xmailserver.org/eventfd-bench.c

This is the eventfd-based version of pipetest-4 (pipe(2) based):

http://www.xmailserver.org/pipetest-4.c

Not that performance matters much in the eventfd case, but eventfd-bench
shows almost double the performance of pipetest-4.
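
A minimal userspace sketch of the read/write semantics described above.
It assumes the eventfd(initval, flags) wrapper from <sys/eventfd.h> that
glibc provides on later kernels, which takes an extra flags argument
compared with the one-argument prototype in this patch; the helper names
are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Post n events: write(2) adds n to the internal counter. */
static int post_events(int efd, uint64_t n)
{
    return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/* Collect events: read(2) returns the counter and resets it to zero. */
static int drain_events(int efd, uint64_t *out)
{
    assert(out != NULL);
    return read(efd, out, sizeof(*out)) == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

Two posts of 3 and 4 followed by one drain return 7; a single fd carries
both directions, where a pipe would need two.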




Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.21-rc3.quilt/fs/Makefile
===
--- linux-2.6.21-rc3.quilt.orig/fs/Makefile	2007-03-19 16:41:33.0 -0700
+++ linux-2.6.21-rc3.quilt/fs/Makefile	2007-03-19 16:41:40.0 -0700
@@ -11,7 +11,7 @@
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o anon_inodes.o signalfd.o timerfd.o
+   stack.o anon_inodes.o signalfd.o timerfd.o eventfd.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
Index: linux-2.6.21-rc3.quilt/include/linux/syscalls.h
===
--- linux-2.6.21-rc3.quilt.orig/include/linux/syscalls.h	2007-03-19 16:41:33.0 -0700
+++ linux-2.6.21-rc3.quilt/include/linux/syscalls.h	2007-03-19 16:41:40.0 -0700
@@ -605,6 +605,7 @@
 asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemask);
 asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
const struct itimerspec __user *utmr);
+asmlinkage long sys_eventfd(unsigned int count);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
Index: linux-2.6.21-rc3.quilt/fs/eventfd.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc3.quilt/fs/eventfd.c 2007-03-19 16:41:40.0 -0700
@@ -0,0 +1,253 @@
+/*
+ *  fs/eventfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+struct eventfd_ctx {
+   spinlock_t lock;
+   wait_queue_head_t wqh;
+   __u64 count;
+};
+
+
+static void eventfd_cleanup(struct eventfd_ctx *ctx);
+static int eventfd_close(struct inode *inode, struct file *file);
+static unsigned int eventfd_poll(struct file *file, poll_table *wait);
+static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
+   loff_t *ppos);
+static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t 
count,
+

[patch 11/13] signal/timer/event fds v7 - eventfd wire up i386 arch ...

2007-03-19 Thread Davide Libenzi
This patch wires the eventfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi 


- Davide


Index: linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/i386/kernel/syscall_table.S
2007-03-19 16:41:35.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S 2007-03-19 
16:41:42.0 -0700
@@ -321,3 +321,4 @@
.long sys_epoll_pwait
.long sys_signalfd  /* 320 */
.long sys_timerfd
+   .long sys_eventfd
Index: linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-i386/unistd.h   2007-03-19 
16:41:35.0 -0700
+++ linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h2007-03-19 
16:41:42.0 -0700
@@ -327,10 +327,11 @@
 #define __NR_epoll_pwait   319
 #define __NR_signalfd  320
 #define __NR_timerfd   321
+#define __NR_eventfd   322
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 322
+#define NR_syscalls 323
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR



[patch 12/13] signal/timer/event fds v7 - eventfd wire up x86_64 arch ...

2007-03-19 Thread Davide Libenzi
This patch wires the eventfd system call to the x86_64 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/x86_64/ia32/ia32entry.S2007-03-19 
16:41:37.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S 2007-03-19 
16:41:43.0 -0700
@@ -721,4 +721,5 @@
.quad sys_epoll_pwait
.quad sys_signalfd  /* 320 */
.quad sys_timerfd
+   .quad sys_eventfd
 ia32_syscall_end:  
Index: linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-x86_64/unistd.h 2007-03-19 
16:41:37.0 -0700
+++ linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h  2007-03-19 
16:41:43.0 -0700
@@ -623,8 +623,10 @@
 __SYSCALL(__NR_signalfd, sys_signalfd)
 #define __NR_timerfd   281
 __SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_eventfd   282
+__SYSCALL(__NR_eventfd, sys_eventfd)
 
-#define __NR_syscall_max __NR_timerfd
+#define __NR_syscall_max __NR_eventfd
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR



[patch 9/13] signal/timer/event fds v7 - timerfd compat code ...

2007-03-19 Thread Davide Libenzi
This patch implements the necessary compat code for the timerfd system call.


Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/fs/compat.c
===
--- linux-2.6.21-rc3.quilt.orig/fs/compat.c 2007-03-19 16:41:32.0 
-0700
+++ linux-2.6.21-rc3.quilt/fs/compat.c  2007-03-19 16:41:38.0 -0700
@@ -2257,3 +2257,23 @@
return sys_signalfd(ufd, ksigmask, sizeof(sigset_t));
 }
 
+
+asmlinkage long compat_sys_timerfd(int ufd, int clockid, int flags,
+  const struct compat_itimerspec __user *utmr)
+{
+   long res;
+   struct itimerspec t;
+   struct itimerspec __user *ut;
+
+   res = -EFAULT;
+   if (get_compat_itimerspec(&t, utmr))
+   goto err_exit;
+   ut = compat_alloc_user_space(sizeof(*ut));
+   if (copy_to_user(ut, &t, sizeof(t)) )
+   goto err_exit;
+
+   res = sys_timerfd(ufd, clockid, flags, ut);
+err_exit:
+   return res;
+}
+
Index: linux-2.6.21-rc3.quilt/include/linux/compat.h
===
--- linux-2.6.21-rc3.quilt.orig/include/linux/compat.h  2007-03-19 
16:03:26.0 -0700
+++ linux-2.6.21-rc3.quilt/include/linux/compat.h   2007-03-19 
16:41:38.0 -0700
@@ -225,6 +225,11 @@
return lhs->tv_nsec - rhs->tv_nsec;
 }
 
+extern int get_compat_itimerspec(struct itimerspec *dst,
+const struct compat_itimerspec __user *src);
+extern int put_compat_itimerspec(struct compat_itimerspec __user *dst,
+const struct itimerspec *src);
+
 asmlinkage long compat_sys_adjtimex(struct compat_timex __user *utp);
 
 extern int compat_printk(const char *fmt, ...);
Index: linux-2.6.21-rc3.quilt/kernel/compat.c
===
--- linux-2.6.21-rc3.quilt.orig/kernel/compat.c 2007-03-19 16:03:26.0 
-0700
+++ linux-2.6.21-rc3.quilt/kernel/compat.c  2007-03-19 16:41:38.0 
-0700
@@ -475,8 +475,8 @@
return min_length;
 }
 
-static int get_compat_itimerspec(struct itimerspec *dst, 
-struct compat_itimerspec __user *src)
+int get_compat_itimerspec(struct itimerspec *dst,
+ const struct compat_itimerspec __user *src)
 { 
if (get_compat_timespec(&dst->it_interval, &src->it_interval) ||
get_compat_timespec(&dst->it_value, &src->it_value))
@@ -484,8 +484,8 @@
return 0;
 } 
 
-static int put_compat_itimerspec(struct compat_itimerspec __user *dst, 
-struct itimerspec *src)
+int put_compat_itimerspec(struct compat_itimerspec __user *dst,
+ const struct itimerspec *src)
 { 
if (put_compat_timespec(&src->it_interval, &dst->it_interval) ||
put_compat_timespec(&src->it_value, &dst->it_value))



[patch 7/13] signal/timer/event fds v7 - timerfd wire up i386 arch ...

2007-03-19 Thread Davide Libenzi
This patch wires the timerfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/i386/kernel/syscall_table.S
2007-03-19 16:41:29.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S 2007-03-19 
16:41:35.0 -0700
@@ -320,3 +320,4 @@
.long sys_getcpu
.long sys_epoll_pwait
.long sys_signalfd  /* 320 */
+   .long sys_timerfd
Index: linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-i386/unistd.h   2007-03-19 
16:41:29.0 -0700
+++ linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h2007-03-19 
16:41:35.0 -0700
@@ -326,10 +326,11 @@
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
 #define __NR_signalfd  320
+#define __NR_timerfd   321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 321
+#define NR_syscalls 322
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR



[patch 13/13] signal/timer/event fds v7 - KAIO eventfd support example ...

2007-03-19 Thread Davide Libenzi
This is an example of how to add eventfd support to the current KAIO code,
in order to enable KAIO to post readiness events to a pollable fd
(hence compatible with POSIX select/poll). The KAIO code simply signals
the eventfd fd when events are ready, and this triggers a POLLIN in the fd.
This patch uses a member of struct iocb that was reserved for future use to
pass an eventfd file descriptor, which KAIO will use to post events every time
a request completes. At that point, an io_getevents() will return the
completed result in a struct io_event.
I made a quick test program to verify the patch, and it runs fine here:

http://www.xmailserver.org/eventfd-aio-test.c

The test program uses poll(2), but it would, of course, work with select and
epoll too.
This allows scheduling both block I/O and requests to other pollable devices,
and waiting for the results using select/poll/epoll.
In a typical scenario, an application would submit KAIO requests using
io_submit(), and would also use epoll_ctl() on the whole other class of
devices (which, with the addition of signals, timers and user events, is now
pretty much complete), and then would:

    epoll_wait(...);
    for_each_event {
        if (curr_event_is_kaiofd) {
            io_getevents();
            dispatch_aio_events();
        } else {
            dispatch_epoll_event();
        }
    }
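
The dispatch loop above can be made concrete along the following lines. This is only a sketch under stated assumptions: it uses the glibc eventfd() and epoll_create1() wrappers, which postdate this patch, and it fakes the AIO completions by writing to the eventfd from userspace, so no KAIO request is actually submitted. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Wait once on epfd and dispatch. Returns the number of completions
 * reported through the eventfd `kaiofd`, or -1 on error.
 */
static long wait_and_dispatch(int epfd, int kaiofd, int timeout_ms)
{
    struct epoll_event evs[8];
    long completions = 0;
    int i, n;

    assert(epfd >= 0 && kaiofd >= 0);
    n = epoll_wait(epfd, evs, 8, timeout_ms);
    if (n < 0)
        return -1;
    for (i = 0; i < n; i++) {
        if (evs[i].data.fd == kaiofd) {
            uint64_t cnt;

            if (read(kaiofd, &cnt, sizeof(cnt)) != (ssize_t)sizeof(cnt))
                return -1;
            /* a real program would call io_getevents() here */
            completions += (long)cnt;
        } else {
            /* a real program would dispatch the epoll event here */
        }
    }
    return completions;
}

/* Demo: pretend two requests completed, then run one dispatch pass. */
static long epoll_eventfd_demo(void)
{
    struct epoll_event ev = { .events = EPOLLIN };
    uint64_t two = 2;
    long res = -1;
    int epfd = epoll_create1(0);
    int efd = eventfd(0, EFD_NONBLOCK);

    if (epfd >= 0 && efd >= 0) {
        ev.data.fd = efd;
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0 &&
            write(efd, &two, sizeof(two)) == (ssize_t)sizeof(two))
            res = wait_and_dispatch(epfd, efd, 0);
    }
    if (efd >= 0)
        close(efd);
    if (epfd >= 0)
        close(epfd);
    return res;
}
```

The count read from the eventfd says how many completions were posted since the last read, which gives the caller a bound for the number of events to collect.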



Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.21-rc3.quilt/fs/aio.c
===
--- linux-2.6.21-rc3.quilt.orig/fs/aio.c2007-03-19 16:03:25.0 
-0700
+++ linux-2.6.21-rc3.quilt/fs/aio.c 2007-03-19 16:41:45.0 -0700
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -421,6 +422,7 @@
req->private = NULL;
req->ki_iovec = NULL;
INIT_LIST_HEAD(&req->ki_run_list);
+   req->ki_eventfd = ERR_PTR(-EINVAL);
 
/* Check if the completion queue has enough free space to
 * accept an event from this io.
@@ -462,6 +464,8 @@
 {
assert_spin_locked(&ctx->ctx_lock);
 
+   if (!IS_ERR(req->ki_eventfd))
+   fput(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
@@ -946,6 +950,14 @@
return 1;
}
 
+   /*
+* Check if the user asked us to deliver the result through an
+* eventfd. The eventfd_signal() function is safe to be called
+* from IRQ context.
+*/
+   if (unlikely(!IS_ERR(iocb->ki_eventfd)))
+   eventfd_signal(iocb->ki_eventfd, 1);
+
info = &ctx->ring_info;
 
/* add a completion event to the ring buffer.
@@ -1555,6 +1567,19 @@
fput(file);
return -EAGAIN;
}
+   if (iocb->aio_resfd != 0) {
+   /*
+* If the aio_resfd field of the iocb is not zero, get an
+* instance of the file* now. The file descriptor must be
+* an eventfd() fd, and will be signaled for each completed
+* event using the eventfd_signal() function.
+*/
+   req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
+   if (IS_ERR(req->ki_eventfd)) {
+   ret = PTR_ERR(req->ki_eventfd);
+   goto out_put_req;
+   }
+   }
 
req->ki_filp = file;
ret = put_user(req->ki_key, &user_iocb->aio_key);
Index: linux-2.6.21-rc3.quilt/include/linux/aio.h
===
--- linux-2.6.21-rc3.quilt.orig/include/linux/aio.h 2007-03-19 
16:03:25.0 -0700
+++ linux-2.6.21-rc3.quilt/include/linux/aio.h  2007-03-19 16:41:45.0 
-0700
@@ -119,6 +119,12 @@
 
struct list_headki_list;/* the aio core uses this
 * for cancellation */
+
+   /*
+* If the aio_resfd field of the userspace iocb is not zero,
+* this is the underlying file* to deliver event to.
+*/
+   struct file *ki_eventfd;
 };
 
 #define is_sync_kiocb(iocb)((iocb)->ki_key == KIOCB_SYNC_KEY)
Index: linux-2.6.21-rc3.quilt/include/linux/aio_abi.h
===
--- linux-2.6.21-rc3.quilt.orig/include/linux/aio_abi.h 2007-03-19 
16:03:25.0 -0700
+++ linux-2.6.21-rc3.quilt/include/linux/aio_abi.h  2007-03-19 
16:41:45.0 -0700
@@ -84,7 +84,11 @@
 
/* extra parameters */
__u64   aio_reserved2;  /* TODO: use this for a (struct sigevent *) */
-   __u64   aio_reserved3;
+   __u32   aio_reserved3;
+   /*
+* If different from 0, this is an eventfd to deliver AIO results to
+*/
+   __u32   aio_resfd;
 }; /* 64 bytes */
 
 #undef IFBIG


[patch 6/13] signal/timer/event fds v7 - timerfd core ...

2007-03-19 Thread Davide Libenzi
This patch introduces a new system call for timer events delivered
through file descriptors. This allows timer events to be used with
standard POSIX poll(2), select(2) and read(2). As a consequence of
supporting the Linux f_op->poll subsystem, they can be used with
epoll(2) too.
The system call is defined as:

int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);

The "ufd" parameter allows for re-use (re-programming) of an existing
timerfd without going through the close/open cycle (same as signalfd).
If "ufd" is -1, a new file descriptor will be created, otherwise the
existing "ufd" will be re-programmed.
The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME.
The time specified in the "utmr->it_value" parameter is the expiry
time for the timer.
If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute
time, otherwise it's a relative time.
If the time specified in "utmr->it_interval" is not zero (zero being
tv_sec == 0 and tv_nsec == 0), this is the period at which the following
ticks should be generated.
The "utmr->it_interval" should be set to zero if only one tick is requested.
Setting the "utmr->it_value" to zero will disable the timer, or will create
a timerfd without the timer enabled.
The function returns the new (or same, in case "ufd" is a valid timerfd
descriptor) file descriptor, or -1 in case of error.
As stated before, the timerfd file descriptor supports poll(2), select(2)
and epoll(2). When a timer event happens on the timerfd, a POLLIN mask
will be returned.
The read(2) call can be used, and it will return a u32 variable holding
the number of "ticks" that happened on the interface since the last call
to read(2). The read(2) call supports the O_NONBLOCK flag too, and EAGAIN
will be returned if no ticks happened.
A quick test program shows timerfd working correctly on my amd64 box:

http://www.xmailserver.org/timerfd-test.c




Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.21-rc3.quilt/fs/timerfd.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc3.quilt/fs/timerfd.c 2007-03-19 16:41:33.0 -0700
@@ -0,0 +1,243 @@
+/*
+ *  fs/timerfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *
+ *
+ *  Thanks to Thomas Gleixner for code reviews and useful comments.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+struct timerfd_ctx {
+   struct hrtimer tmr;
+   ktime_t tintv;
+   spinlock_t lock;
+   wait_queue_head_t wqh;
+   unsigned long ticks;
+};
+
+
+static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr);
+static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+ const struct itimerspec *ktmr);
+static int timerfd_close(struct inode *inode, struct file *file);
+static unsigned int timerfd_poll(struct file *file, poll_table *wait);
+static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
+   loff_t *ppos);
+
+
+
+static const struct file_operations timerfd_fops = {
+   .release= timerfd_close,
+   .poll   = timerfd_poll,
+   .read   = timerfd_read,
+};
+static struct kmem_cache *timerfd_ctx_cachep;
+
+
+
+static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
+{
+   struct timerfd_ctx *ctx = container_of(htmr, struct timerfd_ctx, tmr);
+   enum hrtimer_restart rval = HRTIMER_NORESTART;
+   unsigned long flags;
+
+   spin_lock_irqsave(&ctx->lock, flags);
+   ctx->ticks++;
+   wake_up_locked(&ctx->wqh);
+   if (ctx->tintv.tv64 != 0) {
+   hrtimer_forward(htmr, hrtimer_cb_get_time(htmr), ctx->tintv);
+   rval = HRTIMER_RESTART;
+   }
+   spin_unlock_irqrestore(&ctx->lock, flags);
+
+   return rval;
+}
+
+static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+ const struct itimerspec *ktmr)
+{
+   enum hrtimer_mode htmode;
+   ktime_t texp;
+
+   htmode = (flags & TFD_TIMER_ABSTIME) ? HRTIMER_MODE_ABS: HRTIMER_MODE_REL;
+
+   texp = timespec_to_ktime(ktmr->it_value);
+   ctx->ticks = 0;
+   ctx->tintv = timespec_to_ktime(ktmr->it_interval);
+   hrtimer_init(&ctx->tmr, clockid, htmode);
+   ctx->tmr.expires = texp;
+   ctx->tmr.function = timerfd_tmrproc;
+   if (texp.tv64 != 0)
+   hrtimer_start(&ctx->tmr, texp, htmode);
+}
+
+asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
+   const struct itimerspec __user *utmr)
+{
+   int error;
+   struct timerfd_ctx *ctx;
+   struct file *file;
+   struct inode *inode;
+   struct itimerspec ktmr;
+
+   if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
+   return -EFAULT;
+
+   if (clockid != 

[patch 1/13] signal/timer/event fds v7 - anonymous inode source ...

2007-03-19 Thread Davide Libenzi
This patch adds an anonymous inode source, to be used for files that need
an inode only in order to create a file*. We do not care about having an
inode for each file, and we do not even care about having different names in
the associated dentries (dentry names will be the same for classes of file*).
This allows code reuse, and will be used by epoll, signalfd and timerfd
(and whatever else there'll be).



Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.21-rc3.quilt/fs/anon_inodes.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc3.quilt/fs/anon_inodes.c 2007-03-18 13:32:52.0 
-0700
@@ -0,0 +1,204 @@
+/*
+ *  fs/anon_inodes.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *
+ *  Thanks to Arnd Bergmann for code review and suggestions.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+static int ainofs_delete_dentry(struct dentry *dentry);
+static struct inode *aino_getinode(void);
+static struct inode *aino_mkinode(void);
+static int ainofs_get_sb(struct file_system_type *fs_type, int flags,
+const char *dev_name, void *data, struct vfsmount *mnt);
+
+
+
+static struct vfsmount *aino_mnt __read_mostly;
+static struct inode *aino_inode;
+static const struct file_operations aino_fops = { };
+static struct file_system_type aino_fs_type = {
+   .name   = "ainofs",
+   .get_sb = ainofs_get_sb,
+   .kill_sb= kill_anon_super,
+};
+static struct dentry_operations ainofs_dentry_operations = {
+   .d_delete   = ainofs_delete_dentry,
+};
+
+
+
+/**
+ * aino_getfd - creates a new file instance by hooking it up to an anonymous
+ *  inode, and a dentry that describes the "class" of the file
+ * @pfd: [out]   pointer to the file descriptor
+ * @pinode:  [out]   pointer to the inode
+ * @pfile:   [out]   pointer to the file struct
+ * @name:[in]name of the "class" of the new file
+ * @fops [in]file operations for the new file
+ * @priv [in]private data for the new file (will be file's private_data)
+ *
+ * Creates a new file by hooking it on a single inode. This is useful for files
+ * that do not need to have a full-fledged inode in order to operate correctly.
+ * All the files created with aino_getfd() will share a single inode, hence
+ * saving memory and avoiding code duplication for the file/inode/dentry setup.
+ */
+int aino_getfd(int *pfd, struct inode **pinode, struct file **pfile,
+  char const *name, const struct file_operations *fops, void *priv)
+{
+   struct qstr this;
+   struct dentry *dentry;
+   struct inode *inode;
+   struct file *file;
+   int error, fd;
+
+   error = -ENFILE;
+   file = get_empty_filp();
+   if (!file)
+   goto eexit_1;
+
+   inode = aino_getinode();
+   if (IS_ERR(inode)) {
+   error = PTR_ERR(inode);
+   goto eexit_2;
+   }
+
+   error = get_unused_fd();
+   if (error < 0)
+   goto eexit_3;
+   fd = error;
+
+   /*
+* Link the inode to a directory entry by creating a unique name
+* using the inode sequence number.
+*/
+   error = -ENOMEM;
+   this.name = name;
+   this.len = strlen(name);
+   this.hash = 0;
+   dentry = d_alloc(aino_mnt->mnt_sb->s_root, &this);
+   if (!dentry)
+   goto eexit_4;
+   dentry->d_op = &ainofs_dentry_operations;
+   /* Do not publish this dentry inside the global dentry hash table */
+   dentry->d_flags &= ~DCACHE_UNHASHED;
+   d_instantiate(dentry, inode);
+
+   file->f_path.mnt = mntget(aino_mnt);
+   file->f_path.dentry = dentry;
+   file->f_mapping = inode->i_mapping;
+
+   file->f_pos = 0;
+   file->f_flags = O_RDWR;
+   file->f_op = fops;
+   file->f_mode = FMODE_READ | FMODE_WRITE;
+   file->f_version = 0;
+   file->private_data = priv;
+
+   fd_install(fd, file);
+
+   *pfd = fd;
+   *pinode = inode;
+   *pfile = file;
+   return 0;
+
+eexit_4:
+   put_unused_fd(fd);
+eexit_3:
+   iput(inode);
+eexit_2:
+   put_filp(file);
+eexit_1:
+   return error;
+}
+
+static int ainofs_delete_dentry(struct dentry *dentry)
+{
+   /*
+* We faked vfs to believe the dentry was hashed when we created it.
+* Now we restore the flag so that dput() will work correctly.
+*/
+   dentry->d_flags |= DCACHE_UNHASHED;
+   return 1;
+}
+
+static struct inode *aino_getinode(void)
+{
+   return igrab(aino_inode);
+}
+
+/*
+ * A single inode exists for all aino files. Unlike pipes, aino inodes
+ * have no per-instance data associated, so we can avoid allocating
+ * more than one of them.
+ */
+static struct inode *aino_mkinode(void)
+{
+   int error = -ENOMEM;
+   

[patch 3/13] signal/timer/event fds v7 - signalfd wire up i386 arch ...

2007-03-19 Thread Davide Libenzi
This patch wires the signalfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/i386/kernel/syscall_table.S
2007-03-19 16:03:26.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S 2007-03-19 
16:41:29.0 -0700
@@ -319,3 +319,4 @@
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_signalfd  /* 320 */
Index: linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-i386/unistd.h   2007-03-19 
16:03:26.0 -0700
+++ linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h2007-03-19 
16:41:29.0 -0700
@@ -325,10 +325,11 @@
 #define __NR_move_pages317
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
+#define __NR_signalfd  320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR



Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Andreas Steinmetz
Andrew Morton wrote:
> On Tue, 20 Mar 2007 00:25:02 +0100
> Andreas Steinmetz <[EMAIL PROTECTED]> wrote:
> 
>> Mike Christie wrote:
>>> Mike Christie wrote:
>>>> James Bottomley wrote:
>>>>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>>>>>> I can't even say if the tapes are written correctly as I can't read them
>>>>>>> (one does not reboot production machines back to 2.4.x just to try to
>>>>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>>>>>> machines).
>>>>>> Could you try this patch
>>>>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>>>>>> I thought st was modified to not send offsets in the last elements but
>>>>>> it looks like it wasn't.
>>>>> Actually, there are two patches in the email referred to.  If the
>>>>> analysis that we're passing NULL to mempool_free is correct, it should
>>>>> be the second one that fixes the problem (the one that checks
>>>>> bio->bi_io_vec before freeing it).  Which would mean we have a
>>>>> nr_vecs==0 bio generated by the tar somehow.
>>>>>
>>>> I think we might only need the first patch if the problem is similar to
>>>> what the lsi guys were seeing. I thought the problem is that we are not
>>>> estimating how large the transfer is correctly because we do not take
>>>> into account offsets at the end. This results in nr_vecs being zero when
>>>> it should be a valid value. I thought Kai's patch:
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=7919
>>>> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
>>>> fixed the problem on st's side,
>>> Oh, I noticed that the subject for the mail references 2.6.20.3 and the
>>> patch for st in the bugzilla did not make it into 2.6.20 and is not in .3.
>>> Could we try the st patch in the bugzilla first?
>> Ok, the st patch from bugzilla solves the problem (tested on both
>> affected machines).
> 
> 
> If you're referring to the below patch then it's already in mainline, and
> has been for a month.
> 

Yes, that's the patch I'm referring to.

> Have you tested 2.6.21-rc4?  If not, please do so.
> 

Sorry, this is not possible on these machines. They are production
servers and every problem on them that cannot be easily solved via
remote access is a 40km (one way) drive in the middle of the night.

> Perhaps we should merge this into 2.6.20.x?
> 

I would suggest so.

> 
> 
> commit 9abe16c670bd3d4ab5519257514f9f291383d104
> Author: Kai Makisara <[EMAIL PROTECTED]>
> Date:   Sat Feb 3 13:21:29 2007 +0200
> 
> [SCSI] st: fix Tape dies if wrong block size used, bug 7919
> 
> On Thu, 1 Feb 2007, Andrew Morton wrote:
> > On Thu, 1 Feb 2007 15:34:29 -0800
> > [EMAIL PROTECTED] wrote:
> >
> > > http://bugzilla.kernel.org/show_bug.cgi?id=7919
> > >
> > >Summary: Tape dies if wrong block size used
> > > Kernel Version: 2.6.20-rc5
> > > Status: NEW
> > >   Severity: normal
> > >  Owner: [EMAIL PROTECTED]
> > >  Submitter: [EMAIL PROTECTED]
> > >
> > >
> > > Most recent kernel where this bug did *NOT* occur: 2.6.17.14
> > >
> > > Other Kernels Tested and Results:
> > >
> > > OK 2.6.15.7
> > > OK 2.6.16.37
> > > OK 2.6.17.14
> > > BAD 2.6.18.6
> > > BAD 2.6.18-1.2869.fc6
> > > BAD 2.6.19.2 +
> > > BAD 2.6.20-rc5
> > >
> > > NOTE: 2.6.18-1.2869.fc6 is a Fedora modified kernel, all others are 
> from kernel.org
> > >
> ...
> > > Steps to reproduce:
> > > Get a Adaptec AHA-2940U/UW/D / AIC-7881U card and a tape drive,
> > > install a recent kernel
> > > set the tape block size - mt setblk 4096
> > > read from or write to tape using wrong block size - tar -b 7 -cvf 
> /dev/tape foo
> > >
> Write does not trigger this bug because the driver refuses in fixed block
> mode writes that are not a multiple of the block size. Read does trigger
> it in my system.
> 
> The bug is not associated with any specific HBA. st tries to do direct i/o
> in fixed block mode with reads that are not a multiple of tape block size.
> 
> The patch in this message fixes the st problem by switching to using the
> driver buffer up to the next close of the device file in fixed block mode
> if the user asks for a read like this.
> 
> I don't know why the bug has surfaced only after 2.6.17 although the st
> problem is old. There may be another bug in the block subsystem and this
> patch works around it. However, the patch fixes a problem in st and in
> this way it is a valid fix.
> 
> This patch may also fix the bug 7900.
> 
> The patch compiles and is lightly tested.
> 
> Signed-off-by: Kai Makisara <[EMAIL PROTECTED]>
> Signed-off-by: James Bottomley <[EMAIL PROTECTED]>
> 
> diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> index 

Re: [PATCH] Complain about missing system calls.

2007-03-19 Thread Andrew Morton
On Thu, 08 Mar 2007 23:01:13 +
David Woodhouse <[EMAIL PROTECTED]> wrote:

> Most system calls seem to get added to i386 first. This patch
> automatically generates a warning for any new system call which is
> implemented on i386 but not the architecture currently being compiled.
> On PowerPC at the moment, for example, it results in these warnings:
> init/missing_syscalls.h:935:3: warning: #warning syscall sync_file_range not implemented
> init/missing_syscalls.h:947:3: warning: #warning syscall getcpu not implemented
> init/missing_syscalls.h:950:3: warning: #warning syscall epoll_pwait not implemented

hm, did you try running this on x86_64?

In file included from init/missing_syscalls.c:73:
init/missing_syscalls.h:23:3: warning: #warning syscall waitpid not implemented
init/missing_syscalls.h:68:3: warning: #warning syscall umount not implemented
init/missing_syscalls.h:77:3: warning: #warning syscall stime not implemented
init/missing_syscalls.h:104:3: warning: #warning syscall nice not implemented
init/missing_syscalls.h:146:3: warning: #warning syscall signal not implemented
init/missing_syscalls.h:203:3: warning: #warning syscall sigaction not implemented
init/missing_syscalls.h:206:3: warning: #warning syscall sgetmask not implemented
init/missing_syscalls.h:209:3: warning: #warning syscall ssetmask not implemented
init/missing_syscalls.h:218:3: warning: #warning syscall sigsuspend not implemented
init/missing_syscalls.h:221:3: warning: #warning syscall sigpending not implemented
init/missing_syscalls.h:269:3: warning: #warning syscall readdir not implemented
init/missing_syscalls.h:308:3: warning: #warning syscall socketcall not implemented
init/missing_syscalls.h:353:3: warning: #warning syscall ipc not implemented
init/missing_syscalls.h:359:3: warning: #warning syscall sigreturn not implemented
init/missing_syscalls.h:380:3: warning: #warning syscall sigprocmask not implemented
init/missing_syscalls.h:404:3: warning: #warning syscall bdflush not implemented
init/missing_syscalls.h:422:3: warning: #warning syscall _llseek not implemented
init/missing_syscalls.h:428:3: warning: #warning syscall _newselect not implemented
init/missing_syscalls.h:800:3: warning: #warning syscall statfs64 not implemented
init/missing_syscalls.h:803:3: warning: #warning syscall fstatfs64 not implemented
init/missing_syscalls.h:947:3: warning: #warning syscall getcpu not implemented
init/missing_syscalls.h:950:3: warning: #warning syscall epoll_pwait not implemented
init/missing_syscalls.h:953:3: warning: #warning syscall lutimesat not implemented
init/missing_syscalls.h:956:3: warning: #warning syscall revokeat not implemented
init/missing_syscalls.h:959:3: warning: #warning syscall frevoke not implemented


Re: kref refcounting breakage in mainline

2007-03-19 Thread Randy Dunlap
On Thu, 15 Mar 2007 07:54:14 -0700 Greg KH wrote:

> On Thu, Mar 15, 2007 at 11:19:20AM +0100, Mike Galbraith wrote:
> > On Thu, 2007-03-15 at 01:06 -0700, Greg KH wrote:
> > 
> > > That's good.  But why don't we have a module name for this driver?
> > > 
> > > And if we don't have a module name, why would there be a symlink to
> > > remove?  That's what is keeping your module from unloading, right?
> > 
> > You keep saying "module", and that's making me a bit nervous ;-)
> > 
> > Just to be sure we're not talking past each other, when you say module,
> > don't mean the modprobe kind... i hope.  This "module" as in driver is
> > compiled in.  (said that before, but you may have missed it)
> 
> Ahh, that changes everything here, thanks for letting me know, I had
> missed this.
> 
> The problem is that the module_init() is failing, yet this isn't really
> a module, it's built into the kernel.  So some of the module teardown
> logic is dying when it thinks that we really have a full module
> structure here (owner and such).

Urgh, it's not a "loadable" module, but it's still a logical module.

> I'll look at this further tomorrow, as I'm travelling pretty much all
> day today, sorry.


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable

2007-03-19 Thread Andi Kleen
> Possibly not, but I'd like to be able to say with confidence that
> running a PARAVIRT kernel on bare hardware has no performance loss
> compared to running a !PARAVIRT kernel.  There's the case of small
> instruction sequences which have been replaced with calls (such as
> sti/cli/push;popf/etc), 

My guess is that most critical pushf/popf are in spin_lock_irqsave(). It would 
be possible to special case that one -- inline it -- and use out of line
versions for all the others.

-Andi


[QUICKLIST 3/5] Quicklist support for i386

2007-03-19 Thread Christoph Lameter
i386: Convert to quicklists

Implement the i386 management of pgd and pmds using quicklists.

The i386 management of page table pages currently uses page sized slabs.
Getting rid of that using quicklists allows full use of the page flags
and the page->lru. So get rid of the improvised linked lists using
page->index and page->private.
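For readers unfamiliar with the list macros, here is a minimal userspace sketch of the intrusive list that page->lru provides. It is not the kernel code: struct fake_page, list_init() and list_len() are hypothetical names used for illustration, and only the shape of list_add()/list_del() follows the kernel helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (illustrative only). */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical stand-in for struct page: only the embedded node matters. */
struct fake_page {
    int id;
    struct list_head lru;
};

/* An empty list is a head that points at itself. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert 'node' right after 'head'. */
static void list_add(struct list_head *node, struct list_head *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* Unlink 'node' in constant time; no list walk and no special case for the head. */
static void list_del(struct list_head *node)
{
    assert(node->prev != NULL && node->next != NULL);
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = NULL;
    node->prev = NULL;
}

/* Count entries; used only to inspect the list state. */
static int list_len(const struct list_head *head)
{
    int n = 0;
    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

Compared with the page->index / page->private scheme, the node carries both links itself, so removal needs no pointer-to-pointer bookkeeping.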

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3-mm2/arch/i386/mm/init.c
===
--- linux-2.6.21-rc3-mm2.orig/arch/i386/mm/init.c   2007-03-19 15:54:28.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/mm/init.c    2007-03-19 15:55:33.0 -0700
@@ -695,31 +695,6 @@ int remove_memory(u64 start, u64 size)
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif
 
-struct kmem_cache *pgd_cache;
-struct kmem_cache *pmd_cache;
-
-void __init pgtable_cache_init(void)
-{
-   if (PTRS_PER_PMD > 1) {
-   pmd_cache = kmem_cache_create("pmd",
-   PTRS_PER_PMD*sizeof(pmd_t),
-   PTRS_PER_PMD*sizeof(pmd_t),
-   0,
-   pmd_ctor,
-   NULL);
-   if (!pmd_cache)
-   panic("pgtable_cache_init(): cannot create pmd cache");
-   }
-   pgd_cache = kmem_cache_create("pgd",
-   PTRS_PER_PGD*sizeof(pgd_t),
-   PTRS_PER_PGD*sizeof(pgd_t),
-   0,
-   pgd_ctor,
-   PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
-   if (!pgd_cache)
-   panic("pgtable_cache_init(): Cannot create pgd cache");
-}
-
 /*
  * This function cannot be __init, since exceptions don't work in that
  * section.  Put this after the callers, so that it cannot be inlined.
Index: linux-2.6.21-rc3-mm2/arch/i386/mm/pgtable.c
===
--- linux-2.6.21-rc3-mm2.orig/arch/i386/mm/pgtable.c    2007-03-19 15:54:28.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/mm/pgtable.c 2007-03-19 15:59:37.0 -0700
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -198,11 +199,6 @@ struct page *pte_alloc_one(struct mm_str
return pte;
 }
 
-void pmd_ctor(void *pmd, struct kmem_cache *cache, unsigned long flags)
-{
-   memset(pmd, 0, PTRS_PER_PMD*sizeof(pmd_t));
-}
-
 /*
  * List of all pgd's needed for non-PAE so it can invalidate entries
  * in both cached and uncached pgd's; not needed for PAE since the
@@ -211,36 +207,18 @@ void pmd_ctor(void *pmd, struct kmem_cac
  * against pageattr.c; it is the unique case in which a valid change
  * of kernel pagetables can't be lazily synchronized by vmalloc faults.
  * vmalloc faults work because attached pagetables are never freed.
- * The locking scheme was chosen on the basis of manfred's
- * recommendations and having no core impact whatsoever.
  * -- wli
  */
 DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
-
-static inline void pgd_list_add(pgd_t *pgd)
-{
-   struct page *page = virt_to_page(pgd);
-   page->index = (unsigned long)pgd_list;
-   if (pgd_list)
-   set_page_private(pgd_list, (unsigned long)&page->index);
-   pgd_list = page;
-   set_page_private(page, (unsigned long)&pgd_list);
-}
+LIST_HEAD(pgd_list);
 
-static inline void pgd_list_del(pgd_t *pgd)
-{
-   struct page *next, **pprev, *page = virt_to_page(pgd);
-   next = (struct page *)page->index;
-   pprev = (struct page **)page_private(page);
-   *pprev = next;
-   if (next)
-   set_page_private(next, (unsigned long)pprev);
-}
+#define QUICK_PGD 0
+#define QUICK_PMD 1
 
-void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+void pgd_ctor(void *pgd)
 {
unsigned long flags;
+   struct page *page = virt_to_page(pgd);
 
if (PTRS_PER_PMD == 1) {
memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
@@ -259,31 +237,32 @@ void pgd_ctor(void *pgd, struct kmem_cac
__pa(swapper_pg_dir) >> PAGE_SHIFT,
USER_PTRS_PER_PGD, PTRS_PER_PGD - USER_PTRS_PER_PGD);
 
-   pgd_list_add(pgd);
+   list_add(&page->lru, &pgd_list);
spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
 /* never called when PTRS_PER_PMD > 1 */
-void pgd_dtor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+void pgd_dtor(void *pgd)
 {
unsigned long flags; /* can be called from interrupt context */
+   struct page *page = virt_to_page(pgd);
 
paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
spin_lock_irqsave(&pgd_lock, flags);
-   pgd_list_del(pgd);
+   list_del(&page->lru);
spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
int i;
-   pgd_t *pgd = 

[QUICKLIST 4/5] Quicklist support for x86_64

2007-03-19 Thread Christoph Lameter
Convert x86_64 to use quicklists

This adds caching of pgds, puds, pmds, and ptes. That way we can
avoid costly zeroing and initialization of special mappings in the
pgd.

A second quicklist is used to separate out PGD handling. Thus we can carry
the initialized pgds of terminating processes over to the next process
needing them.

Also clean up the pgd_list handling to use regular list macros. Not using
the slab allocator frees up the lru field so we can use regular list macros.

The addition and removal of pgds to and from the pgd list is moved into
the constructor / destructor. We can then avoid moving pgds that are
still in the quicklists off the list, further reducing the pgd creation
and allocation overhead.
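The caching behaviour described above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the patch's implementation: every sketch_* name and the counting_* helpers are hypothetical, a single list replaces the per-CPU lists, and calloc()/free() stand in for the page allocator. The property it models is that the constructor runs only for fresh pages and the destructor only when a page really leaves the cache.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical single quicklist; a free page stores the link in its first word. */
struct sketch_quicklist {
    void *head;
    long nr_pages;
};

typedef void (*sketch_ctor_t)(void *page);
typedef void (*sketch_dtor_t)(void *page);

/* Counters so a caller can observe when ctor/dtor actually run. */
static int ctor_calls, dtor_calls;
static void counting_ctor(void *page) { (void)page; ctor_calls++; }
static void counting_dtor(void *page) { (void)page; dtor_calls++; }

/* Reuse a cached page if one exists; otherwise get a zeroed page and construct it. */
static void *sketch_quicklist_alloc(struct sketch_quicklist *q, sketch_ctor_t ctor)
{
    void **page = q->head;

    if (page) {
        q->head = page[0];
        page[0] = NULL;      /* clear the word that was used as the link */
        q->nr_pages--;
        return page;
    }
    page = calloc(1, SKETCH_PAGE_SIZE);
    if (page && ctor)
        ctor(page);
    return page;
}

/* Cache the page instead of releasing it; no destructor runs here. */
static void sketch_quicklist_free(struct sketch_quicklist *q, void *p)
{
    void **page = p;

    page[0] = q->head;
    q->head = page;
    q->nr_pages++;
}

/* Release cached pages until at most 'keep' remain; the destructor runs only now. */
static void sketch_quicklist_trim(struct sketch_quicklist *q, long keep,
                                  sketch_dtor_t dtor)
{
    while (q->nr_pages > keep) {
        void **page = q->head;

        assert(page != NULL);
        q->head = page[0];
        q->nr_pages--;
        if (dtor)
            dtor(page);
        free(page);
    }
}
```

In this model a page that cycles through free and alloc keeps whatever the constructor set up, which is why a pgd can stay on the pgd list while it sits in the quicklist.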

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig
===
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/Kconfig   2007-03-13 00:09:50.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig    2007-03-15 22:00:04.0 -0700
@@ -56,6 +56,14 @@ config ZONE_DMA
bool
default y
 
+config QUICKLIST
+   bool
+   default y
+
+config NR_QUICK
+   int
+   default 2
+
 config ISA
bool
 
Index: linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h
===
--- linux-2.6.21-rc3-mm2.orig/include/asm-x86_64/pgalloc.h  2007-03-13 00:09:50.0 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h   2007-03-15 21:59:31.0 -0700
@@ -4,6 +4,10 @@
 #include 
 #include 
 #include 
+#include 
+
+#define QUICK_PGD 0/* We preserve special mappings over free */
+#define QUICK_PT 1 /* Other page table pages that are zero on free */
 
 #define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
@@ -20,86 +24,77 @@ static inline void pmd_populate(struct m
 static inline void pmd_free(pmd_t *pmd)
 {
BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
-   free_page((unsigned long)pmd);
+   quicklist_free(QUICK_PT, NULL, pmd);
 }
 
 static inline pmd_t *pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-   return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline void pud_free (pud_t *pud)
 {
BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-   free_page((unsigned long)pud);
+   quicklist_free(QUICK_PT, NULL, pud);
 }
 
-static inline void pgd_list_add(pgd_t *pgd)
+static inline void pgd_ctor(void *x)
 {
+   unsigned boundary;
+   pgd_t *pgd = x;
struct page *page = virt_to_page(pgd);
 
+   /*
+* Copy kernel pointers in from init.
+*/
+   boundary = pgd_index(__PAGE_OFFSET);
+   memcpy(pgd + boundary,
+   init_level4_pgt + boundary,
+   (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+
spin_lock(&pgd_lock);
-   page->index = (pgoff_t)pgd_list;
-   if (pgd_list)
-   pgd_list->private = (unsigned long)&page->index;
-   pgd_list = page;
-   page->private = (unsigned long)&pgd_list;
+   list_add(&page->lru, &pgd_list);
spin_unlock(&pgd_lock);
 }
 
-static inline void pgd_list_del(pgd_t *pgd)
+static inline void pgd_dtor(void *x)
 {
-   struct page *next, **pprev, *page = virt_to_page(pgd);
+   pgd_t *pgd = x;
+   struct page *page = virt_to_page(pgd);
 
spin_lock(&pgd_lock);
-   next = (struct page *)page->index;
-   pprev = (struct page **)page->private;
-   *pprev = next;
-   if (next)
-   next->private = (unsigned long)pprev;
+   list_del(&page->lru);
spin_unlock(&pgd_lock);
 }
 
+
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   unsigned boundary;
-   pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-   if (!pgd)
-   return NULL;
-   pgd_list_add(pgd);
-   /*
-* Copy kernel pointers in from init.
-* Could keep a freelist or slab cache of those because the kernel
-* part never changes.
-*/
-   boundary = pgd_index(__PAGE_OFFSET);
-   memset(pgd, 0, boundary * sizeof(pgd_t));
-   memcpy(pgd + boundary,
-  init_level4_pgt + boundary,
-  (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+   pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
+GFP_KERNEL|__GFP_REPEAT, pgd_ctor);
+
return pgd;
 }
 
 static inline void pgd_free(pgd_t *pgd)
 {
BUG_ON((unsigned long)pgd & (PAGE_SIZE-1));
-   pgd_list_del(pgd);
-   free_page((unsigned long)pgd);
+   quicklist_free(QUICK_PGD, pgd_dtor, pgd);
 }

Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Andrew Morton
On Tue, 20 Mar 2007 00:25:02 +0100
Andreas Steinmetz <[EMAIL PROTECTED]> wrote:

> Mike Christie wrote:
> > Mike Christie wrote:
> >> James Bottomley wrote:
> >>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
> > I can't even say if the tapes are written correctly as I can't read them
> > (one does not reboot production machines back to 2.4.x just to try to
> > read a backup tape - I don't have 2.6.x older than 2.6.20 on these
> > machines).
>  Could you try this patch
>  http://marc.info/?l=linux-scsi=116464965414878=2
>  I thought st was modified to not send offsets in the last elements but
>  it looks like it wasn't.
> >>> Actually, there are two patches in the email referred to.  If the
> >>> analysis that we're passing NULL to mempool_free is correct, it should
> >>> be the second one that fixes the problem (the one that checks
> >>> bio->bi_io_vec before freeing it).  Which would mean we have a
> >>> nr_vecs==0 bio generated by the tar somehow.
> >>>
> >> I think we might only need the first patch if the problem is similar to
> >> what the lsi guys were seeing. I thought the problem is that we are not
> >> estimating how large the transfer is correctly because we do not take
> >> into account offsets at the end. This results in nr_vecs being zero when
> >> it should be a valid value. I thought Kai's patch:
> >> http://bugzilla.kernel.org/show_bug.cgi?id=7919
> >> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
> >> fixed the problem on st's side,
> > 
> > Oh, I noticed that the subject for the mail references 2.6.20.3 and the
> > patch for st in the bugzilla did not make it into 2.6.20 and is not in .3.
> > Could we try the st patch in the bugzilla first?
> 
> Ok, the st patch from bugzilla solves the problem (tested on both
> affected machines).


If you're referring to the below patch then it's already in mainline, and
has been for a month.

Have you tested 2.6.21-rc4?  If not, please do so.

Perhaps we should merge this into 2.6.20.x?



commit 9abe16c670bd3d4ab5519257514f9f291383d104
Author: Kai Makisara <[EMAIL PROTECTED]>
Date:   Sat Feb 3 13:21:29 2007 +0200

[SCSI] st: fix Tape dies if wrong block size used, bug 7919

On Thu, 1 Feb 2007, Andrew Morton wrote:
> On Thu, 1 Feb 2007 15:34:29 -0800
> [EMAIL PROTECTED] wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=7919
> >
> >Summary: Tape dies if wrong block size used
> > Kernel Version: 2.6.20-rc5
> > Status: NEW
> >   Severity: normal
> >  Owner: [EMAIL PROTECTED]
> >  Submitter: [EMAIL PROTECTED]
> >
> >
> > Most recent kernel where this bug did *NOT* occur: 2.6.17.14
> >
> > Other Kernels Tested and Results:
> >
> > OK 2.6.15.7
> > OK 2.6.16.37
> > OK 2.6.17.14
> > BAD 2.6.18.6
> > BAD 2.6.18-1.2869.fc6
> > BAD 2.6.19.2 +
> > BAD 2.6.20-rc5
> >
> > NOTE: 2.6.18-1.2869.fc6 is a Fedora modified kernel, all others are from kernel.org
> >
...
> > Steps to reproduce:
> > Get a Adaptec AHA-2940U/UW/D / AIC-7881U card and a tape drive,
> > install a recent kernel
> > set the tape block size - mt setblk 4096
> > read from or write to tape using wrong block size - tar -b 7 -cvf /dev/tape foo
> >
Write does not trigger this bug because the driver refuses in fixed block
mode writes that are not a multiple of the block size. Read does trigger
it in my system.

The bug is not associated with any specific HBA. st tries to do direct i/o
in fixed block mode with reads that are not a multiple of tape block size.

The patch in this message fixes the st problem by switching to using the
driver buffer up to the next close of the device file in fixed block mode
if the user asks for a read like this.
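To make the failing case from the bug report concrete: tar's -b option counts 512-byte units, so "tar -b 7" issues 3584-byte transfers, which is not a multiple of the 4096-byte block set with "mt setblk 4096". The following is a small sketch of that arithmetic only; tar_record_bytes() and is_whole_block_transfer() are hypothetical helpers, not functions from st.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* tar's -b option counts 512-byte units per record. */
#define TAR_UNIT_BYTES 512

/* Size in bytes of one tar record for a given blocking factor. */
static size_t tar_record_bytes(unsigned blocking_factor)
{
    return (size_t)blocking_factor * TAR_UNIT_BYTES;
}

/* Hypothetical helper: true when a transfer covers whole tape blocks. */
static bool is_whole_block_transfer(size_t count, size_t tape_block_size)
{
    assert(tape_block_size > 0);
    return count % tape_block_size == 0;
}
```

With these helpers, is_whole_block_transfer(tar_record_bytes(7), 4096) is false, while a blocking factor of 8 gives exactly one 4096-byte block.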

I don't know why the bug has surfaced only after 2.6.17 although the st
problem is old. There may be another bug in the block subsystem and this
patch works around it. However, the patch fixes a problem in st and in
this way it is a valid fix.

This patch may also fix the bug 7900.

The patch compiles and is lightly tested.

Signed-off-by: Kai Makisara <[EMAIL PROTECTED]>
Signed-off-by: James Bottomley <[EMAIL PROTECTED]>

diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index e016e09..fba8b20 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -9,7 +9,7 @@
Steve Hirsch, Andreas Koppenh"ofer, Michael Leodolter, Eyal Lebedinsky,
Michael Schaefer, J"org Weule, and Eric Youngdale.
 
-   Copyright 1992 - 2006 Kai Makisara
+   Copyright 1992 - 2007 Kai Makisara
email [EMAIL PROTECTED]
 
Some small formal changes - aeb, 950809
@@ -17,7 +17,7 @@
Last modified: 18-JAN-1998 Richard Gooch <[EMAIL PROTECTED]> Devfs 

[QUICKLIST 2/5] Quicklist support for IA64

2007-03-19 Thread Christoph Lameter
Quicklist for IA64

IA64 is the origin of the quicklist implementation. So cut out the pieces
that are now in core code and modify the functions called.
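The trimming policy that this patch deletes from arch/ia64/mm/init.c (max_pgt_pages(), min_pages_to_free() and check_pgt_cache() in the diff below) can be restated as plain arithmetic. The sketch below only mirrors the constants and formulas visible in the removed lines; it takes the counts as parameters instead of reading per-CPU and per-node state, and pages_to_free_this_pass() is a hypothetical name.

```c
#include <assert.h>

#define MIN_PGT_PAGES            25L
#define MAX_PGT_FREES_PER_PASS   16L
#define PGT_FRACTION_OF_NODE_MEM 16L

/* Cache limit: 1/16 of the node's free pages, but never below 25 pages. */
static long max_pgt_pages(long node_free_pages)
{
    assert(node_free_pages >= 0);
    long limit = node_free_pages / PGT_FRACTION_OF_NODE_MEM;
    return limit > MIN_PGT_PAGES ? limit : MIN_PGT_PAGES;
}

/*
 * Pages to release in one pass: the excess over the limit, capped at 16.
 * A result <= 0 means there is nothing to trim.
 */
static long pages_to_free_this_pass(long quicklist_size, long node_free_pages)
{
    long excess = quicklist_size - max_pgt_pages(node_free_pages);
    return excess < MAX_PGT_FREES_PER_PASS ? excess : MAX_PGT_FREES_PER_PASS;
}
```

For example, with 160 free pages on the node the limit is 25, so a quicklist of 100 pages gives up 16 pages per pass until it is back under the limit.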

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3-mm2/arch/ia64/mm/init.c
===
--- linux-2.6.21-rc3-mm2.orig/arch/ia64/mm/init.c   2007-03-16 02:15:24.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/ia64/mm/init.c    2007-03-16 02:33:46.0 -0700
@@ -39,9 +39,6 @@
 
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 
-DEFINE_PER_CPU(unsigned long *, __pgtable_quicklist);
-DEFINE_PER_CPU(long, __pgtable_quicklist_size);
-
 extern void ia64_tlb_init (void);
 
 unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x100000000UL;
@@ -56,54 +53,6 @@ EXPORT_SYMBOL(vmem_map);
 struct page *zero_page_memmap_ptr; /* map entry for zero page */
 EXPORT_SYMBOL(zero_page_memmap_ptr);
 
-#define MIN_PGT_PAGES  25UL
-#define MAX_PGT_FREES_PER_PASS 16L
-#define PGT_FRACTION_OF_NODE_MEM   16
-
-static inline long
-max_pgt_pages(void)
-{
-   u64 node_free_pages, max_pgt_pages;
-
-#ifndef CONFIG_NUMA
-   node_free_pages = nr_free_pages();
-#else
-   node_free_pages = node_page_state(numa_node_id(), NR_FREE_PAGES);
-#endif
-   max_pgt_pages = node_free_pages / PGT_FRACTION_OF_NODE_MEM;
-   max_pgt_pages = max(max_pgt_pages, MIN_PGT_PAGES);
-   return max_pgt_pages;
-}
-
-static inline long
-min_pages_to_free(void)
-{
-   long pages_to_free;
-
-   pages_to_free = pgtable_quicklist_size - max_pgt_pages();
-   pages_to_free = min(pages_to_free, MAX_PGT_FREES_PER_PASS);
-   return pages_to_free;
-}
-
-void
-check_pgt_cache(void)
-{
-   long pages_to_free;
-
-   if (unlikely(pgtable_quicklist_size <= MIN_PGT_PAGES))
-   return;
-
-   preempt_disable();
-   while (unlikely((pages_to_free = min_pages_to_free()) > 0)) {
-   while (pages_to_free--) {
-   free_page((unsigned long)pgtable_quicklist_alloc());
-   }
-   preempt_enable();
-   preempt_disable();
-   }
-   preempt_enable();
-}
-
 void
 lazy_mmu_prot_update (pte_t pte)
 {
Index: linux-2.6.21-rc3-mm2/include/asm-ia64/pgalloc.h
===
--- linux-2.6.21-rc3-mm2.orig/include/asm-ia64/pgalloc.h    2007-03-16 02:15:24.0 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-ia64/pgalloc.h 2007-03-16 02:33:46.0 -0700
@@ -18,71 +18,18 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
-DECLARE_PER_CPU(unsigned long *, __pgtable_quicklist);
-#define pgtable_quicklist __ia64_per_cpu_var(__pgtable_quicklist)
-DECLARE_PER_CPU(long, __pgtable_quicklist_size);
-#define pgtable_quicklist_size __ia64_per_cpu_var(__pgtable_quicklist_size)
-
-static inline long pgtable_quicklist_total_size(void)
-{
-   long ql_size = 0;
-   int cpuid;
-
-   for_each_online_cpu(cpuid) {
-   ql_size += per_cpu(__pgtable_quicklist_size, cpuid);
-   }
-   return ql_size;
-}
-
-static inline void *pgtable_quicklist_alloc(void)
-{
-   unsigned long *ret = NULL;
-
-   preempt_disable();
-
-   ret = pgtable_quicklist;
-   if (likely(ret != NULL)) {
-   pgtable_quicklist = (unsigned long *)(*ret);
-   ret[0] = 0;
-   --pgtable_quicklist_size;
-   preempt_enable();
-   } else {
-   preempt_enable();
-   ret = (unsigned long *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
-   }
-
-   return ret;
-}
-
-static inline void pgtable_quicklist_free(void *pgtable_entry)
-{
-#ifdef CONFIG_NUMA
-   int nid = page_to_nid(virt_to_page(pgtable_entry));
-
-   if (unlikely(nid != numa_node_id())) {
-   free_page((unsigned long)pgtable_entry);
-   return;
-   }
-#endif
-
-   preempt_disable();
-   *(unsigned long *)pgtable_entry = (unsigned long)pgtable_quicklist;
-   pgtable_quicklist = (unsigned long *)pgtable_entry;
-   ++pgtable_quicklist_size;
-   preempt_enable();
-}
-
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   return pgtable_quicklist_alloc();
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pgd_free(pgd_t * pgd)
 {
-   pgtable_quicklist_free(pgd);
+   quicklist_free(0, NULL, pgd);
 }
 
 #ifdef CONFIG_PGTABLE_4
@@ -94,12 +41,12 @@ pgd_populate(struct mm_struct *mm, pgd_t
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return pgtable_quicklist_alloc();
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pud_free(pud_t * pud)
 {
-   pgtable_quicklist_free(pud);
+   quicklist_free(0, NULL, pud);
 }
 #define __pud_free_tlb(tlb, pud)   pud_free(pud)
 #endif /* CONFIG_PGTABLE_4 */
@@ -112,12 +59,12 @@ 

[QUICKLIST 5/5] Quicklist support for sparc64

2007-03-19 Thread Christoph Lameter
From: David Miller <[EMAIL PROTECTED]>

[QUICKLIST]: Add sparc64 quicklist support.

I ported this to sparc64 as per the patch below, tested on
UP SunBlade1500 and 24 cpu Niagara T1000.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3-mm2/arch/sparc64/Kconfig
===
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/Kconfig  2007-03-13 00:09:30.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/Kconfig   2007-03-15 22:01:40.0 -0700
@@ -26,6 +26,10 @@ config MMU
bool
default y
 
+config QUICKLIST
+   bool
+   default y
+
 config STACKTRACE_SUPPORT
bool
default y
Index: linux-2.6.21-rc3-mm2/arch/sparc64/mm/init.c
===
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/mm/init.c    2007-03-13 00:09:30.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/mm/init.c 2007-03-15 22:00:44.0 -0700
@@ -176,30 +176,6 @@ unsigned long sparc64_kern_sec_context _
 
 int bigkernel = 0;
 
-struct kmem_cache *pgtable_cache __read_mostly;
-
-static void zero_ctor(void *addr, struct kmem_cache *cache, unsigned long flags)
-{
-   clear_page(addr);
-}
-
-extern void tsb_cache_init(void);
-
-void pgtable_cache_init(void)
-{
-   pgtable_cache = kmem_cache_create("pgtable_cache",
- PAGE_SIZE, PAGE_SIZE,
- SLAB_HWCACHE_ALIGN |
- SLAB_MUST_HWCACHE_ALIGN,
- zero_ctor,
- NULL);
-   if (!pgtable_cache) {
-   prom_printf("Could not create pgtable_cache\n");
-   prom_halt();
-   }
-   tsb_cache_init();
-}
-
 #ifdef CONFIG_DEBUG_DCFLUSH
 atomic_t dcpage_flushes = ATOMIC_INIT(0);
 #ifdef CONFIG_SMP
Index: linux-2.6.21-rc3-mm2/arch/sparc64/mm/tsb.c
===
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/mm/tsb.c 2007-03-13 00:09:30.0 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/mm/tsb.c  2007-03-15 22:00:44.0 -0700
@@ -252,7 +252,7 @@ static const char *tsb_cache_names[8] = 
"tsb_1MB",
 };
 
-void __init tsb_cache_init(void)
+void __init pgtable_cache_init(void)
 {
unsigned long i;
 
Index: linux-2.6.21-rc3-mm2/include/asm-sparc64/pgalloc.h
===
--- linux-2.6.21-rc3-mm2.orig/include/asm-sparc64/pgalloc.h 2007-03-13 00:09:30.0 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-sparc64/pgalloc.h  2007-03-15 22:00:44.0 -0700
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -13,52 +14,50 @@
 #include 
 
 /* Page table allocation/freeing. */
-extern struct kmem_cache *pgtable_cache;
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pgd_free(pgd_t *pgd)
 {
-   kmem_cache_free(pgtable_cache, pgd);
+   quicklist_free(0, NULL, pgd);
 }
 
 #define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return kmem_cache_alloc(pgtable_cache,
-   GFP_KERNEL|__GFP_REPEAT);
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pmd_free(pmd_t *pmd)
 {
-   kmem_cache_free(pgtable_cache, pmd);
+   quicklist_free(0, NULL, pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
  unsigned long address)
 {
-   return kmem_cache_alloc(pgtable_cache,
-   GFP_KERNEL|__GFP_REPEAT);
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
 unsigned long address)
 {
-   return virt_to_page(pte_alloc_one_kernel(mm, address));
+   void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
+   return pg ? virt_to_page(pg) : NULL;
 }

 static inline void pte_free_kernel(pte_t *pte)
 {
-   kmem_cache_free(pgtable_cache, pte);
+   quicklist_free(0, NULL, pte);
 }
 
 static inline void pte_free(struct page *ptepage)
 {
-   pte_free_kernel(page_address(ptepage));
+   quicklist_free(0, NULL, page_address(ptepage));
 }
 
 
@@ -66,6 +65,9 @@ static inline void pte_free(struct page 
 #define pmd_populate(MM,PMD,PTE_PAGE)  \
pmd_populate_kernel(MM,PMD,page_address(PTE_PAGE))
 
-#define check_pgt_cache()  do { } while (0)
+static inline void check_pgt_cache(void)
+{
+   quicklist_check(0, NULL);
+}
 
 #endif /* _SPARC64_PGALLOC_H */