Re: [PATCH 1/1] irqchip: exynos-combiner: Save IRQ enable set on suspend

2015-06-10 Thread Peter Chubb
>>>>> "Javier" == Javier Martinez Canillas writes:

Javier> The Exynos interrupt combiner IP looses its state when the SoC
     s/looses/loses/

Peter C
-- 
Dr Peter Chubb  peter.chubb AT nicta.com.au
http://www.ssrg.nicta.com.au  Software Systems Research Group/NICTA
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Documentation: ARM: EXYNOS: Describe boot loaders interface

2015-06-06 Thread Peter Chubb
>>>>> "Krzysztof" == Krzysztof Kozlowski  writes:

Krzysztof> Various boot loaders for Exynos based boards use certain
Krzysztof> memory addresses during booting for different
Krzysztof> purposes. Mostly this is one of following : 1. as a CPU
Krzysztof> boot address, 2. for storing magic cookie related to low
Krzysztof> power mode (AFTR, sleep).

Krzysztof> The document, based solely on kernel source code, tries to
Krzysztof> group the information scattered over different files. This
Krzysztof> would help in the future when adding support for new SoC or
Krzysztof> when extending features related to low power modes.

Is it worth grabbing the info from U-Boot and documenting it here
(it's not documented other than in the Hardkernel U-Boot source)?

I can send you the info, or you can see it in
https://github.com/hardkernel/u-boot/blob/odroidxu3-v2012.07/board/samsung/smdk5420/lowlevel_init.S
at symbol nscode_base near line 104

-- 
Dr Peter Chubb  peter.chubb AT nicta.com.au
http://www.ssrg.nicta.com.au  Software Systems Research Group/NICTA


Re: [PATCH] debug: Do not permit CONFIG_DEBUG_STACK_USAGE=y on IA64 or PARISC

2012-07-26 Thread Peter Chubb
>>>>> "Ingo" == Ingo Molnar  writes:

Ingo> * James Bottomley  wrote:
>> Since the problem is an invalid assumption about how the stack
>> grows, why not just condition it on that.  We actually have a
>> config option for this: CONFIG_STACK_GROWSUP.  But for some reason
>> ia64 doesn't define this, why not, Tony?  It looks deliberate
>> because you have replaced a lot of
>> 
>> #ifdef CONFIG_STACK_GROWSUP
>> 
>> with
>> 
>> #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
>> 
>> but not all of them.

Ingo> Yes, that's another possible solution, assuming that it's really
Ingo> only about the up/down difference.

Ingo> Thanks,

IA64 has two stacks -- the standard one, that grows down, and the
register stack engine backing store, that grows up.  The usual
mechanisms for stack growth are used, so only some of the bits
predicated on `STACK_GROWSUP' are useful.

Peter C
--
Dr Peter Chubb  peter.chubb AT nicta.com.au
http://www.ssrg.nicta.com.au  Software Systems Research Group/NICTA


Fix compilation with gcc 4.2

2007-08-08 Thread Peter Chubb

gcc-4.2 is a lot more picky about its symbol handling.  EXPORT_SYMBOL
no longer works on symbols that are undefined or defined with static scope.

For example, with CONFIG_PROFILING off, I see:

  kernel/profile.c:206: error: __ksymtab_profile_event_unregister causes a section type conflict
  kernel/profile.c:205: error: __ksymtab_profile_event_register causes a section type conflict

This patch moves the EXPORTs inside the #ifdef CONFIG_PROFILING, so we
only try to export symbols that are defined.

Also, in kernel/kprobes.c there's an EXPORT_SYMBOL_GPL() for
jprobe_return, which if CONFIG_JPROBES is undefined is a static
inline and gives the same error.

And in drivers/acpi/resources/rsxface.c, there's an
ACPI_EXPORT_SYMBOL() for a static symbol. If it's static, it's not
accessible from outside the compilation unit, so should not be exported.

These three changes allow building a zx1_defconfig kernel with gcc 4.2
on IA64.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.6-git/kernel/profile.c
===
--- linux-2.6-git.orig/kernel/profile.c 2007-08-09 12:10:19.921216500 +1000
+++ linux-2.6-git/kernel/profile.c  2007-08-09 12:10:26.061162039 +1000
@@ -199,11 +199,11 @@ EXPORT_SYMBOL_GPL(register_timer_hook);
 EXPORT_SYMBOL_GPL(unregister_timer_hook);
 EXPORT_SYMBOL_GPL(task_handoff_register);
 EXPORT_SYMBOL_GPL(task_handoff_unregister);
+EXPORT_SYMBOL_GPL(profile_event_register);
+EXPORT_SYMBOL_GPL(profile_event_unregister);
 
 #endif /* CONFIG_PROFILING */
 
-EXPORT_SYMBOL_GPL(profile_event_register);
-EXPORT_SYMBOL_GPL(profile_event_unregister);
 
 #ifdef CONFIG_SMP
 /*
Index: linux-2.6-git/kernel/kprobes.c
===
--- linux-2.6-git.orig/kernel/kprobes.c 2007-08-09 12:14:48.898830198 +1000
+++ linux-2.6-git/kernel/kprobes.c  2007-08-09 14:09:50.180322576 +1000
@@ -1063,6 +1063,8 @@ EXPORT_SYMBOL_GPL(register_kprobe);
 EXPORT_SYMBOL_GPL(unregister_kprobe);
 EXPORT_SYMBOL_GPL(register_jprobe);
 EXPORT_SYMBOL_GPL(unregister_jprobe);
-EXPORT_SYMBOL_GPL(jprobe_return);
+
+#ifdef CONFIG_KPROBES
 EXPORT_SYMBOL_GPL(register_kretprobe);
 EXPORT_SYMBOL_GPL(unregister_kretprobe);
+#endif
Index: linux-2.6-git/drivers/acpi/resources/rsxface.c
===
--- linux-2.6-git.orig/drivers/acpi/resources/rsxface.c 2007-08-09 13:06:59.040346772 +1000
+++ linux-2.6-git/drivers/acpi/resources/rsxface.c  2007-08-09 13:12:03.125801491 +1000
@@ -474,8 +474,6 @@ acpi_rs_match_vendor_resource(struct acp
return (AE_CTRL_TERMINATE);
 }
 
-ACPI_EXPORT_SYMBOL(acpi_rs_match_vendor_resource)
-
 
/***
  *
  * FUNCTION:acpi_walk_resources


--
Dr Peter Chubb http://www.gelato.unsw.edu.au  [EMAIL PROTECTED]
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia



Re: [RFC] Deferred interrupt handling.

2007-07-18 Thread Peter Chubb

The problem you're having is essentially the same as the user-level
interrupt handler problem I've been dealing with for ages.

The basic rule is: don't share interrupts between devices on the host
and devices in the guest.  But you *can* share interrupts between
devices in a single guest.

If you want the code, see
http://www.gelato.unsw.edu.au/cgi-bin/viewvc.cgi/cvs/kernel/usrdrivers/latest/
and look at generic-irq.patch and fasync (which adds asynchronous notifications)

For the KVM work it'll need modifying a little, but the basic
infrastructure is there.

We've currently got this working to pass interrupts to a type-II (hosted)
virtual machine monitor running a guest kernel with native drivers.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia


Re: linux-ia64 build warning messages

2007-06-06 Thread Peter Chubb
>>>>> "Russ" == Russ Anderson <[EMAIL PROTECTED]> writes:

Russ> Tony Luck wrote:
>> > I used the sn2_defconfig in the tree :)
>> 
>> So there is something odd happening.  Russ complained that he was
>> still seeing several errors from the sn2_defconfig build too when I
>> posted the "last fix" to Len.  But I don't see them when I build.

Russ> An additional data point.  I have a copy of Tony's test tree
Russ> pulled down on March 30th that builds without the warning
Russ> messages.  The copy of Tony's test tree pulled down on May 22nd
Russ> does have warning messages.  I'm building both with the same
Russ> compiler (etc).  I'm fairly certain a tree I pulled down in
Russ> April built without warnings.  I've since blown away that tree.

Change request 85bd2fddd68e757da8e1af98f857f61a3c9ce647 introduced
section-mismatch checking for vmlinux, which caused all these warnings
to become visible.

It looks as if gcc can create references from .sdata to .init.sdata
depending on what optimisations it chooses to do.  Ideally we could
teach gcc to put its constants in the same section they reference.
But I'm no gcc guru.  The alternative is to get modpost to ignore such
references, at the cost of perhaps missing a real problem somewhere.
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia


Re: BUG: sleeping function called from invalid context at kernel/fork.c:385

2007-06-03 Thread Peter Chubb

I see a great many section mismatches when compiling with gcc 4.1 and
binutils 2.17.50.20070426.  They appear to be from .sdata to
.init.data.

This is with basic zx1_defconfig with a few mods.

The reason appears to be compiler weirdness.


WARNING: init/built-in.o(.sdata+0x30): Section mismatch: reference to .init.data:ino (after 'root_mountflags')

initramfs.s contains a 32-word table `head'.  Code like:
static __initdata struct hash {..} *head[32];

for (p = head; p < head + 32; p++)
is generating:
  .section .sdata
L24:
.data8 head#+256


Rather than adding 256 to head at run time, the compiler loads L24 and
uses that for the comparison.  This triggers the warning.
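
A rough C rendering of that transformation, as a sketch only.  The
names walk_as_written, walk_as_emitted and head_end are made up for
illustration and are not in initramfs.c, and __initdata is left out so
the fragment builds outside the kernel.  With 8-byte pointers, head + 32
is head plus 256 bytes, which is where the `head#+256' word comes from.

```c
#include <stddef.h>

struct hash { int ino; };

/* In the kernel this table is __initdata; the annotation is dropped here. */
static struct hash *head[32];

/* The loop as written in the source: the bound is computed in the code. */
int walk_as_written(void)
{
    int visited = 0;
    struct hash **p;

    for (p = head; p < head + 32; p++)
        visited++;
    return visited;
}

/* Illustrative stand-in for the compiler's L24 word: the end address is a
 * link-time constant stored in ordinary data, outside the init section. */
static struct hash **const head_end = head + 32;

/* The loop roughly as emitted: the bound is loaded from that data word. */
int walk_as_emitted(void)
{
    int visited = 0;
    struct hash **p;

    for (p = head; p < head_end; p++)
        visited++;
    return visited;
}
```

Both loops visit the same 32 slots; the only difference is that the
second keeps a pointer into the table in a separate data word, and that
word is what modpost sees as a reference into .init.data.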




WARNING: arch/ia64/kernel/built-in.o(.sdata+0x110): Section mismatch: reference to .init.data:rsvd_region (between 'ia64_sal' and 'ia64_i_cache_stride_shift')
WARNING: mm/built-in.o(.sdata+0x48): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x50): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x58): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x60): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x68): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x70): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x78): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x80): Section mismatch: reference to .init.data:early_node_map before 'sysctl_lowmem_reserve_ratio' (at offset -0x0)
WARNING: mm/built-in.o(.sdata+0x3c8): Section mismatch: reference to .init.data: (between 'swap_list' and 'slab_early_init')
WARNING: mm/built-in.o(.sdata+0x3d8): Section mismatch: reference to .init.data:initkmem_list3 (between 'swap_list' and 'slab_early_init')
WARNING: mm/built-in.o(.sdata+0x3e0): Section mismatch: reference to .init.data:initkmem_list3 (between 'swap_list' and 'slab_early_init')
WARNING: drivers/built-in.o(.data.rel.local+0x20a8): Section mismatch: reference to .init.text:acpi_processor_start (between 'acpi_processor_driver' and 'acpi_thermal_driver')
WARNING: drivers/built-in.o(.data.rel+0x1d80): Section mismatch: reference to .init.text:serial8250_console_setup (between 'serial8250_console' and 'dpm_active')
WARNING: drivers/built-in.o(.sdata+0x788): Section mismatch: reference to .init.data: (between 'first.20152' and 'enabled')
WARNING: drivers/built-in.o(.sdata+0x790): Section mismatch: reference to .init.data: (between 'first.20152' and 'enabled')
WARNING: drivers/built-in.o(.sdata+0xa18): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xa20): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xa28): Section mismatch: reference to .init.data: (between 'scsi_null_device_strs' and 'fc_dev_loss_tmo')
WARNING: drivers/built-in.o(.sdata+0xac8): Section mismatch: reference to .init.data: (between 'Symbios_trailer.24436' and 'try_direct_io')
WARNING: drivers/built-in.o(.sdata+0xb00): Section mismatch: reference to .init.data: (between 'st_max_sg_segs' and 'osst_version')
WARNING: arch/ia64/hp/common/built-in.o(.data.rel.local+0xa8): Section mismatch: reference to .init.text:acpi_sba_ioc_add (between 'acpi_sba_ioc_driver' and 'ioc_seq_ops')
WARNING: arch/ia64/hp/common/built-in.o(.sdata+0x0): Section mismatch: reference to .init.data:__setup_str_sba_page_override before 'reserve_sba_gart' (at offset -0x204c2613)
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia


Re: [PATCH Resend] - SN: validate smp_affinity mask on intr redirect

2007-05-08 Thread Peter Chubb

Jack>  }
Jack> +
Jack> +bool is_affinity_mask_valid(cpumask_t cpumask)
Jack> +{
Jack> + if (ia64_platform_is("sn2")) {
Jack> + /* Only allow one CPU to be specified in the smp_affinity mask */
Jack> + if (cpus_weight(cpumask) != 1)
Jack> + return false;

Why not just:
return cpus_weight(cpumask) == 1;


It's a Boolean; treat it as one.
(If you thought the average kernel programmer (who's s/he?) understood
the logical implication rule it could be:
return !ia64_platform_is("sn2") || cpus_weight(cpumask) == 1;
)
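
To convince yourself the two forms agree, here is a small stand-alone
sketch.  It assumes the part of the quoted function that was trimmed
falls through to `return true'; on_sn2 and weight are hypothetical
stand-ins for ia64_platform_is("sn2") and cpus_weight(cpumask), so this
is not the kernel code itself.

```c
#include <stdbool.h>

/* Shape of the quoted patch, assuming the trimmed tail returns true. */
bool mask_valid_branchy(bool on_sn2, int weight)
{
    if (on_sn2) {
        /* Only one CPU may be set in the mask on sn2. */
        if (weight != 1)
            return false;
    }
    return true;
}

/* The same rule as a logical implication: on_sn2 implies weight == 1. */
bool mask_valid_implication(bool on_sn2, int weight)
{
    return !on_sn2 || weight == 1;
}
```

The two functions return the same value for every combination of
platform and weight, which is all the implication rule says.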
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia



Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-13 Thread Peter Chubb
>>>>> "Jeremy" == Jeremy Fitzhardinge <[EMAIL PROTECTED]> writes:


Jeremy> And do the same in pte pages for actual mapped pages?  Or do
Jeremy> you think they would be too densely populated for it to be
Jeremy> worthwhile?

We've been doing some measurements on how densely clumped ptes are.
On 32-bit platforms, they're pretty dense.  On IA64, quite a bit
sparser, depending on the workload of course.  I think that's mostly because
of the larger pagesize on IA64 -- with 64k pages, you don't need very
many to map a small object.
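
Some back-of-envelope arithmetic for the page-size effect, as a sketch.
The numbers are assumptions chosen for illustration, not our
measurements: a 1MB object, 4-byte ptes with 4k pages on a 32-bit
machine, and 8-byte ptes with 64k pages on IA64.

```c
/* Pages (and so ptes) needed to map an object, rounding up. */
unsigned long pages_needed(unsigned long object_bytes, unsigned long page_bytes)
{
    return (object_bytes + page_bytes - 1) / page_bytes;
}

/* pte slots in one pte page, assuming a pte page is one page in size. */
unsigned long ptes_per_pte_page(unsigned long page_bytes, unsigned long pte_bytes)
{
    return page_bytes / pte_bytes;
}
```

Under those assumptions the 1MB object fills 256 of 1024 slots in a
4k pte page, but only 16 of 8192 slots in a 64k pte page.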

I'm hoping IanW can give more details.


Re: Ski for huge page size !

2006-11-27 Thread Peter Chubb
>>>>> "sudhnesh" == sudhnesh adapawar <[EMAIL PROTECTED]> writes:

sudhnesh> Hey all !  I am thinking to use ski simulator as I can get
sudhnesh> the ia64 (Itanium 2)simulated on ia32 archiSo can I use
sudhnesh> this product for the project related to huge page size ???
sudhnesh> Will the problems related to huge pages such as
sudhnesh> swapping,IO,etc...will be covered if I use ski with 2.6
sudhnesh> kernel image configured for ia64 archi with huge page size
sudhnesh> support ?


Should work perfectly.  We've been using Ski for similar work, looking
at SuperPage support.
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia


Re: How to boot 2.6 kernel using hp ski simulator ???

2006-11-27 Thread Peter Chubb

Please check out http://www.gelato.unsw.edu.au/IA64wiki/SkiSimulator
for lots of info on Ski.

It works fine with Linux 2.6, and hugepages work too.

> 1) I used 'make ARCH=ia64 menuconfig' to configure and followed the
> steps to get kernel image of version 2.6 ! I also selected the generic
> type as Ski-simulator and also selected the HP-ski drivers something
> simscsi,etc.etc.

I suggest you start with
make sim_defconfig

Your symptoms look like a misconfigured or misbuilt vmlinux.  The sim_defconfig

If you're running on IA32, then you need something like:
make CROSS_COMPILE=ia64-linux-gnu- ARCH=ia64 boot
to build kernel and bootloader.

You need to get or build yourself a disk image.  Instructions for
building at http://www.gelato.unsw.edu.au/IA64wiki/skidiskimage 




--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au   ERTOS within National ICT Australia


RE: ip_contrack refuses to load if built UP as a module on IA64

2005-08-31 Thread Peter Chubb


This patch makes UP and SMP do the same thing as far as module per-cpu
data go.

Unfortunately it affects core code.

To repeat the problem:
  IA64 keeps per-cpu data in a small data area that is referenced by a
  22-bit offset, for both UP and SMP cases.  If a module defines
  per-cpu data, it too will end up in the small-data area.  But the
  module loader at present special-cases the UP treatment of per-cpu
  data, assumes that it is in the GP-relative data area, and does
  nothing (for SMP it allocates space, and copies initialised data
  items into it).

  The effect is that modules defining per-cpu data fail to load if
  they're built UP, because of an impossible relocation.

  The appended patch makes the treatment of per-cpu data uniform
  between UP and SMP cases.  For most architectures, the per-cpu data
  section will be empty for UP, and so the per-cpu setup code will not
  be invoked.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

diff --git a/arch/ia64/kernel/module.c b/arch/ia64/kernel/module.c
--- a/arch/ia64/kernel/module.c
+++ b/arch/ia64/kernel/module.c
@@ -951,4 +951,10 @@ percpu_modcopy (void *pcpudst, const voi
if (cpu_possible(i))
memcpy(pcpudst + __per_cpu_offset[i], src, size);
 }
+#else
+void
+percpu_modcopy (void *pcpudst, const void *src, unsigned long size)
+{
+   memcpy(pcpudst, src, size);
+}
 #endif /* CONFIG_SMP */
diff --git a/kernel/module.c b/kernel/module.c
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -209,7 +209,6 @@ static struct module *find_module(const 
return NULL;
 }
 
-#ifdef CONFIG_SMP
 /* Number of blocks used and allocated. */
 static unsigned int pcpu_num_used, pcpu_num_allocated;
 /* Size of each block.  -ve means used. */
@@ -352,29 +351,7 @@ static int percpu_modinit(void)
return 0;
 }  
 __initcall(percpu_modinit);
-#else /* ... !CONFIG_SMP */
-static inline void *percpu_modalloc(unsigned long size, unsigned long align,
-   const char *name)
-{
-   return NULL;
-}
-static inline void percpu_modfree(void *pcpuptr)
-{
-   BUG();
-}
-static inline unsigned int find_pcpusec(Elf_Ehdr *hdr,
-   Elf_Shdr *sechdrs,
-   const char *secstrings)
-{
-   return 0;
-}
-static inline void percpu_modcopy(void *pcpudst, const void *src,
- unsigned long size)
-{
-   /* pcpusec should be 0, and size of that section should be 0. */
-   BUG_ON(size != 0);
-}
-#endif /* CONFIG_SMP */
+
 
 #ifdef CONFIG_MODULE_UNLOAD
 #define MODINFO_ATTR(field)\


'mdio_bus_exit' in discarded section .text.exit

2005-08-31 Thread Peter Chubb

When building with  CONFIG_PHYLIB=y on Itanium, I see:
 `mdio_bus_exit' referenced in section `.init.text' of
drivers/built-in.o: defined in discarded section `.exit.text' of
drivers/built-in.o

I believe that mdio_bus_exit should not be declared __exit, because it
is referenced from __init sections in, say, phy_init().

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -170,7 +170,7 @@ int __init mdio_bus_init(void)
   return bus_register(&mdio_bus_type);
 }
 
-void __exit mdio_bus_exit(void)
+void mdio_bus_exit(void)
 {
bus_unregister(&mdio_bus_type);
 }


-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: Where is the performance bottleneck?

2005-08-29 Thread Peter Chubb
>>>>> "Holger" == Holger Kiehl <[EMAIL PROTECTED]> writes:

Holger> Hello I have a system with the following setup:

(4-way CPUs, 8 spindles on two controllers)

Try using XFS.

See http://scalability.gelato.org/DiskScalability_2fResults --- ext3
is single threaded and tends not to get the full benefit of either the
multiple spindles or the multiple processors.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Include assembly entry points in TAGS

2005-08-22 Thread Peter Chubb

As it stands, etags doesn't find labels in the IA64 or i386 assembler source
code, because they're disguised inside a preprocessor macro.

I propose the attached fix, which adds a regular expression to enable
labels disguised by ENTRY() and GLOBAL_ENTRY() macros.

There's a similar problem for MIPS, which needs to match LEAF(entrypoint)
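
To see what the expression picks out, here is a user-space sketch
using POSIX regcomp().  It is an assumption-laden translation, not
what etags runs: the pattern is rewritten in POSIX extended syntax,
it is anchored at line start on the understanding that etags anchors
its regexps the same way, and entry_label is a made-up helper name.

```c
#include <regex.h>
#include <string.h>

/* Copy the label named by ENTRY(label) or GLOBAL_ENTRY(label) into out.
 * Returns 1 on a match that fits in out, 0 otherwise. */
int entry_label(const char *line, char *out, size_t outlen)
{
    /* Group 1 is the optional GLOBAL_ prefix, group 2 is the label. */
    static const char pattern[] = "^(GLOBAL_)?ENTRY\\(([^)]+)\\)";
    regex_t re;
    regmatch_t m[3];
    int found = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    if (regexec(&re, line, 3, m, 0) == 0 && m[2].rm_so >= 0) {
        size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);

        if (len < outlen) {
            memcpy(out, line + m[2].rm_so, len);
            out[len] = '\0';
            found = 1;
        }
    }
    regfree(&re);
    return found;
}
```

Feeding it a line such as GLOBAL_ENTRY(ia64_switch_to) yields the
label ia64_switch_to, which is what the \2 in the etags expression
names as the tag.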

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -1187,7 +1187,7 @@ cscope: FORCE
$(call cmd,cscope)
 
 quiet_cmd_TAGS = MAKE   $@
-cmd_TAGS = $(all-sources) | etags -
+cmd_TAGS = $(all-sources) | etags --regex='{asm}/\(GLOBAL_\)?ENTRY(\([^)]+\))/\2/' -
 
 #  Exuberant ctags works better with -I
 


Re: fcntl(F_GETLEASE) semantics??

2005-08-10 Thread Peter Chubb
>>>>> "Trond" == Trond Myklebust <[EMAIL PROTECTED]> writes:

Trond> to den 11.08.2005 Klokka 09:48 (+1000) skreiv Peter Chubb:
>> Hi, The LTP test fcntl23 is failing.  It does, in essence, fd =
>> open(xxx, O_RDWR|O_CREAT, 0777); if (fcntl(fd, F_SETLEASE, F_RDLCK)
>> == -1) fail;
>> 
>> fcntl always returns EAGAIN here.  The manual page says that a read
>> lease causes notification when `another process' opens the file for
>> writing or truncates it.  The kernel implements `any process'
>> (including the current one).
>> 
>> Which semantics are correct?  Personally I think that what the
>> kernel implements is correct (you can't get a read lease unless
>> there are no writers _at_ _all_)

Trond> A read lease should mean that there are no writers at all.

Trond> If we were to allow the current process to open for write, then
Trond> that would still mean that nobody else can get a lease. In
Trond> effect you have been granted a lease with exclusive semantics
Trond> (i.e. a write lease). You might as well request that instead of
Trond> pretending it is a read lease.

So the manual page is wrong.  Fine.


-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


fcntl(F_GETLEASE) semantics??

2005-08-10 Thread Peter Chubb

Hi,
The LTP test fcntl23 is failing.  It does, in essence, 
fd = open(xxx, O_RDWR|O_CREAT, 0777);
if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
   fail;

fcntl always returns EAGAIN here.  The manual page says that a read
lease causes notification when `another process' opens the file for
writing or truncates it.  The kernel implements `any process'
(including the current one).

Which semantics are correct?  Personally I think that what the kernel
implements is correct (you can't get a read lease unless there are no
writers _at_ _all_).


-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: How to get the physical page addresses from a kernel virtual address for DMA SG List?

2005-08-04 Thread Peter Chubb
You may want to take a look at the user-mode driver infrastructure
patches, which do almost exactly what you're trying to do.

Get them from
http://www.gelato.unsw.edu.au/cgi-bin/viewcvs.cgi/cvs/kernel/usrdrivers/kernel-2.6.12-rc3/

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: Hangcheck problem

2005-03-30 Thread Peter Chubb
>>>>> "Noah" == Noah Silverman <[EMAIL PROTECTED]> writes:

Noah> Sorry 2.6.7


Noah> Burton Windle wrote:
>> Kernel version?

Are you running on an x86 machine without TSC, e.g., a 486?  The
Hangcheck timer then devolves into using jiffies, and a single jiffy
error gives you the printout you mention.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: How to measure time accurately.

2005-03-29 Thread Peter Chubb
>>>>> "Chris" == Chris Friesen <[EMAIL PROTECTED]> writes:

Chris> krishna wrote:
>> Hi All,
>> 
>> Can any one tell me how to measure time accurately for a block of C
>> code in device drivers.  For example, If I want to measure the time
>> duration of firmware download.

Chris> Most cpus have some way of getting at a counter or decrementer
Chris> of various frequencies.  Usually it requires low-level hardware
Chris> knowledge and often it needs assembly code.

As a device driver is inside the Linux kernel (unless you're writing a
user-mode device driver :-)) you can use the get_cycles() macro that's
defined for most architectures.  It provides a snapshot of the
cycle-counter.

Caveats:
1.  If you're running with power management, the  cycle
counter ticks at a  variable rate.
2.  If you're on a multiprocessor, the cycle counters of
different processors need not be synchronised.
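
For the user-mode driver case, a sketch of the same start/stop pattern
using clock_gettime(CLOCK_MONOTONIC) rather than the cycle counter.
This is a swapped-in technique, not the in-kernel one: it reports
nanoseconds, not cycles, and now_ns, time_block_ns and sample_work are
made-up names with sample_work standing in for the firmware download.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds, or 0 if the clock is unavailable. */
uint64_t now_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/* Placeholder workload; replace with the code being measured. */
void sample_work(void)
{
    volatile unsigned long sink = 0;
    unsigned long i;

    for (i = 0; i < 100000; i++)
        sink += i;
}

/* Take a snapshot before and after the block and return the difference. */
uint64_t time_block_ns(void (*block)(void))
{
    uint64_t start = now_ns();

    block();
    return now_ns() - start;
}
```

Inside the kernel the shape is the same, with the two snapshots taken
from the cycle counter and the caveats above applying to the result.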
-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: LBD/filesystems over 2TB: is it safe?

2005-03-21 Thread Peter Chubb
>>>>> "jniehof" == jniehof  <[EMAIL PROTECTED]> writes:

jniehof> Someone posted to the LBD list last December regarding some
jniehof> supposedly horrible bugs in large filesystems:
jniehof> https://www.gelato.unsw.edu.au/archives/lbd/2004-December/75.html
jniehof> https://www.gelato.unsw.edu.au/archives/lbd/2004-December/74.html

The changes in those emails are irrelevant --- they fail to take into
account the properties of the filesystems that they modify, which mean
that the 32-bit quantities being shifted will not overflow.

They're typically of the form:
-   iblock = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+   iblock = (sector_t) index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
 
Now, on a 32-bit processor with 4k pages, PAGE_CACHE_SHIFT is 12, and
i_blkbits is also 12 if you're using 4k blocks (which you have to do to
get a large filesystem).  So the shift count is zero, and the cast
changes nothing: the original code is safe.  The
on-disk format for ext[23] uses 32-bit block numbers, so your maximum
filesystem size is 16TB, and your maximum value of iblock is 2^32-1.
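
The arithmetic can be checked with a few lines of C.  This is a sketch
under the assumptions stated above (4k pages, 32-bit on-disk block
numbers); max_fs_bytes, max_page_index and iblock_for are illustrative
names, not kernel functions.

```c
#include <stdint.h>

#define PAGE_CACHE_SHIFT 12   /* 4k pages, as assumed in the text */

/* Largest filesystem addressable with 2^32 blocks of 2^blkbits bytes. */
uint64_t max_fs_bytes(unsigned blkbits)
{
    return (UINT64_C(1) << 32) << blkbits;
}

/* Largest page index that can exist on such a filesystem. */
uint64_t max_page_index(unsigned blkbits)
{
    uint64_t pages = (UINT64_C(1) << 32) >> (PAGE_CACHE_SHIFT - blkbits);

    return pages - 1;
}

/* The shift from the patch, done in 64 bits so any overflow would show. */
uint64_t iblock_for(uint64_t index, unsigned blkbits)
{
    return index << (PAGE_CACHE_SHIFT - blkbits);
}
```

With 4k blocks the limit comes out at 16TB and the largest iblock at
2^32-1; with 1k blocks the largest valid page index shifts to 2^32-4.
Either way the result fits in 32 bits, so the cast buys nothing.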

Please do benchmark XFS and ext3 on your system before choosing.  Our
tests (to be published in Linux.Conf.Au next month) show that XFS is
significantly faster for some workloads.
Also its scalability to very large filesystems is much more mature than ext3.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: forkbombing Linux distributions

2005-03-20 Thread Peter Chubb
>>>>> "William" == William Beebe <[EMAIL PROTECTED]> writes:

William> Sure enough, I created the following script and ran it as a
William> non-root user:

William> #!/bin/bash $0 & $0 &

There are two approaches to fixing this.
  1.  Rate limit fork().  Unfortunately some legitimate usages do a lot
  of forking, and you don't really want to slow them down.
  2.  Limit (per user) the number of processes allowed. This is what's
  currently done; and if you as administrator want to you can set
  RLIMIT_NPROC (the nproc item) in /etc/security/limits.conf.

On an almost-single-user system such as most desktops, there isn't much
point in setting this.  On shared systems, it can be useful.
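
The same limit can be lowered from inside a process with setrlimit(),
which is roughly what the login machinery does on your behalf after
reading limits.conf.  A minimal sketch; cap_process_count is a made-up
helper name and the value passed in is whatever suits your system.

```c
#include <sys/resource.h>

/* Lower the soft RLIMIT_NPROC to at most max_procs.
 * Returns 0 on success, -1 on error. */
int cap_process_count(rlim_t max_procs)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) != 0)
        return -1;
    /* Only ever lower the soft limit; the hard limit is left alone. */
    if (rl.rlim_cur > max_procs)
        rl.rlim_cur = max_procs;
    return setrlimit(RLIMIT_NPROC, &rl);
}
```

Once the limit is in place, fork() in that process and its children
fails with EAGAIN when the user already has that many processes, so
the fork bomb stalls instead of taking the machine with it.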

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: vm_dirty_ratio seems a bit large.

2005-03-17 Thread Peter Chubb
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> Robin Holt <[EMAIL PROTECTED]> wrote:

>>  One other issue we have is the vm_dirty_ratio and background_ratio
>> adjustments are a little coarse with these memory sizes.  Since our
>> minimum adjustment is 1%, we are adjusting by 40GB on the largest
>> configuration from above.  The hardware we are shipping today is
>> capable of going to far greater amounts of memory, but we don't
>> have customers demanding that yet.  I would like to plan ahead for
>> that and change vm_dirty_ratio from a straight percent into a
>> millipercent (thousandth of a percent).  Would that type of change
>> be acceptable?

Andrew> Oh drat.  I think such a change would require a new set of
Andrew> /proc entries.  

No, you could just extend them to understand fixed point.  Keep
printing integers as integers, print non-integers with one (or two:
will we ever need 0.01% increments?) decimal places.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Can no longer build ipv6 built-in (2.6.11, today's BK head)

2005-03-15 Thread Peter Chubb


Changeset 
  [EMAIL PROTECTED]|ChangeSet|20050310043957|06845
added cleanup to ipv6_init(), which calls ip6_route_cleanup()

ip6_route_cleanup() is marked __exit so cannot be called from an
__init section -- it's discarded by the linker from the image
(although it'll be retained in a module).

You get errors like this:
ip6_route_cleanup: discarded in section `.exit.text' from net/built-in.o
xfrm6_fini: discarded in section `.exit.text' from net/built-in.o
fib6_gc_cleanup: discarded in section `.exit.text' from net/built-in.o
ipv6_packet_cleanup: discarded in section `.exit.text' from net/built-in.o


A simple fix is to delete the __exit from the various functions now that
they're called other than at module_exit.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.5-import/net/ipv6/route.c
===
--- linux-2.5-import.orig/net/ipv6/route.c  2005-03-16 10:12:44.742595387 +1100
+++ linux-2.5-import/net/ipv6/route.c   2005-03-16 13:01:50.246678866 +1100
@@ -2116,7 +2116,7 @@
 #endif
 }
 
-void __exit ip6_route_cleanup(void)
+void ip6_route_cleanup(void)
 {
 #ifdef CONFIG_PROC_FS
proc_net_remove("ipv6_route");
Index: linux-2.5-import/net/ipv6/ipv6_sockglue.c
===
--- linux-2.5-import.orig/net/ipv6/ipv6_sockglue.c  2005-03-16 10:12:44.736736056 +1100
+++ linux-2.5-import/net/ipv6/ipv6_sockglue.c   2005-03-16 13:24:19.095793200 +1100
@@ -698,7 +698,7 @@
dev_add_pack(&ipv6_packet_type);
 }
 
-void __exit ipv6_packet_cleanup(void)
+void ipv6_packet_cleanup(void)
 {
dev_remove_pack(&ipv6_packet_type);
 }
Index: linux-2.5-import/net/ipv6/ip6_fib.c
===
--- linux-2.5-import.orig/net/ipv6/ip6_fib.c  2005-03-15 12:28:44.819748921 +1100
+++ linux-2.5-import/net/ipv6/ip6_fib.c 2005-03-16 13:27:46.423351526 +1100
@@ -1218,7 +1218,7 @@
panic("cannot create fib6_nodes cache");
 }
 
-void __exit fib6_gc_cleanup(void)
+void fib6_gc_cleanup(void)
 {
del_timer(&ip6_fib_timer);
kmem_cache_destroy(fib6_node_kmem);
Index: linux-2.5-import/net/ipv6/xfrm6_policy.c
===
--- linux-2.5-import.orig/net/ipv6/xfrm6_policy.c   2005-03-15 12:28:44.853928319 +1100
+++ linux-2.5-import/net/ipv6/xfrm6_policy.c  2005-03-16 13:53:28.890552848 +1100
@@ -276,7 +276,7 @@
xfrm_policy_register_afinfo(&xfrm6_policy_afinfo);
 }
 
-static void __exit xfrm6_policy_fini(void)
+static void xfrm6_policy_fini(void)
 {
xfrm_policy_unregister_afinfo(&xfrm6_policy_afinfo);
 }
@@ -287,7 +287,7 @@
xfrm6_state_init();
 }
 
-void __exit xfrm6_fini(void)
+void xfrm6_fini(void)
 {
//xfrm6_input_fini();
xfrm6_policy_fini();
Index: linux-2.5-import/net/ipv6/xfrm6_state.c
===
--- linux-2.5-import.orig/net/ipv6/xfrm6_state.c  2005-03-15 12:28:44.854904874 +1100
+++ linux-2.5-import/net/ipv6/xfrm6_state.c 2005-03-16 13:29:30.183337361 +1100
@@ -129,7 +129,7 @@
xfrm_state_register_afinfo(&xfrm6_state_afinfo);
 }
 
-void __exit xfrm6_state_fini(void)
+void xfrm6_state_fini(void)
 {
xfrm_state_unregister_afinfo(&xfrm6_state_afinfo);
 }



-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-14 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Tue, 15 Mar 2005 14:47:42 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> What I really want to do is deprivilege the driver code as much as
>> possible.  Whatever a driver does, the rest of the system should
>> keep going.  That way malicious or buggy drivers can only affect
>> the processes that are trying to use the device they manage.
>> Moreover, it should be possible to kill -9 a driver, then restart
>> it, without the rest of the system noticing more than a hiccup.  To
>> do this, step one is to run the driver in user space, so that it's
>> subject to the same resource management control as any other
>> process.  Step two, which is a lot harder, is to connect the driver
>> back into the kernel so that it can be shared.  Tun/Tap can be used
>> for network devices, but it's really too slow -- you need zero-copy
>> and shared notification.

Jon> Have you considered running the drivers in a domain under Xen?

See the paper presented by Karlsruhe at OSDI:

Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Götz:
Unmodified Device Driver Reuse and Improved System Dependability via
Virtual Machines.  OSDI '04.

They're using L4, rather than Xen, as the paravirtualisation layer.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-14 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Mon, 14 Mar 2005 12:42:27 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> >>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:
>> 
>> >> The scenario I'm thinking about with these patches are things
>> like >> low-latency user-level networking between nodes in a
>> cluster, where >> for good performance even with a kernel driver
>> you don't want to >> share your interrupt line with anything else.
>> 
Jon> The code needs to refuse to install if the IRQ line is shared.
>>  It does.  The request_irq() call explicitly does not include
>> SA_SHARED in its flags, so if the line is shared, it'll return an
>> error to user space when the driver tries to open the file
>> representing the interrupt.

Jon> Please put some big comments warning people about adding
Jon> SA_SHARED. I can easily see someone thinking that they are fixing
Jon> a bug by adding it. I'd probably even write a paragraph about
Jon> what will happen if SA_SHARED is added.

Will do.  The main problem here is X86, as other architectures either
don't care, or have enough interrupt lines.  And the people who are
paying me for this kind of thing all run IA64.

What I really want to do is deprivilege the driver code as much as
possible.  Whatever a driver does, the rest of the system should keep
going.  That way malicious or buggy drivers can only affect the
processes that are trying to use the device they manage.  Moreover, it
should be possible to kill -9 a driver, then restart it, without the
rest of the system noticing more than a hiccup.  To do this,
step one is to run the driver in user space, so that it's subject to
the same resource management control as any other process.  Step two,
which is a lot harder, is to connect the driver back into the kernel
so that it can be shared.  Tun/Tap can be used for network devices,
but it's really too slow -- you need zero-copy and shared notification.


-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


inode_lock heavily contended in 2.6.11

2005-03-13 Thread Peter Chubb

When running reaim7 on a 12-way IA64 on an ext2 filesystem on a ram
disc, I see very heavy contention on inode_lock.

lockstat output shows:

SPINLOCKS         HOLD            WAIT
  UTIL   CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT  SPIN RJECT  NAME
 46.8% 52.4%   1.9us( 130us)   20us(8073us)(21.5%)   5072151  47.6% 52.4%    0%  inode_lock
 15.9% 59.5%   3.8us(  61us)   18us(7067us)( 3.9%)    852983  40.5% 59.5%    0%    __sync_single_inode+0xf0
  9.2% 59.0%   1.2us(  25us)   20us(8073us)( 7.8%)   1596487  41.0% 59.0%    0%    generic_osync_inode+0xe0

 (etc).

Is anyone else seeing this on more realistic workloads?

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Mon, 14 Mar 2005 12:42:27 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>> >>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:
>> 
>> >> The scenario I'm thinking about with these patches are things
>> like >> low-latency user-level networking between nodes in a
>> cluster, where >> for good performance even with a kernel driver
>> you don't want to >> share your interrupt line with anything else.

Jon> Instead of making up a new API what about making a library of
Jon> calls that emulates the common entry points used by device
Jon> drivers. The version I did for UML could take the same driver and
Jon> run it in user space or the kernel without changing source
Jon> code. I found this very useful.

The in-kernel device drivers interface is very large --- I want to
start with something a bit simpler.  We do have a compatibility
library, as yet unreleased, that allows the same drivers to run
in-kernel or in user space.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Sat, 12 Mar 2005 10:11:18 -0700 (MST), Zwane Mwaikambo
Jon> <[EMAIL PROTECTED]> wrote:
>> Alan's proposal sounds very plausible and additionally if we find
>> that we have an irq line screaming we could use the same supplied
>> information to disable userspace interrupt handled devices first.

Jon> I like it too and it would help Xen. Now we just need to modify
Jon> 800 device drivers to use it.

It's incomplete.  But you probably knew that...

The main problem I see is that even with the proposed interface, you'd
need to disable the interrupt in the interrupt controller, because
merely acknowledging an interrupt to a device doesn't stop it from
interrupting.  And you really want the device to stop asserting the
interrupt before doing an EOI, unless you're going to mask the
interrupt.  So you'd need to have an interface that not only
acknowledged the current interrupt but also prevented the device from
interrupting.  That typically means reading a status register (slow!)
and then setting one or more bits in one or more control registers.

Also, for a user-level driver you really want to do the EOI before
invoking user space.  Otherwise, depending on the interrupt
controller, lower numbered interrupts could be masked until the user
space returns --- which might be a long time off.

Reading the status register is typically one of the slowest
single parts of a device driver (latency can be > 2 usec), so you don't
really want to have to read it again within the driver... so you'd
probably want to pass it as part of the interrupt arguments to the
driver.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

>>  The scenario I'm thinking about with these patches are things like
>> low-latency user-level networking between nodes in a cluster, where
>> for good performance even with a kernel driver you don't want to
>> share your interrupt line with anything else.

Jon> The code needs to refuse to install if the IRQ line is shared.

It does.  The request_irq() call explicitly does not include SA_SHARED
in its flags, so if the line is shared, it'll return an error to user
space when the driver tries to open the file representing the interrupt.

Jon> Also what about SMP, if you shut the IRQ off on one CPU isn't it
Jon> still enabled on all of the others?

Nope.   disable_irq_nosync() talks to the interrupt controller, which
is common to all the processors.  The main problem is that it's slow,
because it has to go off-chip.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Fri, 11 Mar 2005 11:29:20 +0100, Pavel Machek <[EMAIL PROTECTED]>
Jon> wrote:
>> Hi!
>> 
>> > As many of you will be aware, we've been working on
>> infrastructure for > user-mode PCI and other drivers.  The first
>> step is to be able to > handle interrupts from user
>> space. Subsequent patches add > infrastructure for setting up DMA
>> for PCI devices.
>> >
>> > The user-level interrupt code doesn't depend on the other
>> patches, and > is probably the most mature of this patchset.
>> 
>> Okay, I like it; it means way easier PCI driver development.

Jon> It won't help with PCI driver development. I tried implementing
Jon> this for UML. If your driver has any bugs it won't get the
Jon> interrupts acknowledged correctly and you'll end up rebooting.

That's not actually true, at least when we developed drivers here.
The only times we had to reboot were the times we mucked up the dma
register settings, and dma'd all over the kernel by mistake...

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Jon" == Jon Smirl <[EMAIL PROTECTED]> writes:

Jon> On Fri, 11 Mar 2005 14:36:10 +1100, Peter Chubb
Jon> <[EMAIL PROTECTED]> wrote:
>>  As many of you will be aware, we've been working on infrastructure
>> for user-mode PCI and other drivers.  The first step is to be able
>> to handle interrupts from user space. Subsequent patches add
>> infrastructure for setting up DMA for PCI devices.

Jon> I've tried implementing this before and could not get around the
Jon> interrupt problem. Most interrupts on the x86 architecture are
Jon> shared.  Disabling the IRQ at the PIC blocks all of the shared

Fortunately, most interrupts on IA64, ARM, etc.,  are unshared.  And
with PCI-Express, the problem will go away.  Even on X86, things
aren't all bad: one can usually find a PCI slot which doesn't share
interrupts with anything you care about.

The scenario I'm thinking about with these patches is things like
low-latency user-level networking between nodes in a cluster, where
for good performance even with a kernel driver you don't want to share
your interrupt line with anything else.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)

2005-03-13 Thread Peter Chubb
>>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes:

Greg> On Fri, Mar 11, 2005 at 07:34:46PM +1100, Peter Chubb wrote:
>> >>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes:
>> 
Greg> On Fri, Mar 11, 2005 at 02:37:17PM +1100, Peter Chubb wrote:
>> >> +/* + * The PCI subsystem is implemented as yet-another pseudo
>> >> filesystem, + * albeit one that is never mounted.  + * This is
>> its >> magic number.  + */ +#define USR_PCI_MAGIC (0x12345678)
>> 
Greg> If you make it a real, mountable filesystem, then you don't need
Greg> to have any of your new syscalls, right?  Why not just do that
Greg> instead?
>> 
>> 
>> The only call that would go is usr_pci_open() -- you'd still need
>> usr_pci_map()

Greg> see mmap(2)

mmap maps a file's contents into your own virtual memory.
usr_pci_map maps part of your own virtual memory into PCI bus space
for a particular device (using the IOMMU if your machine has one), and
returns a scatterlist of bus addresses to hand to the device.

Different semantics entirely.


Greg> In fact, both of the above can be done today from /proc/bus/pci/
Greg> right?

Nope.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-13 Thread Peter Chubb

On Fri, 2005-03-11 at 03:36, Peter Chubb wrote:
> +static irqreturn_t irq_proc_irq_handler(int irq, void *vidp, struct pt_regs *regs)
> +{
> + struct irq_proc *idp = (struct irq_proc *)vidp;
> + 
> + BUG_ON(idp->irq != irq);
> + disable_irq_nosync(irq);
> + atomic_inc(&idp->count);
> + wake_up(&idp->q);
> + return IRQ_HANDLED;

Alan> You just deadlocked the machine in many configurations. You can't use
Alan> disable_irq for this trick you have to tell the kernel how to handle it.


Can you elaborate, please?  In particular, why doesn't essentially the
same action (disabling an interrupt before the EOI) in
note_interrupt() lock up the machine?

I can see there'd be problems if the code allowed shared interrupts,
but it doesn't.


--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: Microstate Accounting for 2.6.11

2005-03-11 Thread Peter Chubb
>>>>> "Andi" == Andi Kleen <[EMAIL PROTECTED]> writes:

Andi> Andrew Morton <[EMAIL PROTECTED]> writes:
>> Why does the kernel need this feature?
>> 
>> Have you any numbers on the overhead?

Andi> It does RDTSC and lots of complicated stuff twice for each
Andi> system call.  On P4 this will be extremly slow (> 1000cycles
Andi> combined) It is pretty unlikely that whatever it does justifies
Andi> this extreme overhead in a critical fast path.

Not really `lots of complicated stuff'.  Just swap a timer and set a
flag on entry:

msp->timers[msp->laststate] += now - msp->lastchange
msp->lastchange = now
msp->laststate = ONCPU_SYS
msp->cflags |= MSA_SYS


And swap timers and clear the flag on exit.  The flag's needed to
force return to ONCPU_SYS rather than ONCPU_USR if the task is preempted or
interrupted while in a system call.

If there's a simpler, cheaper, faster way to track time spent in
system calls (as opposed to time spent in interrupt handlers, or on
the run queue), then I'd like to know what it is.

And I recognise there are lots of people who don't want this ---
but there are some who do.  I've maintained this patch since mid 2003,
and have seen a steady trickle of downloads --- one or two a week.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)

2005-03-11 Thread Peter Chubb
>>>>> "Greg" == Greg KH <[EMAIL PROTECTED]> writes:

Greg> On Fri, Mar 11, 2005 at 02:37:17PM +1100, Peter Chubb wrote:
>> +/* + * The PCI subsystem is implemented as yet-another pseudo
>> filesystem, + * albeit one that is never mounted.  + * This is its
>> magic number.  + */ +#define USR_PCI_MAGIC (0x12345678)

Greg> If you make it a real, mountable filesystem, then you don't need
Greg> to have any of your new syscalls, right?  Why not just do that
Greg> instead?


The only call that would go is usr_pci_open() -- you'd still need 
usr_pci_map(), usr_pci_unmap() and usr_pci_get_consistent().

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*



Re: Microstate Accounting for 2.6.11

2005-03-10 Thread Peter Chubb
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> Peter Chubb <[EMAIL PROTECTED]> wrote:
>>  Timing data on threads at present is pretty crude: when the timer
>> interrupt occurs, a tick is added to either system time or user
>> time for the currently running thread.  Thus in an unpacthed kernel
>> one can distinguish three timed states: On-cpu in userspace, on-cpu
>> in system space, and not running.
>> 
>> The actual number of states is much larger.  A thread can be on a
>> runqueue or the expired queue (i.e., ready to run but not running),
>> sleeping on a semaphore or on a futex, having its time stolen to
>> service an interrupt, etc., etc.
>> 
>> This patch adds timers per-state to each struct task_struct, so
>> that time in all these states can be tracked.  This patch contains
>> the core code do the timing, and to initialise the timers.
>> Subsequent patches enable the code (by adding Kconfig options) and
>> add hooks to track state changes.

Andrew> Why does the kernel need this feature?

I find that it's useful when trying to work out why a thread is going
more slowly than it needs to.  Userspace tools in the CVS repository
at gelato.unsw.edu.au let you graph in real time the time spent in
each state, so you get graphs like this:

 http://gelato.unsw.edu.au/patches/snapshot.png

which shows mplay skipping because of a slow disk/filesystem.

Andrew> Have you any numbers on the overhead?

Around 5% on LMbench context switch numbers for uniprocessor,
negligible on SMP (but SMP context switch results are horrible at the
moment according to LMbench2 -- almost 16usec); select on 10 fd goes
from 1.665 usec to 1.701 usec.

Andrew> The preempt_disable() in sys_msa() seems odd.

Yes, I only added that yesterday.  It's to prevent migration while
updating the current timer.  All the other places where the current
timer is updated are naturally protected against this.  It should
probably be a local_irq_disable() instead.

Peter C



Microstate accounting, IA64 support

2005-03-10 Thread Peter Chubb
Microstate Accounting: 
Add support for IA64.


 linux-2.6-ustate/arch/ia64/Kconfig   |   25 +++
 linux-2.6-ustate/arch/ia64/kernel/entry.S|   44 +++
 linux-2.6-ustate/arch/ia64/kernel/irq_ia64.c |   21 +++-
 linux-2.6-ustate/arch/ia64/kernel/ivt.S  |8 +++-
 linux-2.6-ustate/include/asm-ia64/msa.h  |   33 
 linux-2.6-ustate/include/asm-ia64/unistd.h   |1 
 7 files changed, 129 insertions(+), 5 deletions(-)

Index: linux-2.6-ustate/arch/ia64/Kconfig
===
--- linux-2.6-ustate.orig/arch/ia64/Kconfig 2005-03-10 09:13:01.780632777 +1100
+++ linux-2.6-ustate/arch/ia64/Kconfig  2005-03-10 09:16:14.593655619 +1100
@@ -302,6 +302,31 @@
  little bigger and slows down execution a bit, but it is generally
  a good idea to turn this on.  If you're unsure, say Y.
 
+config MICROSTATE
+   bool "Microstate accounting"
+   help
+ This option causes the kernel to keep very accurate track of
+ how long your threads spend on the runqueues, running, or asleep or
+ stopped.  It will slow down your kernel.
+ Times are reported in /proc/pid/msa and through a new msa()
+ system call.
+choice
+   depends on MICROSTATE
+   prompt "Microstate timing source"
+   default MICROSTATE_ITC
+   help
+  On IA64 one can use two timing sources for the microstate
+  accounting;  the on-chip interval counter, or Linux's
+  time-of-day clock.  The first is very cheap; the other is
+  more accurate on SMP systems.
+
+config MICROSTATE_ITC
+   bool "Use on-chip ITC for microstate timing"
+ 
+config MICROSTATE_TOD
+   bool "Use time-of-day clock for microstate timings"
+endchoice
+
 config IA64_PALINFO
tristate "/proc/pal support"
help
Index: linux-2.6-ustate/include/asm-ia64/msa.h
===
--- /dev/null   1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-ustate/include/asm-ia64/msa.h 2005-03-10 09:16:14.594632174 +1100
@@ -0,0 +1,33 @@
+/*
+ * asm-ia64/msa.h
+ *
+ * Provide an architecture-specific clock.
+ */
+
+#ifndef _ASM_IA64_MSA_H
+#define _ASM_IA64_MSA_H
+
+#include 
+#include 
+#include 
+
+
+# if defined(CONFIG_MICROSTATE_ITC)
+#   define MSA_NOW(now)  do { now = (clk_t)get_cycles(); } while (0)
+
+#   define MSA_TO_NSEC(clk) ((1000000000*clk) / cpu_data(smp_processor_id())->itc_freq)
+
+# elif defined(CONFIG_MICROSTATE_TOD)
+static inline void msa_now(clk_t *nsp) {
+   struct timeval tv;
+   do_gettimeofday(&tv);
+   *nsp = tv.tv_sec * 1000000 + tv.tv_usec;
+}
+#   define MSA_NOW(x) msa_now(&x)
+#   define MSA_TO_NSEC(clk) ((clk) * 1000)
+
+# else
+#  include 
+# endif
+
+#endif /* _ASM_IA64_MSA_H */
Microstate Accounting: Track time in system calls for IA64

 arch/ia64/kernel/entry.S |   44 
 arch/ia64/kernel/ivt.S   |8 ++--
 2 files changed, 50 insertions(+), 2 deletions(-)

Index: linux-2.6-ustate/arch/ia64/kernel/entry.S
===
--- linux-2.6-ustate.orig/arch/ia64/kernel/entry.S  2005-03-10 09:13:01.149778160 +1100
+++ linux-2.6-ustate/arch/ia64/kernel/entry.S   2005-03-10 09:16:15.157128068 +1100
@@ -589,6 +589,46 @@
 .ret4: br.cond.sptk ia64_leave_kernel
 END(ia64_strace_leave_kernel)
 
+#ifdef CONFIG_MICROSTATE
+/*
+ * preserve input registers,
+ * and r8
+ */
+GLOBAL_ENTRY(invoke_msa_end_syscall)
+   .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
+   alloc loc1=ar.pfs,8,4,0,0
+   mov loc0=rp
+   .body
+   ;;
+   mov loc2=ret0
+   mov loc3=ret2
+   br.call.sptk.many rp=msa_end_syscall
+1: mov rp=loc0
+   mov ret0=loc2
+   mov ret2=loc3
+   mov ar.pfs=loc1
+   br.ret.sptk.many rp
+END(invoke_msa_end_syscall)
+/*
+ * Preserves in0-7, and all callee-save registers.
+ */
+GLOBAL_ENTRY(invoke_msa_start_syscall)
+   .prologue ASM_UNW_PRLG_RP|ASM_UNW_PRLG_PFS, ASM_UNW_PRLG_GRSAVE(8)
+   alloc loc1=ar.pfs,8,4,0,0
+   mov loc0=rp
+   .body
+   mov loc2=r3
+   mov loc3=r15
+   ;;
+   br.call.sptk.many rp=msa_start_syscall
+1: mov r15=loc3
+   mov r3=loc2
+   mov ar.pfs=loc1
+   mov rp=loc0
+   br.ret.sptk.many rp
+END(invoke_msa_start_syscall)
+#endif /* CONFIG_MICROSTATE */
+
 GLOBAL_ENTRY(ia64_ret_from_clone)
PT_REGS_UNWIND_INFO(0)
 {  /*
@@ -671,6 +711,10 @@
  */
 ENTRY(ia64_leave_syscall)
PT_REGS_UNWIND_INFO(0)
+#ifdef CONFIG_MICROSTATE
+   br.call.sptk.many rp=invoke_msa_end_syscall
+1: 
+#endif
/*
 * work.need_resched etc. mustn't get changed by this CPU before it returns to
 * user- or fsys-mode, hence we di

Microstate Accounting for 2.6.11, patch 4/6

2005-03-10 Thread Peter Chubb
Microstate accounting:  Account for time in interrupt handlers for I386.

 arch/i386/kernel/irq.c |   13 -
 1 files changed, 12 insertions(+), 1 deletion(-)


Index: linux-2.6-ustate/arch/i386/kernel/irq.c
===
--- linux-2.6-ustate.orig/arch/i386/kernel/irq.c  2005-03-10 09:13:00.115606274 +1100
+++ linux-2.6-ustate/arch/i386/kernel/irq.c 2005-03-10 09:16:16.032121680 +1100
@@ -55,6 +55,8 @@
 #endif
 
irq_enter();
+   msa_start_irq(irq);
+   
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
{
@@ -101,6 +103,7 @@
 #endif
__do_IRQ(irq, regs);
 
+   msa_finish_irq(irq);
irq_exit();
 
return 1;
@@ -221,10 +224,18 @@
seq_printf(p, "%3d: ",i);
 #ifndef CONFIG_SMP
seq_printf(p, "%10u ", kstat_irqs(i));
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(0, i));
+#endif
 #else
for (j = 0; j < NR_CPUS; j++)
-   if (cpu_online(j))
+   if (cpu_online(j)) {
seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(j, i));
+#endif
+   }
+
 #endif
seq_printf(p, " %14s", irq_desc[i].handler->typename);
seq_printf(p, "  %s", action->name);


Microstate Accounting for 2.6.11, patch 6/6

2005-03-10 Thread Peter Chubb


Microstate accounting: Track time spent asleep while paging,
in poll() or select(), or on a futex separately from other sleeps.

 fs/select.c |2 ++
 kernel/futex.c |2 ++
 mm/memory.c |6 +-


Index: linux-2.6-ustate/mm/memory.c
===
--- linux-2.6-ustate.orig/mm/memory.c   2005-03-10 09:12:59.492564100 +1100
+++ linux-2.6-ustate/mm/memory.c2005-03-10 09:16:16.583875465 +1100
@@ -2079,6 +2079,7 @@
if (is_vm_hugetlb_page(vma))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */
 
+   msa_next_state(current, PAGING_SLEEP);
/*
 * We need the page table lock to synchronize with kswapd
 * and the SMP-safe atomic PTE updates.
@@ -2098,10 +2099,13 @@
if (!pte)
goto oom;

-   return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+   int ret = handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+   msa_next_state(current, MSA_UNKNOWN);
+   return ret;
 
  oom:
spin_unlock(&mm->page_table_lock);
+   msa_next_state(current, MSA_UNKNOWN);
return VM_FAULT_OOM;
 }
 

Index: linux-2.6-ustate/kernel/futex.c
===
--- linux-2.6-ustate.orig/kernel/futex.c2005-03-10 09:12:58.843154938 
+1100
+++ linux-2.6-ustate/kernel/futex.c 2005-03-10 09:16:17.109262256 +1100
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
@@ -571,6 +572,7 @@
 * wakes us up.
 */
 
+   msa_next_state(current, FUTEX_SLEEP);
/* add_wait_queue is the barrier after __set_current_state. */
__set_current_state(TASK_INTERRUPTIBLE);
add_wait_queue(&q.waiters, &wait);


Index: linux-2.6-ustate/fs/select.c
===
--- linux-2.6-ustate.orig/fs/select.c   2005-03-10 09:12:59.182996124 +1100
+++ linux-2.6-ustate/fs/select.c2005-03-10 09:16:16.843639194 +1100
@@ -256,6 +256,7 @@
retval = table.error;
break;
}
+   msa_next_state(current, POLL_SLEEP);
__timeout = schedule_timeout(__timeout);
}
__set_current_state(TASK_RUNNING);
@@ -447,6 +448,7 @@
count = wait->error;
if (count)
break;
+   msa_next_state(current, POLL_SLEEP);
timeout = schedule_timeout(timeout);
}
__set_current_state(TASK_RUNNING);


Microstate Accounting for 2.6.11, patch 5/6

2005-03-10 Thread Peter Chubb
Microstate accounting: Add the I386 system call.

 arch/i386/kernel/entry.S  |2 +-
 include/asm-i386/unistd.h |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6-ustate/arch/i386/kernel/entry.S
===
--- linux-2.6-ustate.orig/arch/i386/kernel/entry.S  2005-03-10 
09:16:14.888575341 +1100
+++ linux-2.6-ustate/arch/i386/kernel/entry.S   2005-03-10 09:16:15.446188457 
+1100
@@ -876,7 +876,7 @@
.long sys_mq_getsetattr
.long sys_ni_syscall/* reserved for kexec */
.long sys_waitid
-   .long sys_ni_syscall/* 285 */ /* available */
+   .long sys_msa   /* 285 */ /* available */
.long sys_add_key
.long sys_request_key
.long sys_keyctl
Index: linux-2.6-ustate/include/asm-i386/unistd.h
===
--- linux-2.6-ustate.orig/include/asm-i386/unistd.h 2005-03-10 
09:13:00.813843194 +1100
+++ linux-2.6-ustate/include/asm-i386/unistd.h  2005-03-10 09:16:15.448141568 
+1100
@@ -290,7 +290,7 @@
 #define __NR_mq_getsetattr (__NR_mq_open+5)
 #define __NR_sys_kexec_load283
 #define __NR_waitid284
-/* #define __NR_sys_setaltroot 285 */
+#define __NR_sys_msa   285
 #define __NR_add_key   286
 #define __NR_request_key   287
 #define __NR_keyctl288


Microstate Accounting for 2.6.11, patch 3/6

2005-03-10 Thread Peter Chubb

Microstate accounting:  

Provide I386-dependent MSA clocks, and Kconfig options.

 arch/i386/Kconfig  |   39 ++-
 include/asm-i386/msa.h |   49 +
 2 files changed, 87 insertions(+), 1 deletion(-)

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.6-ustate/arch/i386/Kconfig
===
--- linux-2.6-ustate.orig/arch/i386/Kconfig 2005-03-11 09:59:38.773632446 
+1100
+++ linux-2.6-ustate/arch/i386/Kconfig  2005-03-11 09:59:38.777538666 +1100
@@ -923,8 +923,45 @@
 
  If unsure, say Y. Only embedded should say N here.
 
-endmenu
+config MICROSTATE
+   bool "Microstate accounting"
+   help
+ This option causes the kernel to keep very accurate track of
+how long your threads spend on the runqueues, running, or asleep or
+stopped.  It will slow down your kernel.
+Times are reported in /proc/pid/msa and through a new msa()
+system call.
+
+choice 
+   depends on MICROSTATE
+   prompt "Microstate timing source"
+   default MICROSTATE_TSC
+
+config MICROSTATE_PM
+   bool "Use Power-Management timer for microstate timings"
+   depends on X86_PM_TIMER
+   help
+If your machine is ACPI enabled and uses power-management, then the 
+TSC runs at a variable rate, which will distort the 
+microstate measurements.  This timer, although having
+slightly more overhead, and a lower resolution (279
+nanoseconds or so) will always run at a constant rate.
+
+config MICROSTATE_TSC
+   bool "Use on-chip TSC for microstate timings"
+   depends on X86_TSC
+   help
+ If your machine's clock runs at constant rate, then this timer 
+gives you cycle precision in measuring times spent in microstates.
+
+config MICROSTATE_TOD
+   bool "Use time-of-day clock for microstate timings"
+   help
+ If none of the other timers are any good for you, this timer 
+will give you micro-second precision.
+endchoice
 
+endmenu
 
 menu "Power management options (ACPI, APM)"
depends on !X86_VOYAGER
Index: linux-2.6-ustate/include/asm-i386/msa.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-ustate/include/asm-i386/msa.h 2005-03-11 09:59:38.779491777 
+1100
@@ -0,0 +1,49 @@
+/*
+ * asm-i386/msa.h
+ *
+ * Provide an architecture-specific clock.
+ */
+
+#ifndef _ASM_I386_MSA_H
+# define _ASM_I386_MSA_H
+
+# include 
+
+
+# if defined(CONFIG_MICROSTATE_TSC)
+/*
+ * Use the processor's time-stamp counter as a timesource
+ */
+#  include 
+#  include 
+
+#  define MSA_NOW(now)  rdtscll(now)
+
+extern unsigned long cpu_khz;
+#  define MSA_TO_NSEC(clk) ({ clk_t _x = ((clk) * 1000000ULL); do_div(_x, cpu_khz); _x; })
+
+# elif defined(CONFIG_MICROSTATE_PM)
+/*
+ * Use the system's monotonic clock as a timesource.
+ * This will only be enabled if the Power Management Timer is enabled.
+ */
+unsigned long long monotonic_clock(void);
+#  define MSA_NOW(now) do { now = monotonic_clock(); } while (0)
+#  define MSA_TO_NSEC(clk) (clk)
+
+# elif defined(CONFIG_MICROSTATE_TOD)
+/*
+ * Fall back to gettimeofday.
+ * This one is incompatible with interrupt-time measurement on some processors.
+ */
+static inline void msa_now(clk_t *nsp) {
+   struct timeval tv;
+   do_gettimeofday(&tv);
+   *nsp = tv.tv_sec * 1000000 + tv.tv_usec;
+}
+#   define MSA_NOW(x) msa_now(&x)
+#   define MSA_TO_NSEC(clk) ((clk) * 1000)
+# endif
+
+
+#endif /* _ASM_I386_MSA_H */
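
The unit conversions behind the TSC and time-of-day clock sources can be checked in user space. The following is a minimal sketch, not code from the patch: the `sketch_` names are hypothetical, and it assumes a cycle counter whose frequency is known in kHz, so that nanoseconds = cycles * 10^6 / kHz.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a cycle count to nanoseconds for a clock running at cpu_khz.
 * One cycle lasts 10^6 / cpu_khz nanoseconds, so multiply first to keep
 * precision, then divide.
 */
static inline uint64_t sketch_cycles_to_nsec(uint64_t cycles, uint64_t cpu_khz)
{
	assert(cpu_khz != 0);
	return cycles * 1000000ULL / cpu_khz;
}

/* Fold a seconds/microseconds pair into a single microsecond count. */
static inline uint64_t sketch_tod_to_usec(uint64_t sec, uint64_t usec)
{
	return sec * 1000000ULL + usec;
}

/* Microseconds to nanoseconds, as a time-of-day based clock would need. */
static inline uint64_t sketch_usec_to_nsec(uint64_t usec)
{
	return usec * 1000ULL;
}
```

Note that the multiplication in `sketch_cycles_to_nsec()` wraps once the cycle count exceeds about 1.8e13, so a sketch like this is only suitable for intervals, not for time since boot.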


Microstate Accounting for 2.6.11, patch 2/6

2005-03-10 Thread Peter Chubb
Microstate Accounting:
Add hooks into the scheduler to track state changes.
Arrange for parent process's child times to be updated at process exit. 


 kernel/sched.c |8 
 kernel/exit.c  |3 +++

Index: linux-2.6-ustate/kernel/sched.c
===
--- linux-2.6-ustate.orig/kernel/sched.c2005-03-11 09:59:31.109628035 
+1100
+++ linux-2.6-ustate/kernel/sched.c 2005-03-11 09:59:31.116463921 +1100
@@ -635,6 +635,7 @@
  */
 static inline void __activate_task(task_t *p, runqueue_t *rq)
 {
+   msa_set_timer(p, ONACTIVEQUEUE);
enqueue_task(p, rq->active);
rq->nr_running++;
 }
@@ -1238,6 +1239,7 @@
if (unlikely(!current->array))
__activate_task(p, rq);
else {
+   msa_set_timer(p, ONACTIVEQUEUE);
p->prio = current->prio;
list_add_tail(&p->run_list, &current->run_list);
p->array = current->array;
@@ -2422,6 +2424,7 @@
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
+   msa_next_state(p, ONEXPIREDQUEUE);
enqueue_task(p, rq->expired);
if (p->static_prio < rq->best_expired_prio)
rq->best_expired_prio = p->static_prio;
@@ -2733,6 +2736,7 @@
array = rq->active;
rq->expired_timestamp = 0;
rq->best_expired_prio = MAX_PRIO;
+   msa_flip_expired(prev);
} else
schedstat_inc(rq, sched_noswitch);
 
@@ -2773,6 +2777,8 @@
rq->curr = next;
++*switch_count;
 
+   msa_switch(prev, next);
+
prepare_arch_switch(rq, next);
prev = context_switch(rq, prev, next);
barrier();
@@ -3693,6 +3699,8 @@
 */
if (rt_task(current))
target = rq->active;
+   else
+   msa_next_state(current, ONEXPIREDQUEUE);
 
if (current->array->nr_active == 1) {
schedstat_inc(rq, yld_act_empty);


Index: linux-2.6-ustate/kernel/exit.c
===
--- linux-2.6-ustate.orig/kernel/exit.c 2005-03-11 09:59:36.360564796 +1100
+++ linux-2.6-ustate/kernel/exit.c  2005-03-11 09:59:36.364471017 +1100
@@ -93,6 +93,9 @@
}
 
sched_exit(p);
+
+   msa_update_parent(p->parent, p);
+
write_unlock_irq(&tasklist_lock);
spin_unlock(&p->proc_lock);
proc_pid_flush(proc_dentry);


Microstate Accounting for 2.6.11, patch 3/

2005-03-10 Thread Peter Chubb

Microstate Accounting: Track time in system calls and interrupts, i386 code.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

 arch/i386/kernel/entry.S |   16 
 arch/i386/kernel/irq.c |   13 -


Index: linux-2.6-ustate/arch/i386/kernel/entry.S
===
--- linux-2.6-ustate.orig/arch/i386/kernel/entry.S  2005-03-10 
09:13:01.448604031 +1100
+++ linux-2.6-ustate/arch/i386/kernel/entry.S   2005-03-10 09:16:14.888575341 
+1100
@@ -222,10 +222,18 @@
/* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp)
jnz syscall_trace_entry
+#ifdef CONFIG_MICROSTATE
+   pushl   %eax
+   call msa_start_syscall
+   popl%eax
+#endif
cmpl $(nr_syscalls), %eax
jae syscall_badsys
call *sys_call_table(,%eax,4)
movl %eax,EAX(%esp)
+#ifdef CONFIG_MICROSTATE
+   call msa_end_syscall
+#endif
cli
movl TI_flags(%ebp), %ecx
testw $_TIF_ALLWORK_MASK, %cx
@@ -250,9 +258,17 @@
cmpl $(nr_syscalls), %eax
jae syscall_badsys
 syscall_call:
+#ifdef CONFIG_MICROSTATE
+   pushl   %eax
+   call msa_start_syscall
+   popl%eax
+#endif
call *sys_call_table(,%eax,4)
movl %eax,EAX(%esp) # store the return value
 syscall_exit:
+#ifdef CONFIG_MICROSTATE
+   call msa_end_syscall
+#endif
cli # make sure we don't miss an interrupt
# setting need_resched or sigpending
# between sampling and the iret


Index: linux-2.6-ustate/arch/i386/kernel/irq.c
===
--- linux-2.6-ustate.orig/arch/i386/kernel/irq.c2005-03-10 
09:13:00.115606274 +1100
+++ linux-2.6-ustate/arch/i386/kernel/irq.c 2005-03-10 09:16:16.032121680 
+1100
@@ -55,6 +55,8 @@
 #endif
 
irq_enter();
+   msa_start_irq(irq);
+   
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
{
@@ -101,6 +103,7 @@
 #endif
__do_IRQ(irq, regs);
 
+   msa_finish_irq(irq);
irq_exit();
 
return 1;
@@ -221,10 +224,18 @@
seq_printf(p, "%3d: ",i);
 #ifndef CONFIG_SMP
seq_printf(p, "%10u ", kstat_irqs(i));
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(0, i));
+#endif
 #else
for (j = 0; j < NR_CPUS; j++)
-   if (cpu_online(j))
+   if (cpu_online(j)) {
seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+#ifdef CONFIG_MICROSTATE
+   seq_printf(p, "%10llu", msa_irq_time(j, i));
+#endif
+   }
+
 #endif
seq_printf(p, " %14s", irq_desc[i].handler->typename);
seq_printf(p, "  %s", action->name);


Microstate Accounting for 2.6.11

2005-03-10 Thread Peter Chubb


Microstate Accounting
---------------------

Timing data on threads at present is pretty crude:  when the timer
interrupt occurs, a tick is added to either system time or user time
for the currently running thread.  Thus in an unpatched kernel one can
distinguish three timed states:  On-cpu in userspace, on-cpu in system
space, and not running.

The actual number of states is much larger.  A thread can be on a
runqueue or the expired queue (i.e., ready to run but not running),
sleeping on a semaphore or on a futex, having its time stolen to
service an interrupt, etc., etc.

This patch adds timers per-state to each struct task_struct, so that
time in all these states can be tracked.  This patch contains the core
code to do the timing, and to initialise the timers.  Subsequent patches
enable the code (by adding Kconfig options) and add hooks to track
state changes.
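
The per-state timer idea can be illustrated outside the kernel. The following is a minimal user-space sketch, not the code in kernel/msa.c: the `sketch_` names and the reduced state list are hypothetical, and timestamps are passed in by the caller rather than read from a hardware clock. On every transition, the time since the last change is charged to the state being left.

```c
#include <assert.h>
#include <stdint.h>

/* A reduced, hypothetical set of states for illustration only. */
enum sketch_state {
	SK_ONCPU_USER,
	SK_ONCPU_SYS,
	SK_ON_RUNQUEUE,
	SK_SLEEPING,
	SK_NR_STATES
};

struct sketch_microstates {
	enum sketch_state cur_state;
	uint64_t last_change;          /* timestamp of the last transition */
	uint64_t timers[SK_NR_STATES]; /* accumulated time per state */
};

/* Zero all timers and start accounting in the given state. */
static inline void sketch_init(struct sketch_microstates *ms,
			       enum sketch_state initial, uint64_t now)
{
	for (int i = 0; i < SK_NR_STATES; i++)
		ms->timers[i] = 0;
	ms->cur_state = initial;
	ms->last_change = now;
}

/* Charge elapsed time to the current state, then switch to the next one. */
static inline void sketch_next_state(struct sketch_microstates *ms,
				     enum sketch_state next, uint64_t now)
{
	assert(next < SK_NR_STATES);
	ms->timers[ms->cur_state] += now - ms->last_change;
	ms->cur_state = next;
	ms->last_change = now;
}
```

Because each transition charges exactly one interval to exactly one state, the timers always sum to the time elapsed between initialisation and the most recent transition.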

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

 include/asm-generic/msa.h  |   21 ++
 include/linux/msa-kernel.h |   99 +
 include/linux/msa.h|   46 
 include/linux/sched.h  |4 
 kernel/Makefile|2 
 kernel/fork.c  |2 
 kernel/msa.c   |  472 +
 7 files changed, 645 insertions(+), 1 deletion(-)

Index: linux-2.6-ustate/kernel/msa.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-ustate/kernel/msa.c   2005-03-11 09:58:20.574030768 +1100
@@ -0,0 +1,472 @@
+/*
+ * Microstate accounting.
+ * Try to account for various states much more accurately than
+ * the normal code does.
+ *
+ * Copyright (c) Peter Chubb 2005
+ *  UNSW and National ICT Australia
+ * This code is released under the Gnu Public Licence, version 2.
+ */
+
+
+#include 
+#include 
+#include 
+#include 
+#ifdef CONFIG_MICROSTATE
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+/*
+ * Track time spent in interrupt handlers.
+ */
+struct msa_irq {
+   clk_t times;
+   clk_t last_entered;
+};
+
+/*
+ * When the scheduler last swapped active and expired queues
+ */
+static DEFINE_PER_CPU(clk_t, queueflip_time);
+
+/*
+ * Time spent in interrupt handlers
+ */
+static DEFINE_PER_CPU(struct msa_irq[NR_IRQS+1], msa_irq);
+
+
+/**
+ * msa_switch: Update microstate timers when switching from one task to another.
+ * @prev, @next:  The prev task is coming off the processor;
+ *the new task is about to run on the processor.
+ *
+ * Update the times in both prev and next.  It may be necessary to infer the 
+ * next state for each task.
+ *
+ */
+void
+msa_switch(struct task_struct *prev, struct task_struct *next)
+{
+   struct microstates *msprev = &prev->microstates;
+   struct microstates *msnext = &next->microstates;
+   clk_t now;
+   enum thread_state next_state;
+   int interrupted = msprev->cur_state == INTERRUPTED;
+
+   preempt_disable();
+
+   MSA_NOW(now);
+
+   if (msprev->flags & QUEUE_FLIPPED) {
+   __get_cpu_var(queueflip_time) = now;
+   msprev->flags &= ~QUEUE_FLIPPED;
+   }
+
+   /*
+* If the queues have been flipped,
+* update the state as of the last flip time.
+*/
+   if (msnext->cur_state == ONEXPIREDQUEUE) {
+   clk_t qfp = per_cpu(queueflip_time, msnext->lastqueued);
+   msnext->cur_state = ONACTIVEQUEUE;
+   msnext->timers[ONEXPIREDQUEUE] += qfp - msnext->last_change;
+   msnext->last_change = qfp;
+   }
+
+   msprev->timers[msprev->cur_state] += now - msprev->last_change;
+   msnext->timers[msnext->cur_state] += now - msnext->last_change;
+   
+   /* Update states */
+   switch (msprev->next_state) {
+   case MSA_UNKNOWN:
+   /*
+* Infer from actual state
+*/
+   switch (prev->state) {
+   case TASK_INTERRUPTIBLE:
+   next_state = INTERRUPTIBLE_SLEEP;
+   break;
+   
+   case TASK_UNINTERRUPTIBLE:
+   next_state = UNINTERRUPTIBLE_SLEEP;
+   break;
+
+   case TASK_STOPPED:
+   next_state = STOPPED;
+   break;
+
+   case EXIT_DEAD:
+   case EXIT_ZOMBIE:
+   next_state = ZOMBIE;
+   break;
+
+   case TASK_RUNNING:  
+   next_state = ONACTIVEQUEUE;
+   break;
+
+   default:
+   next_state = MSA_UNKNOWN;
+   break;
+
+   } 
+   break;
+
+   case PAGING_SLEEP: /*
+   * Sleep states 

User mode drivers: part 2: PCI device handling (patch 2/2 for 2.6.11)

2005-03-10 Thread Peter Chubb

User-level drivers:  Add system calls for I386 and IA64.
Signed-Off-By: Peter Chubb <[EMAIL PROTECTED]>

# 
# arch/i386/kernel/entry.S  |4 
# arch/ia64/kernel/entry.S  |8 
# include/asm-i386/unistd.h |6 +-
# include/asm-ia64/unistd.h |4 
# 4 files changed, 17 insertions(+), 5 deletions(-)
#
Index: linux-2.6.11-usrdrivers/arch/ia64/kernel/entry.S
===
--- linux-2.6.11-usrdrivers.orig/arch/ia64/kernel/entry.S   2005-03-11 
13:59:28.940744950 +1100
+++ linux-2.6.11-usrdrivers/arch/ia64/kernel/entry.S2005-03-11 
13:59:41.236542676 +1100
@@ -1577,10 +1577,10 @@
data8 sys_add_key
data8 sys_request_key
data8 sys_keyctl
-   data8 sys_ni_syscall
-   data8 sys_ni_syscall// 1275
-   data8 sys_ni_syscall
-   data8 sys_ni_syscall
+   data8 sys_usr_pci_open
+   data8 sys_usr_pci_mmap  // 1275
+   data8 sys_usr_pci_munmap
+   data8 sys_usr_pci_get_consistent
data8 sys_ni_syscall
data8 sys_ni_syscall
 
Index: linux-2.6.11-usrdrivers/include/asm-i386/unistd.h
===
--- linux-2.6.11-usrdrivers.orig/include/asm-i386/unistd.h  2005-03-11 
13:59:28.942698059 +1100
+++ linux-2.6.11-usrdrivers/include/asm-i386/unistd.h   2005-03-11 
13:59:41.245331667 +1100
@@ -294,8 +294,12 @@
 #define __NR_add_key   286
 #define __NR_request_key   287
 #define __NR_keyctl288
+#define __NR_usr_pci_open  289
+#define __NR_usr_pci_mmap  (__NR_usr_pci_open+1)
+#define __NR_usr_pci_munmap(__NR_usr_pci_open+2)
+#define __NR_usr_pci_get_consistent(__NR_usr_pci_open+3)
 
-#define NR_syscalls 289
+#define NR_syscalls 293
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
Index: linux-2.6.11-usrdrivers/include/asm-ia64/unistd.h
===
--- linux-2.6.11-usrdrivers.orig/include/asm-ia64/unistd.h  2005-03-11 
13:59:28.942698059 +1100
+++ linux-2.6.11-usrdrivers/include/asm-ia64/unistd.h   2005-03-11 
13:59:41.247284776 +1100
@@ -263,6 +263,10 @@
 #define __NR_add_key   1271
 #define __NR_request_key   1272
 #define __NR_keyctl1273
+#define __NR_usr_pci_open   1274
+#define __NR_usr_pci_mmap   1275
+#define __NR_usr_pci_unmap  1276
+#define __NR_usr_pci_get_consistent 1277
 
 #ifdef __KERNEL__
 
Index: linux-2.6.11-usrdrivers/arch/i386/kernel/entry.S
===
--- linux-2.6.11-usrdrivers.orig/arch/i386/kernel/entry.S   2005-03-11 
13:59:28.941721505 +1100
+++ linux-2.6.11-usrdrivers/arch/i386/kernel/entry.S2005-03-11 
13:59:41.248261330 +1100
@@ -864,5 +864,9 @@
.long sys_add_key
.long sys_request_key
.long sys_keyctl
+   .long sys_usr_pci_open
+   .long sys_usr_pci_mmap  /* 290 */
+   .long sys_usr_pci_munmap
+   .long sys_usr_pci_get_consistent
 
 syscall_table_size=(.-sys_call_table)


User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)

2005-03-10 Thread Peter Chubb


USER LEVEL DRIVERS: enable PCI device drivers at user space.

This patch adds the capability for suitably privileged user-level processes to 
enable a PCI device, and set up DMA for it.  A subsequent patch hooks up 
the actual system calls.

There are four new system calls:

  long   usr_pci_open(int bus, int slot, int function, __u64 dma_mask);
 Returns a file descriptor for the PCI device described 
 by bus,slot,function.  It also enables the device, and sets it 
 up as a bus-mastering DMA device, with the specified dma mask.

 Error codes are:
ENOMEM: insufficient kernel memory to fulfil your request
ENOENT: the specified device doesn't exist, or is otherwise
invisible to Linux.
EBUSY: Another driver has claimed the device
EIO:   The specified dma mask is invalid for this device.
ENFILE: too many open files

  long usr_pci_get_consistent(int fd, size_t size, void **vaddrp, unsigned long *dmaaddrp)

Call pci_alloc_consistent() to get size worth of pci
consistent memory (currently an error if size != PAGESIZE); 
map the allocated memory into the user's address space; 
return the virtual user address in *vaddrp, and the bus 
address in *dmaaddrp

ERRORS:
EINVAL: the file descriptor was not one obtained from usr_pci_open(), or
size != PAGESIZE
ENOMEM: insufficient appropriate memory or insufficient free 
virtual address space in the user program.
EFAULT: vaddrp or dmaaddrp didn't point to writeable memory.

The mapping obtained can be cleaned up with munmap().

   long usr_pci_mmap(int fd, struct mapping_info *mp) -- 
map some memory for DMA to/from the device represented by fd, 
which was obtained from usr_pci_open().

struct mapping_info contains:
void *virtaddr -- the virtual address to dma to
int size -- how many bytes to set up
struct usr_pci_sglist *sglist -- a pointer to a scatterlist
int nents -- how many entries in the scatterlist
enum dma_data_direction direction --- which way the 
dma is going to happen.

The scatterlist should be sized at least size/PAGESIZE + 2.

usr_pci_mmap() will call pci_map_sg() on the virtual region, 
then copy the resulting scatterlist into *sglist.  The nents field 
will be updated with the actual number of scatterlist entries filled in.

Failure codes are:
EINVAL: the fd wasn't obtained from usr_pci_open, or 
direction wasn't one of DMA_TO_DEVICE, DMA_FROM_DEVICE 
or DMA_BIDIRECTIONAL, or the size of the 
scatterlist is insufficient to map the region.
EFAULT: mp was a bad pointer, or the region of memory spanned 
by (virtaddr, virtaddr + size) was not all mapped.
ENOMEM: insufficient appropriate memory

   long usr_pci_munmap(int fd, struct mapping_info *mp)
Unmap a dma region mapped by usr_pci_mmap().
struct mapping_info is the same one used in usr_pci_mmap().

Error codes are:
EINVAL: the fd wasn't obtained from usr_pci_open, or the 
  struct mapping_info was never mapped for this device
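
To make the usr_pci_mmap() calling convention concrete, here is a minimal user-space sketch of filling in the argument structure. It is not taken from include/linux/usrdrv.h: every `sketch_` name is hypothetical, the layout of a scatterlist entry is an assumption, and only the field list and the "size/PAGESIZE + 2" sizing rule come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for enum dma_data_direction; the values are assumptions. */
enum sketch_dma_direction {
	SK_DMA_BIDIRECTIONAL,
	SK_DMA_TO_DEVICE,
	SK_DMA_FROM_DEVICE
};

/* Assumed shape of one scatterlist entry: a bus address and a length. */
struct sketch_sg_entry {
	unsigned long dma_addr;
	unsigned int len;
};

/* Mirrors the fields listed for struct mapping_info. */
struct sketch_mapping_info {
	void *virtaddr;                      /* virtual address to DMA to/from */
	int size;                            /* bytes to set up */
	struct sketch_sg_entry *sglist;      /* caller-allocated scatterlist */
	int nents;                           /* entries available in sglist */
	enum sketch_dma_direction direction; /* which way the DMA goes */
};

/* Minimum scatterlist length per the rule size/PAGESIZE + 2. */
static inline int sketch_sg_entries_needed(size_t size, size_t pagesize)
{
	assert(pagesize != 0);
	return (int)(size / pagesize) + 2;
}

/* Allocate a scatterlist of the minimum size and fill in the request. */
static inline int sketch_prepare_mapping(struct sketch_mapping_info *mp,
					 void *buf, int size, size_t pagesize,
					 enum sketch_dma_direction dir)
{
	assert(mp != NULL && size > 0);
	mp->nents = sketch_sg_entries_needed((size_t)size, pagesize);
	mp->sglist = calloc((size_t)mp->nents, sizeof *mp->sglist);
	if (mp->sglist == NULL)
		return -1;
	mp->virtaddr = buf;
	mp->size = size;
	mp->direction = dir;
	return 0;
}
```

A caller would pass the prepared structure to usr_pci_mmap(), read back `nents` to learn how many entries were actually filled in, and hand the same structure to usr_pci_munmap() when the transfer is done.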


Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>  


#
# drivers/Makefile   |3 
# drivers/pci/Kconfig|6 
# drivers/usr/Makefile   |2 
# drivers/usr/sys.c  |  952 
+
# include/linux/usrdrv.h |   63 +++
# 5 files changed, 1026 insertions(+)
#
Index: linux-2.6.11-usrdrivers/drivers/Makefile
===
--- linux-2.6.11-usrdrivers.orig/drivers/Makefile   2005-03-11 
12:25:29.169139978 +1100
+++ linux-2.6.11-usrdrivers/drivers/Makefile2005-03-11 12:25:41.159270471 
+1100
@@ -13,6 +13,9 @@
 # was used and do nothing if so
 obj-$(CONFIG_PNP)  += pnp/
 
+# User level device drivers
+obj-$(CONFIG_USRDEV)   += usr/
+
 # char/ comes before serial/ etc so that the VT console is the boot-time
 # default.
 obj-y  += char/
Index: linux-2.6.11-usrdrivers/drivers/usr/Makefile
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.11-usrdrivers/drivers/usr/Makefile2005-03-11 
12:25:41.160247026 +1100
@@ -0,0 +1,2 @@
+obj-y  += sys.o 
+obj-$(CONFIG_USRBLKDEV) += blkdev.o
Index: linux-2.6.11-usrdrivers/drivers/usr/sys.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.11-usrdrivers/drivers/usr/sys.c   2005-03-11 14:15:59.897394833 
+1100
@@ -0,0 +1,952 @@
+/*
+ * Expose PCI-DMA interface to user mode.
+ *
+ * Copyrig

User mode drivers: part 1, interrupt handling (patch for 2.6.11)

2005-03-10 Thread Peter Chubb

As many of you will be aware, we've been working on infrastructure for
user-mode PCI and other drivers.  The first step is to be able to
handle interrupts from user space. Subsequent patches add
infrastructure for setting up DMA for PCI devices.

The user-level interrupt code doesn't depend on the other patches, and
is probably the most mature of this patchset.


This patch adds a new file to /proc/irq// called irq.  Suitably 
privileged processes can open this file.  Reading the file returns the 
number of interrupts (if any) that have occurred since the last read.
If the file is opened in blocking mode, reading it blocks until 
an interrupt occurs.  poll(2) and select(2) work as one would expect, to 
allow interrupts to be one of many events to wait for.
(If you don't like the file interface, a special system call could
return the file descriptor instead.)

Interrupts are usually masked; while a thread is in poll(2) or read(2) on the 
file they are unmasked.  

All architectures that use CONFIG_GENERIC_HARDIRQ are supported by
this patch.

A low latency user level interrupt handler would do something like
this, on a CONFIG_PREEMPT kernel:

  int irqfd;
  int n_ints;
  struct sched_param sched_param;

  irqfd = open("/proc/irq/513/irq", O_RDONLY);
  mlockall(MCL_CURRENT | MCL_FUTURE);
  sched_param.sched_priority = sched_get_priority_max(SCHED_FIFO) - 10;
  sched_setscheduler(0, SCHED_FIFO, &sched_param);

  while (read(irqfd, &n_ints, sizeof n_ints) == sizeof n_ints) {
       ... talk to device to handle interrupt
  }

If you don't care about latency, then forget about the mlockall() and
setting the priority, and you don't need CONFIG_PREEMPT.
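
For a driver that waits on several event sources, the interrupt file can be combined with poll(2). The following is a minimal sketch under stated assumptions: `sketch_wait_irq` is a hypothetical helper, and it assumes only what is described above, namely that the descriptor becomes readable when an interrupt is pending and that a read returns an int count.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for the descriptor to become readable, then read
 * the pending count.  Returns the count, 0 on timeout, or -1 on error.
 */
static inline int sketch_wait_irq(int irqfd, int timeout_ms)
{
	struct pollfd pfd = { .fd = irqfd, .events = POLLIN };
	int n_ints = 0;
	int rc;

	assert(irqfd >= 0);
	rc = poll(&pfd, 1, timeout_ms);
	if (rc < 0)
		return -1;	/* poll itself failed */
	if (rc == 0)
		return 0;	/* timed out, nothing pending */
	if (!(pfd.revents & POLLIN))
		return -1;	/* error or hangup rather than data */
	if (read(irqfd, &n_ints, sizeof n_ints) != (ssize_t)sizeof n_ints)
		return -1;
	return n_ints;
}
```

In a real driver the `pollfd` array would also hold the other descriptors of interest, such as a control socket, so that one thread can service interrupts and requests from the same loop.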

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

 kernel/irq/proc.c |  163 ++
 1 files changed, 153 insertions(+), 10 deletions(-)

Index: linux-2.6.11-usrdrivers/kernel/irq/proc.c
===
--- linux-2.6.11-usrdrivers.orig/kernel/irq/proc.c  2005-03-11 
10:30:57.875619102 +1100
+++ linux-2.6.11-usrdrivers/kernel/irq/proc.c   2005-03-11 10:45:07.146928168 
+1100
@@ -9,6 +9,8 @@
 #include 
 #include 
 #include 
+#include 
+#include "internals.h"
 
 static struct proc_dir_entry *root_irq_dir, *irq_dir[NR_IRQS];
 
@@ -90,27 +92,168 @@
action->dir = proc_mkdir(name, irq_dir[irq]);
 }
 
+struct irq_proc {
+   unsigned long irq;
+   wait_queue_head_t q;
+   atomic_t count;
+   char devname[TASK_COMM_LEN];
+};
+ 
+static irqreturn_t irq_proc_irq_handler(int irq, void *vidp, struct pt_regs *regs)
+{
+   struct irq_proc *idp = (struct irq_proc *)vidp;
+ 
+   BUG_ON(idp->irq != irq);
+   disable_irq_nosync(irq);
+   atomic_inc(&idp->count);
+   wake_up(&idp->q);
+   return IRQ_HANDLED;
+}
+ 
+
+/*
+ * Signal to userspace an interrupt has occurred.
+ */
+static ssize_t irq_proc_read(struct file *filp, char  __user *bufp, size_t len, loff_t *ppos)
+{
+   struct irq_proc *ip = (struct irq_proc *)filp->private_data;
+   irq_desc_t *idp = irq_desc + ip->irq;
+   int pending;
+   
+   DEFINE_WAIT(wait);
+   
+   if (len < sizeof(int))
+   return -EINVAL;
+   
+   pending = atomic_read(&ip->count);
+   if (pending == 0) {
+   if (idp->status & IRQ_DISABLED)
+   enable_irq(ip->irq);
+   if (filp->f_flags & O_NONBLOCK)
+   return -EWOULDBLOCK;
+   }
+   
+   while (pending == 0) {
+   prepare_to_wait(&ip->q, &wait, TASK_INTERRUPTIBLE);
+   pending = atomic_read(&ip->count);
+   if (pending == 0)
+   schedule();
+   finish_wait(&ip->q, &wait);
+   if (signal_pending(current))
+   return -ERESTARTSYS;
+   }
+   
+   if (copy_to_user(bufp, &pending, sizeof pending))
+   return -EFAULT;
+
+   *ppos += sizeof pending;
+   
+   atomic_sub(pending, &ip->count);
+   return sizeof pending;
+}
+
+
+static int irq_proc_open(struct inode *inop, struct file *filp)
+{
+   struct irq_proc *ip;
+   struct proc_dir_entry *ent = PDE(inop);
+   int error;
+
+   ip = kmalloc(sizeof *ip, GFP_KERNEL);
+   if (ip == NULL)
+   return -ENOMEM;
+   
+   memset(ip, 0, sizeof(*ip));
+   strcpy(ip->devname, current->comm);
+   init_waitqueue_head(&ip->q);
+   atomic_set(&ip->count, 0);
+   ip->irq = (unsigned long)ent->data;
+   
+   error = request_irq(ip->irq,
+   irq_proc_irq_handler,
+   SA_INTERRUPT,
+   ip->devname,
+   ip);
+   if (error < 0) {
+   

Re: binary drivers and development

2005-03-10 Thread Peter Chubb
> "John" == John Richard Moser <[EMAIL PROTECTED]> writes:


John> I've done more thought, here's a small list of advantages on
John> using binary drivers, specifically considering UDI.  You can
John> consider a different implementation for binary drivers as well,
John> with most of the same advantages.

Almost all these advantages are also present for user-mode drivers...
and getting drivers out of the kernel, where possible, is a much
better approach IMHO than trying to maintain a leaky in-kernel
interface.  The problem with in-kernel interfaces, even if set in
concrete, is that any binary driver can go outside the interface ---
there's no encapsulation --- and so break when the kernel changes.

Peter C




Re: Reading large /proc entry from kernel module

2005-03-08 Thread Peter Chubb
>>>>> "Kristian" == Kristian Sørensen <[EMAIL PROTECTED]> writes:

Kristian> Hi all!  I have some trouble reading a 2346 byte /proc entry
Kristian> from our Umbrella kernel module.


Kristian> static int umb_proc_write(struct file *file, const char *buffer,
Kristian>  unsigned long count, void *data) {
Kristian>   char *policy;
Kristian>   int *lbuf;
Kristian>   int i;

Here's your problem:  lbuf should be a char * not an int *.
When you look at lbuf[0] you'll get the first four characters packed
into the int.
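
The effect is easy to reproduce in user space. This is a minimal sketch with hypothetical helper names; it uses memcpy() to read the same bytes an int * would see, which avoids alignment and aliasing problems in the demonstration itself.

```c
#include <assert.h>
#include <string.h>

/* Element 0 seen through a char *: exactly one character. */
static inline char sketch_first_char(const char *buf)
{
	assert(buf != NULL);
	return buf[0];
}

/*
 * Element 0 seen through an int *: sizeof(int) characters packed into
 * one value.  The buffer must hold at least sizeof(int) bytes.
 */
static inline int sketch_first_int(const char *buf)
{
	int v;

	assert(buf != NULL);
	memcpy(&v, buf, sizeof v);
	return v;
}
```

Comparing the int view against a single character such as 'a' fails whenever the following bytes are non-zero, which is why the buffer has to be walked as characters.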
-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: [PATCH] Fixing address space lock contention in 2.6.11

2005-03-02 Thread Peter Chubb

Sorry, forgot the `signed-off-by'...

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


[PATCH] Fixing address space lock contention in 2.6.11

2005-03-02 Thread Peter Chubb

Hi,
As part of the Gelato scalability focus group, we've been running
OSDL's Re-AIM7 benchmark with an I/O intensive load with varying
numbers of processors.  The current kernel shows severe contention on
the tree_lock in the address space structure when running on tmpfs or
ext2 on a RAM disk.


Lockstat output for a 12-way:

SPINLOCKS         HOLD            WAIT
  UTIL    CON    MEAN(  MAX )    MEAN(  MAX )(% CPU)      TOTAL NOWAIT   SPIN  RJECT  NAME

         5.5%   0.4us(3177us)    28us(  20ms)(44.2%)  131821954  94.5%   5.5%  0.00%  *TOTAL*

 72.3%  13.1%   0.5us( 9.5us)    29us(  20ms)(42.5%)   50542055  86.9%  13.1%     0%  find_lock_page+0x30
 23.8%     0%   385us(3177us)     0us                      23235   100%     0%     0%  exit_mmap+0x50
 11.5%  0.82%   0.1us( 101us)    17us(5670us)( 1.6%)   50665658  99.2%  0.82%     0%  dnotify_parent+0x70


Replacing the spinlock with a multi-reader lock fixes this problem,
without unduly affecting anything else.

Here are the benchmark results (jobs per minute at a 50-client level,
average of 5 runs, standard deviation in parens) on an HP Olympia with
3 cells, 12 processors, and dnotify turned off (after this spinlock,
the spinlock in dnotify_parent is the worst contended for this workload).

                     tmpfs                             ext2
#CPUs   spinlock     rwlock      change    spinlock    rwlock      change
   1     7556(15)     7588(17)   +0.42%     3744(20)    3791(16)   +1.25%
   2    13743(31)    13791(33)   +0.35%     6405(30)    6413(24)   +0.12%
   4    23334(111)   22881(154)  -2%        9648(51)    9595(50)   -0.55%
   8    33580(240)   36163(190)  +7.7%     13183(63)   13070(68)   -0.85%
  12    28748(170)   44064(238)  +53%      12681(49)   14504(105)  +14%

And on a Pentium 3 single processor:
   1     4177(4)      4169(2)    -0.2%      3811(4)     3820(3)    +0.23%

I'm not sure what's happening in the 4-processor case.  The important
thing to note is that with a spinlock, the benchmark shows worse
performance for a 12-way than for an 8-way box; with the patch, the 12-way
performs better, as expected.  We've done some runs with 16-way as
well; without the patch below, the 16-way performs worse than the
12-way.

Anyway, here's the patch to convert the address space lock to a
rwlock, and allow multiple processes to scan an address-space's radix
tree at once.

= drivers/mtd/devices/block2mtd.c 1.4 vs edited =
--- 1.4/drivers/mtd/devices/block2mtd.c 2005-02-02 19:27:37 +11:00
+++ edited/drivers/mtd/devices/block2mtd.c  2005-02-22 14:28:23 +11:00
@@ -59,7 +59,7 @@ void cache_readahead(struct address_spac
 
 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
-	spin_lock_irq(&mapping->tree_lock);
+	read_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < PAGE_READAHEAD; i++) {
 		pagei = index + i;
 		if (pagei > end_index) {
@@ -71,16 +71,16 @@ void cache_readahead(struct address_spac
 			break;
 		if (page)
 			continue;
-		spin_unlock_irq(&mapping->tree_lock);
+		read_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		spin_lock_irq(&mapping->tree_lock);
+		read_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = pagei;
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
-	spin_unlock_irq(&mapping->tree_lock);
+	read_unlock_irq(&mapping->tree_lock);
 	if (ret)
 		read_cache_pages(mapping, &page_pool, filler, NULL);
 }
= fs/buffer.c 1.271 vs edited =
--- 1.271/fs/buffer.c   2005-02-18 20:44:07 +11:00
+++ edited/fs/buffer.c  2005-02-22 14:31:41 +11:00
@@ -875,7 +875,7 @@ int __set_page_dirty_buffers(struct page
 	spin_unlock(&mapping->private_lock);
 
 	if (!TestSetPageDirty(page)) {
-		spin_lock_irq(&mapping->tree_lock);
+		read_lock_irq(&mapping->tree_lock);
 		if (page->mapping) {	/* Race with truncate? */
 			if (!mapping->backing_dev_info->memory_backed)
 				inc_page_state(nr_dirty);
@@ -883,7 +883,7 @@ int __set_page_dirty_buffers(struct page
 					page_index(page),
 					PAGECACHE_TAG_DIRTY);
 		}
-		spin_unlock_irq(&mapping->tree_lock);
+		read_unlock_irq(&mapping->tree_lock);
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}

= fs/inode.c 1.143 vs edited =
--- 1.143/fs/inode.c2005-01-21 16:02:13 +11:00
+++ edited/fs/inode.c   2005-02-22 14:16:33 +11:00
@@ -196,7 +196,7 @@ void inode_init_once(struct inode *inode
sema_init(&inode->i_sem, 1);
init_rwsem(&inode->i_alloc_sem);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);

Re: [PATCH] Linux-2.6.11-rc5: kernel/sys.c setrlimit() RLIMIT_RSS cleanup

2005-02-27 Thread Peter Chubb
>>>>> "Andrew" == Andrew Morton <[EMAIL PROTECTED]> writes:

Andrew> <[EMAIL PROTECTED]> wrote:
>>  $ ulimit -m 10 bash: ulimit: max memory size: cannot modify
>> limit: Function not implemented

Andrew> I don't know about this.  The change could cause existing
Andrew> applications and scripts to fail.  Sure, we'll do that
Andrew> sometimes but this doesn't seem important enough. 

What's more, there have been (and still are) out-of-tree patches to
enforce rlimit-RSS in various ways.  There just hasn't been consensus
yet on the best implementation.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: Xterm Hangs - Possible scheduler defect?

2005-02-24 Thread Peter Chubb
>>>>> "Chad" == Chad N Tindel <[EMAIL PROTECTED]> writes:

Chad> I would make the following assertion for any kernel:

Chad> No single userspace thread of execution running on an SMP system
Chad> should be able to hose a box by going CPU-bound, bug in the
Chad> software or no bug.  Any kernel should be able to handle this
Chad> case and shift general work over to other processors.

In many Unices, crucial kernel threads run at realtime priority with a
static priority higher than is accessible to user code.

That being said, you have to be a privileged user to give a thread a
very high real-time priority, and if you do, you'd better know what
you're doing.  Any SCHED_FIFO thread should run for a time, then sleep
for a time, or it *will* DoS everything else on the processor.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: Help enabling PCI interrupts on Dell/SMP and Sun/SMP systems.

2005-02-23 Thread Peter Chubb
>>>>> "Alan" == Alan Kilian <[EMAIL PROTECTED]> writes:






Alan>   kernel: SSE: Found a DeCypher card.  kernel: ACPI: PCI
Alan> interrupt :13:03.0[A] -> GSI 36 (level, low) -> IRQ 217

If ACPI has set this device up to use interrupt 217, why are you
registering it on IRQ 5?

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Repeatable hang with XFS under 2.6.11-rc4

2005-02-14 Thread Peter Chubb

Running Reaim-7 on a 4G ram disk with 4 processors on
Itanium... Every few runs, as the multiprocessing level increases, we
see 22 processes hung in sync(), all except one waiting in
sync_filesystems() and that one waiting in pagebuf_iowait().

There's lots of free memory, the ram-disk is not full, ...
Load average is low; nothing in the logs or on the console.

[EMAIL PROTECTED]:/proc# vmstat 2 
procs -------------memory------------ ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd     free    buff  cache   si   so    bi    bo   in    cs us sy  id wa
 0  0 0 23027552 1091472 218496  00 1 42107   12 6 1 21 78  0
 0  0      0 23027552 1091472 218496    0    0     0     0 4110    10  0  0 100  0
 0  0      0 23027552 1091472 218496    0    0     0     0 4109     8  0  0 100  0
 0  0      0 23027488 1091472 218496    0    0     0    32 4114    15  0  0 100  0
 0  0      0 23027488 1091472 218496    0    0     0     0 4110     9  0  0 100  0
 0  0      0 23027488 1091472 218496    0    0     0     0 4109     9  0  0 100  0
 
[EMAIL PROTECTED]:/proc/fs/xfs# df /mnt/ram-disk
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/ram1              1038336    127800    910536  13% /mnt/ram-disk


-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


JBD problems in linux 2.6.11 rc3

2005-02-09 Thread Peter Chubb
  sp=e000165af810
bsp=e000165a9520
 [] die_if_kernel+0x40/0x60
sp=e000165af810
bsp=e000165a94f0
 [] ia64_bad_break+0x220/0x340
sp=e000165af810
bsp=e000165a94c8
 [] ia64_leave_kernel+0x0/0x260
sp=e000165af8a0
bsp=e000165a94c8
 [] cascade+0xf0/0x100
sp=e000165afa70
bsp=e000165a9468
 [] run_timer_softirq+0x370/0x460
sp=e000165afa70
bsp=e000165a93d8
 [] __do_softirq+0x200/0x240
sp=e000165afa90
bsp=e000165a9338
 [] do_softirq+0x80/0xe0
sp=e000165afa90
bsp=e000165a92d8
 [] irq_exit+0x80/0xa0
sp=e000165afa90
bsp=e000165a92c0
 [] ia64_handle_irq+0x110/0x140
sp=e000165afa90
bsp=e000165a9288
 [] ia64_leave_kernel+0x0/0x260
sp=e000165afa90
bsp=e000165a9288
 [] ia64_spinlock_contention+0x20/0x60
sp=e000165afc60
bsp=e000165a9288
 [] _spin_lock+0x40/0x60
sp=e000165afc60
bsp=e000165a9280
 [] journal_dirty_data+0x1b0/0x760
sp=e000165afc60
bsp=e000165a9230
 [] ext3_journal_dirty_data+0x30/0xa0
sp=e000165afc60
bsp=e000165a9200
 [] walk_page_buffers+0x160/0x180
sp=e000165afc60
bsp=e000165a9180
 [] ext3_ordered_commit_write+0x70/0x180
sp=e000165afc60
bsp=e000165a9128
 [] generic_file_buffered_write+0x520/0xca0
sp=e000165afc60
bsp=e000165a9030
 [] __generic_file_aio_write_nolock+0x420/0x6e0
sp=e000165afd10
bsp=e000165a8fb8
 [] generic_file_aio_write+0xd0/0x240
sp=e000165afd30
bsp=e000165a8f60
 [] ext3_file_write+0x60/0x220
sp=e000165afd40
bsp=e000165a8f28
 [] do_sync_write+0x130/0x180
sp=e000165afd40
bsp=e000165a8ee8
 [] vfs_write+0x1d0/0x2a0
sp=e000165afe20
bsp=e000165a8ea0
 [] sys_write+0x80/0xe0
sp=e000165afe20
bsp=e000165a8e20
 [] ia64_ret_from_syscall+0x0/0x20
sp=e000165afe30
bsp=e000165a8e20
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
 

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


OOPS when using UDF fs

2005-02-04 Thread Peter Chubb

When I try to write to a UDF fs on a USB-connected Ricoh DVD burner
(specifically, when I create a directory), I get:

usb-storage: Attempting to get CSW...
usb-storage: usb_stor_bulk_transfer_buf: xfer 13 bytes
usb-storage: Status code 0; transferred 13/13
usb-storage: -- transfer complete
usb-storage: Bulk status result = 0
usb-storage: Bulk Status S 0x53425355 T 0x80b f R 0 Stat 0x0
usb-storage: -- Result from auto-sense is 0
usb-storage: -- code: 0x70, key: 0x5, ASC: 0x2c, ASCQ: 0x0
usb-storage: (Unknown Key): (unknown ASC/ASCQ)
usb-storage: scsi cmd done, result=0x2
usb-storage: *** thread sleeping.
end_request: I/O error, dev sr0, sector 1096
Unable to handle kernel paging request at virtual address 2101
 printing eip: 
e1a3562e
*pde = 
Oops:  [#1]
PREEMPT 
Modules linked in: loop udf usb_storage sr_mod orinoco_cs orinoco hermes pcmcia 
ehci_hcd uhci_hcd yenta_socket
rsrc_nonstatic pcmcia_core snd_intel8x0 snd_ac97_codec snd_pcm_oss
snd_mixer_oss snd_pcm snd_timer
 snd snd_page_alloc i2c_i801
CPU:0
EIP:0060:[pg0+558581294/1067963392]   Not tainted VLI
EFLAGS: 00010293   (2.6.11-rc3) 
EIP is at udf_get_filelongad+0x1e/0x50 [udf]
eax: 21c1   ebx: 2101   ecx: ce301e30   edx: 000d2d2a
esi: 2101   edi: 2101   ebp: ce301e30   esp: ce301d84
ds: 007b   es: 007b   ss: 0068
Process cp (pid: 4869, threadinfo=ce30 task=cdefda00)
Stack: 0112 d03e5c6c e1a2dadc 0001   01301d9c ce301d9c 
c81b4740 ca82714c  db5f8400  c0155f3f 0002 ce301e28 
   d03e5ca4 ce301e40 ce301e30 e1a2d9a6 ce301e34 ce301e3c ce301e40 0001 
Call Trace:
 [pg0+558549724/1067963392] udf_current_aext+0xcc/0x1b0 [udf]
 [__wait_on_buffer+47/64] __wait_on_buffer+0x2f/0x40
 [pg0+558549414/1067963392] udf_next_aext+0x46/0xb0 [udf]
 [pg0+558577131/1067963392] udf_discard_prealloc+0xcb/0x2b0 [udf]
 [d_rehash+116/144] d_rehash+0x74/0x90
 [pg0+558533743/1067963392] udf_clear_inode+0x2f/0x40 [udf]
 [clear_inode+180/208] clear_inode+0xb4/0xd0
 [pg0+558529176/1067963392] udf_new_block+0xc8/0xda [udf]
 [generic_forget_inode+270/320] generic_forget_inode+0x10e/0x140
 [iput+83/112] iput+0x53/0x70 
 [pg0+558533527/1067963392] udf_new_inode+0x337/0x34a [udf]
 [do_no_page+413/832] do_no_page+0x19d/0x340
 [pg0+558557856/1067963392] udf_mkdir+0x0/0x220 [udf]
 [generic_forget_inode+270/320] generic_forget_inode+0x10e/0x140
 [iput+83/112] iput+0x53/0x70
 [pg0+558533527/1067963392] udf_new_inode+0x337/0x34a [udf]
 [do_no_page+413/832] do_no_page+0x19d/0x340
 [pg0+558557856/1067963392] udf_mkdir+0x0/0x220 [udf]
 [pg0+558557925/1067963392] udf_mkdir+0x45/0x220 [udf]
 [__d_lookup+161/384] __d_lookup+0xa1/0x180
 [dput+30/576] dput+0x1e/0x240 
 [cached_lookup+29/128] cached_lookup+0x1d/0x80
 [pg0+558557856/1067963392] udf_mkdir+0x0/0x220 [udf]
 [vfs_mkdir+95/160] vfs_mkdir+0x5f/0xa0 
 [sys_mkdir+145/224] sys_mkdir+0x91/0xe0
 [syscall_call+7/11] syscall_call+0x7/0xb
Code: e0 74 a3 e1 e8 f4 63 6e de eb ea 89 f6 83 ec 08 85 c0 89 5c 24 04 89 c3 
74 33 85 c9 74 2f 8b 01 85 c0 78 21 83 c0
10 39 d0 77 1a <8b> 13 85 d2 74 14 8b 54 24 0c 85 d2 74 02 89 01 89 d8 8b 5c 24



Re: Support for Large Block Devices

2005-01-24 Thread Peter Chubb
>>>>> "Maciej" == Maciej Soltysiak <[EMAIL PROTECTED]> writes:

Maciej> Hi, I was wondering... Why is "Support for Large Block
Maciej> Devices" still an option?

Maciej> Shouldn't it be compiled in always?  Or maybe there are some
Maciej> cons like incompatibility or something?

It's not compiled in always on 32-bit platforms, because
 1.  Most people don't have more than 2TB in a single block device
 2.  64-bit sizes mean increased size of various structures (i.e.,
 less cache-friendly), and slightly slower operations.

On 64-bit platforms it *is* always enabled.
-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: [PATCH]sched: Isochronous class v2 for unprivileged soft rt scheduling

2005-01-20 Thread Peter Chubb
>>>>> "Jack" == Jack O'Quin <[EMAIL PROTECTED]> writes:


Jack> Looks like we need to do another study to determine which
Jack> filesystem works best for multi-track audio recording and
Jack> playback.  XFS looks promising, but only if they get the latency
Jack> right.  Any experience with that?  

The nice thing about audio/video and XFS is that if you know ahead of
time the max size of a file (and you usually do -- because you know
ahead of time how long a take is going to be) you can precreate the
file as a contiguous chunk, then just fill it in, for minimum disc
latency.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: [PATCH RFC] 'spinlock/rwlock fixes' V3 [1/1]

2005-01-20 Thread Peter Chubb
>>>>> "Chris" == Chris Wedgwood <[EMAIL PROTECTED]> writes:

Chris> On Wed, Jan 19, 2005 at 07:01:04PM -0800, Andrew Morton wrote:

Chris> It still isn't enough to get rid of the rwlock_read_locked and
Chris> rwlock_write_locked usage in kernel/spinlock.c as those are
Chris> needed for the cpu_relax() calls so we have to decide on
Chris> suitable names still...  

I suggest reversing the sense of the macros, and having read_can_lock()
and write_can_lock()

Meaning:
read_can_lock() --- a read_lock() would have succeeded
write_can_lock() --- a write_lock() would have succeeded.

IA64 implementation:

#define read_can_lock(x)  (*(volatile int *)(x) >= 0)
#define write_can_lock(x) (*(volatile int *)(x) == 0)

Then use them as
 !read_can_lock(x)
where you want the old semantics.  The compiler ought to be smart
enough to optimise the boolean ops.

---
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*





Re: Horrible regression with -CURRENT from "Don't busy-lock-loop in preemptable spinlocks" patch

2005-01-19 Thread Peter Chubb
>>>>> "Ingo" == Ingo Molnar <[EMAIL PROTECTED]> writes:

Ingo> * Peter Chubb <[EMAIL PROTECTED]> wrote:

>> Here's a patch that adds the missing read_is_locked() and
>> write_is_locked() macros for IA64.  When combined with Ingo's
>> patch, I can boot an SMP kernel with CONFIG_PREEMPT on.
>> 
>> However, I feel these macros are misnamed: read_is_locked() returns
>> true if the lock is held for writing; write_is_locked() returns
>> true if the lock is held for reading or writing.

Ingo> well, 'read_is_locked()' means: "will a read_lock() succeed"

Fail, surely?

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: Horrible regression with -CURRENT from "Don't busy-lock-loop in preemptable spinlocks" patch

2005-01-18 Thread Peter Chubb


Here's a patch that adds the missing read_is_locked() and
write_is_locked() macros for IA64.  When combined with Ingo's patch, I
can boot an SMP kernel with CONFIG_PREEMPT on.

However, I feel these macros are misnamed: read_is_locked() returns true if
the lock is held for writing; write_is_locked() returns true if the
lock is held for reading or writing.

Signed-off-by: Peter Chubb <[EMAIL PROTECTED]>

Index: linux-2.6-bklock/include/asm-ia64/spinlock.h
===
--- linux-2.6-bklock.orig/include/asm-ia64/spinlock.h	2005-01-18 13:46:08.138077857 +1100
+++ linux-2.6-bklock/include/asm-ia64/spinlock.h	2005-01-19 08:58:59.303821753 +1100
@@ -126,8 +126,20 @@
 #define RW_LOCK_UNLOCKED (rwlock_t) { 0, 0 }
 
 #define rwlock_init(x) do { *(x) = RW_LOCK_UNLOCKED; } while(0)
+
 #define rwlock_is_locked(x)	(*(volatile int *) (x) != 0)
 
+/* read_is_locked - would read_trylock() fail?
+ * @lock: the rwlock in question.
+ */
+#define read_is_locked(x)	(*(volatile int *) (x) < 0)
+
+/**
+ * write_is_locked - would write_trylock() fail?
+ * @lock: the rwlock in question.
+ */
+#define write_is_locked(x)	(*(volatile int *) (x) != 0)
+
 #define _raw_read_lock(rw)                                              \
 do {                                                                    \
 	rwlock_t *__read_lock_ptr = (rw);                                \

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


Re: [PATCH] VM fixes + RSS limits 2.4.0-test13-pre5

2001-01-07 Thread Peter Chubb



Ingo wrote:
> On Wed, Jan 03, 2001 at 09:43:54AM -0200, Rik van Riel wrote:
> > On Fri, 28 Dec 2000, Mike Sklar wrote:
> > > If I wanted to adjust the rlim_cur value of a running
> > > processes, is there any sort of interface for that?
> > 
> > Hmmm, I don't think there is an interface to adjust the
> > per-process ulimit settings on-the-fly ...
> > 
> > Does anybody know if there's an interface for this ?

> If you don't mean "kill -TERM", no there isn't. It would be evil
> to the process anyway.

The RSS limits patch I sent to linux-kernel some time ago provided an
experimental /proc interface to allow exactly this.
The patch against 2.2.16 is still on our FTP server at 

ftp://ftp-au.aurema.com/private/aurpjc31/linux-2216-rsslimit.diff.bz2

Here's the patch against 2.4.0.  The main differences between this and 
Rik's patch are:
  -- You choose soft or hard limits at kernel config time with my
     patch; with Rik's you get both (rlim_cur is `soft', rlim_max is
     `hard').
  -- Rik's patch does some extra stuff to the VM code as well as
     the RSS limits.
  -- Rik's patch doesn't affect swap behaviour (except in so far
     as processes over their RSS limit will tend to swap, which reduces
     memory pressure on all other processes); my patch means that
     processes over their RSS limit suffer somewhat.
  -- My patch puts the limit into the struct mm for slightly more
     cache-friendly behaviour, and to allow later interfacing with
     per-user resource-management software (it should be possible
     to write a kernel module to adjust RSS limits to implement per-user
     limits without affecting per-process RLIMIT values).
  -- My patch has a /proc interface to allow setting
     rlimit[RLIMIT_RSS].
  -- My patch implements the RSS accounting fields so that time -v
     gives reasonable output.


Index: linux-2.4.0/CREDITS
===
RCS file: /wrk/CVSROOT/linux-2.4/CREDITS,v
retrieving revision 1.1.1.5
diff -u -b -u -r1.1.1.5 CREDITS
--- linux-2.4.0/CREDITS 2001/01/04 23:02:54 1.1.1.5
+++ linux-2.4.0/CREDITS 2001/01/08 04:41:41
@@ -491,6 +491,24 @@
 S: Stanford, California 94305
 S: USA
 
+N: Kingsley Cheung
+E: [EMAIL PROTECTED]
+D: Page fault calculation
+D: /proc//rss support
+D: kswapd improvements regarding process RSS limits 
+S: Aurema Pty Limited
+S: PO Box 305, Strawberry Hills NSW 2012, 
+S: Australia 
+
+N: Peter Chubb
+E: [EMAIL PROTECTED]
+D: Page fault calculation
+D: /proc//rss support
+D: kswapd improvements regarding process RSS limits 
+S: Aurema Pty Limited
+S: PO Box 305, Strawberry Hills NSW 2012, 
+S: Australia 
+
 N: Juan Jose Ciarlante
 W: http://juanjox.kernelnotes.org/
 E: [EMAIL PROTECTED]
Index: linux-2.4.0/Documentation/Configure.help
===
RCS file: /wrk/CVSROOT/linux-2.4/Documentation/Configure.help,v
retrieving revision 1.1.1.6
diff -u -b -u -r1.1.1.6 Configure.help
--- linux-2.4.0/Documentation/Configure.help2001/01/07 21:44:33 1.1.1.6
+++ linux-2.4.0/Documentation/Configure.help2001/01/08 04:41:41
@@ -16955,6 +16955,50 @@
   another UltraSPARC-IIi-cEngine boardset with a 7-segment display,
   you should say N to this option. 
 
+RSS Softlimits (EXPERIMENTAL)
+CONFIG_RSS_SOFTLIMIT
+  If you want the setrlimit(RLIMIT_RSS, ...) system call to work, say
+  Y either here or for RSS Hardlimits.  If you don't understand this
+  you don't need it, so say N.
+
+  RSS Softlimits will make it more likely that pages will be stolen
+  from processes that have a resident set size (i.e., real memory
+  footprint) greater than their limit.  Processes with a limit set
+  that is below their actual need may still exceed their limits, and
+  in this instance kswapd may work excessively hard.
+
+  Because of the way that RSS is measured and controlled, the limit is
+  approximate only.
+
+  It is harmless to have RSS Softlimits and RSS Hardlimits both set.
+
+RSS Hardlimits (EXPERIMENTAL)
+CONFIG_RSS_HARDLIMIT
+  If you want the setrlimit(RLIMIT_RSS, ...) system call to work, say
+  Y either here or for RSS Softlimits.  If you don't understand this
+  you don't need it, so say N.
+
+  RSS Hardlimits changes the behaviour of the kernel at page-fault
+  time.  If a process is over its RSS limit when it wants to get a new
+  page, then with this configuration option enabled the process's
+  memory space will be reduced before the page-fault continues.
+
+  Because of the way that RSS is measured and controlled, the actual
+  memory footprint of a process may exceed the set limit for a short
+  time.
+
+  It is harmless to have RSS Softlimits and RSS Hardlimits both set.
+
+Support for /proc/pid/rss (EXPERIMENTAL)
+CONFIG_PROC_RSS
+