Re: [PATCH] cafe_nand: remove busy-wait loop

2008-12-10 Thread John Gilmore
 So, the transfer rate dropped from about 9.6mb/sec to about 5.3mb/sec.

I was afraid of this.  Flash is pretty fast, and there's almost no
parallelism available in it (e.g. you can't queue up a bunch of reads
to occur when the data rotates under the heads, or start a seek
early).  Thus it's mostly worth sitting and waiting for the chip to
finish, rather than making the CPU try to do something else.  It all
happens at chip speeds -- and OLPC designed the CaFE chip to run as
fast as the flash chips and memory bus can run (because the Geode's
native flash support, for some reason, didn't go that fast).  [Note,
however, that the Marvell spec for the chip doesn't give ANY speed
specs.  It does say it might be possible to run its PCI bus at 66 MHz
rather than 33 -- which do we do?]

I really think we need to separate the effects of filesystem
compression from the effects of Flash storage.  Other reports show
systems being able to do many thousands of transactions per second to
SSD (flash-based disks), but we don't come anywhere close to that.
Why don't we?  Nobody knows.

There is a trivial way to make an uncompressed NAND image.  Take an
existing build's tar image, extract it to a directory, use the same
mkfs.jffs2 command from the standard build log, but add
--compression-mode none to build an uncompressed image.  Burning
that into NAND will let you test uncompressed reads.  That will give
you best-case reads, too, since all the data for a given file will be
contiguous.  

Perhaps there is some sysfs knob that will also let you turn off
compressed writes.

It's unfortunate that there's no block device way to access Flash
chips in Linux without using a filesystem at all.  It's much easier to
measure and tune I/O performance without a filesystem first, then see
what (filesystem or driver) optimizations are required to make a
particular filesystem fast on that device.

John

PS: During the long delay while the CaFe is erasing a block, it can do
reads from other blocks (but not this one); do we do these in parallel?
___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: [PATCH] cafe_nand: remove busy-wait loop

2008-12-10 Thread Chris Ball
Hi John,

It's unfortunate that there's no block device way to access Flash
chips in Linux without using a filesystem at all.  It's much easier
to measure and tune I/O performance without a filesystem first,
then see what (filesystem or driver) optimizations are required to
make a particular filesystem fast on that device.

Does e.g. dd if=mtd0 of=somefile bs=1M count=20 accomplish this?
(When a program expects a block device argument, you can give a raw
mtd0 instead.)

- Chris.
-- 
Chris Ball   [EMAIL PROTECTED]
___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: [PATCH] cafe_nand: remove busy-wait loop

2008-12-10 Thread David Woodhouse
On Wed, 2008-12-10 at 15:36 -0500, Chris Ball wrote:
 Hi John,
 
 It's unfortunate that there's no block device way to access Flash
 chips in Linux without using a filesystem at all.  It's much easier
 to measure and tune I/O performance without a filesystem first,
 then see what (filesystem or driver) optimizations are required to
 make a particular filesystem fast on that device.

What's wrong with just reading from the character device /dev/mtd0?

 Does e.g. dd if=mtd0 of=somefile bs=1M count=20 accomplish this?
 (When a program expects a block device argument, you can give a raw
 mtd0 instead.)

No, that's only for mount.

-- 
David WoodhouseOpen Source Technology Centre
[EMAIL PROTECTED]  Intel Corporation

___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


[PATCH] cafe_nand: remove busy-wait loop

2008-12-09 Thread Daniel Drake
This patch enables interrupts for DMA and command completion events, and
uses them to determine when commands and transfers have completed.

This avoids a busy-wait loop which was a waste of CPU time.

Signed-off-by: Daniel Drake [EMAIL PROTECTED]

---
Where do we go with this? Who's adventurous enough to test it? Dare we
put it in joyride? Any suggestions for how to measure performance
difference?

It works for me.

Index: linux-2.6.27-gentoo-r4/drivers/mtd/nand/cafe_nand.c
===
--- linux-2.6.27-gentoo-r4.orig/drivers/mtd/nand/cafe_nand.c
+++ linux-2.6.27-gentoo-r4/drivers/mtd/nand/cafe_nand.c
@@ -17,6 +17,8 @@
 #include linux/delay.h
 #include linux/interrupt.h
 #include linux/dma-mapping.h
+#include linux/spinlock.h
+#include linux/wait.h
 #include asm/io.h
 
 #define CAFE_NAND_CTRL10x00
@@ -51,12 +53,23 @@
 /* Missing from the datasheet: bit 19 of CTRL1 sets CE0 vs. CE1 */
 #define CTRL1_CHIPSELECT   (119)
 
+/* Enable command done and DMA done interrupts */
+static const uint32_t cafe_interrupt_mask = 0x6fff;
+
 struct cafe_priv {
struct nand_chip nand;
struct mtd_partition *parts;
struct pci_dev *pdev;
void __iomem *mmio;
struct rs_control *rs;
+
+   /* waitqueue for interrupt arrival notification */
+   wait_queue_head_t wq;
+
+   /* when an interrupt arrives, the IRQ bits are set in the irqs member */
+   spinlock_t irqs_lock;
+   uint32_t irqs;
+
uint32_t ctl1;
uint32_t ctl2;
int datalen;
@@ -153,6 +166,19 @@ static uint8_t cafe_read_byte(struct mtd
return d;
 }
 
+static uint32_t check_irq(struct cafe_priv *cafe, uint32_t doneint)
+{
+   uint32_t val;
+   unsigned long flags;
+
+   spin_lock_irqsave(cafe-irqs_lock, flags);
+   val = cafe-irqs;
+   spin_unlock_irqrestore(cafe-irqs_lock, flags);
+
+   cafe_dev_dbg(cafe-pdev-dev, check_irq %x\n, val);
+   return val  doneint;
+}
+
 static void cafe_nand_cmdfunc(struct mtd_info *mtd, unsigned command,
  int column, int page_addr)
 {
@@ -160,6 +186,7 @@ static void cafe_nand_cmdfunc(struct mtd
int adrbytes = 0;
uint32_t ctl1;
uint32_t doneint = 0x8000;
+   unsigned long flags;
 
cafe_dev_dbg(cafe-pdev-dev, cmdfunc %02x, 0x%x, 0x%x\n,
command, column, page_addr);
@@ -243,6 +270,10 @@ static void cafe_nand_cmdfunc(struct mtd
cafe_dev_dbg(cafe-pdev-dev, dlen %x, ctl1 %x, ctl2 %x\n,
cafe-datalen, ctl1, cafe_readl(cafe, NAND_CTRL2));
 
+   spin_lock_irqsave(cafe-irqs_lock, flags);
+   cafe-irqs = 0;
+   spin_unlock_irqrestore(cafe-irqs_lock, flags);
+
/* NB: The datasheet lies -- we really should be subtracting 1 here */
cafe_writel(cafe, cafe-datalen, NAND_DATA_LEN);
cafe_writel(cafe, 0x9000, NAND_IRQ);
@@ -273,21 +304,14 @@ static void cafe_nand_cmdfunc(struct mtd
ndelay(100);
 
if (1) {
-   int c;
-   uint32_t irqs;
+   long ret;
+   cafe_dev_dbg(cafe-pdev-dev, Entering wait_event\n);
+   ret = wait_event_timeout(cafe-wq, check_irq(cafe, doneint), HZ 
/ 2);
+   if (unlikely(ret == 0))
+   dev_warn(cafe-pdev-dev, Command wait IRQ 
timeout\n);
 
-   for (c = 50; c != 0; c--) {
-   irqs = cafe_readl(cafe, NAND_IRQ);
-   if (irqs  doneint)
-   break;
-   udelay(1);
-   if (!(c % 10))
-   cafe_dev_dbg(cafe-pdev-dev, Wait for ready, 
IRQ %x\n, irqs);
-   cpu_relax();
-   }
-   cafe_writel(cafe, doneint, NAND_IRQ);
-   cafe_dev_dbg(cafe-pdev-dev, Command %x completed after %d 
usec, irqs %x (%x)\n,
-command, 50-c, irqs, cafe_readl(cafe, 
NAND_IRQ));
+   cafe_dev_dbg(cafe-pdev-dev, Command %x completed in %ld 
jiffies, irqs %x\n,
+command, (HZ / 2) - ret, cafe_readl(cafe, 
NAND_IRQ));
}
 
WARN_ON(cafe-ctl2  (130));
@@ -334,11 +358,19 @@ static int cafe_nand_interrupt(int irq, 
struct mtd_info *mtd = id;
struct cafe_priv *cafe = mtd-priv;
uint32_t irqs = cafe_readl(cafe, NAND_IRQ);
-   cafe_writel(cafe, irqs  ~0x9000, NAND_IRQ);
if (!irqs)
return IRQ_NONE;
 
+   /* clear interrupts */
+   cafe_writel(cafe, irqs, NAND_IRQ);
+
cafe_dev_dbg(cafe-pdev-dev, irq, bits %x (%x)\n, irqs, 
cafe_readl(cafe, NAND_IRQ));
+
+   spin_lock(cafe-irqs_lock);
+   cafe-irqs |= irqs;
+   spin_unlock(cafe-irqs_lock);
+
+   wake_up(cafe-wq);
return IRQ_HANDLED;
 }
 
@@ -655,6 +687,8 @@ static int __devinit cafe_nand_probe(str

Re: [PATCH] cafe_nand: remove busy-wait loop

2008-12-09 Thread Mitch Bradley
Nice!

One quick test, for starters, is to do some file accesses to large files 
(dd or tar or whatever) and use the time command to look at the system 
and user times.  With the patch, the system time should drop to a 
fraction of the real time.

Daniel Drake wrote:
 This patch enables interrupts for DMA and command completion events, and
 uses them to determine when commands and transfers have completed.

 This avoids a busy-wait loop which was a waste of CPU time.

 Signed-off-by: Daniel Drake [EMAIL PROTECTED]

 ---
 Where do we go with this? Who's adventurous enough to test it? Dare we
 put it in joyride? Any suggestions for how to measure performance
 difference?
   

___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: [PATCH] cafe_nand: remove busy-wait loop

2008-12-09 Thread Tomeu Vizoso
On Tue, Dec 9, 2008 at 6:05 PM, Mitch Bradley [EMAIL PROTECTED] wrote:
 Nice!

 One quick test, for starters, is to do some file accesses to large files
 (dd or tar or whatever) and use the time command to look at the system
 and user times.  With the patch, the system time should drop to a
 fraction of the real time.

Wow! This should apply as well to writes?

I reported some time ago that downloads from the LAN were limited by
cpu time, but I think to remember that it was due to the
autocompression.

Regards,

Tomeu

 Daniel Drake wrote:
 This patch enables interrupts for DMA and command completion events, and
 uses them to determine when commands and transfers have completed.

 This avoids a busy-wait loop which was a waste of CPU time.

 Signed-off-by: Daniel Drake [EMAIL PROTECTED]

 ---
 Where do we go with this? Who's adventurous enough to test it? Dare we
 put it in joyride? Any suggestions for how to measure performance
 difference?


 ___
 Devel mailing list
 Devel@lists.laptop.org
 http://lists.laptop.org/listinfo/devel

___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: [PATCH] cafe_nand: remove busy-wait loop

2008-12-09 Thread Mitch Bradley
Tomeu Vizoso wrote:
 On Tue, Dec 9, 2008 at 6:05 PM, Mitch Bradley [EMAIL PROTECTED] wrote:
   
 Nice!

 One quick test, for starters, is to do some file accesses to large files
 (dd or tar or whatever) and use the time command to look at the system
 and user times.  With the patch, the system time should drop to a
 fraction of the real time.
 

 Wow! This should apply as well to writes?

 I reported some time ago that downloads from the LAN were limited by
 cpu time, but I think to remember that it was due to the
 autocompression.
   

In principle, for longish writes the compression of the next chunk could 
be overlapped with the write-back of the current chunk.  It will be 
interesting to see if that actually occurs, or if additional work is 
needed to unlock the potential parallelism.

At any rate, even for short writes, the userland program should be able 
to get the CPU back as soon as the write to NAND is initiated, thus 
saving 100 usec of otherwise-wasted time.

 Regards,

 Tomeu

   
 Daniel Drake wrote:
 
 This patch enables interrupts for DMA and command completion events, and
 uses them to determine when commands and transfers have completed.

 This avoids a busy-wait loop which was a waste of CPU time.

 Signed-off-by: Daniel Drake [EMAIL PROTECTED]

 ---
 Where do we go with this? Who's adventurous enough to test it? Dare we
 put it in joyride? Any suggestions for how to measure performance
 difference?

   
 ___
 Devel mailing list
 Devel@lists.laptop.org
 http://lists.laptop.org/listinfo/devel

 

___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: [PATCH] cafe_nand: remove busy-wait loop

2008-12-09 Thread Jim Gettys
On Tue, 2008-12-09 at 07:05 -1000, Mitch Bradley wrote:

  Where do we go with this? Who's adventurous enough to test it? Dare we
  put it in joyride? Any suggestions for how to measure performance
  difference?


If Deepak likes it and it passes some tests, that would be the logical
thing to do next.
  - Jim

-- 
Jim Gettys [EMAIL PROTECTED]
One Laptop Per Child

___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: [PATCH] cafe_nand: remove busy-wait loop

2008-12-09 Thread Ed McNierney
And then Mitch tosses the next ball to Daniel to catch.


On 12/9/08 12:22 PM, Jim Gettys [EMAIL PROTECTED] wrote:

 On Tue, 2008-12-09 at 07:05 -1000, Mitch Bradley wrote:
 
 Where do we go with this? Who's adventurous enough to test it? Dare we
 put it in joyride? Any suggestions for how to measure performance
 difference?
   
 
 If Deepak likes it and it passes some tests, that would be the logical
 thing to do next.
   - Jim


___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: [PATCH] cafe_nand: remove busy-wait loop

2008-12-09 Thread Deepak Saxena
On Dec 09 2008, at 16:05, Daniel Drake was caught saying:
 This patch enables interrupts for DMA and command completion events, and
 uses them to determine when commands and transfers have completed.
 
 This avoids a busy-wait loop which was a waste of CPU time.
 
 Signed-off-by: Daniel Drake [EMAIL PROTECTED]

Looks good, but since we're just checking for one bit, I think
we can we just use an atomic_t instead of needing a whole spinlock.

~Deepak

-- 
 Deepak Saxena - Kernel Developer, One Laptop Per Child
   _   __o   (o
---\,  Give One Laptop, Get One Laptop  //\
 - ( )/ ( )  http://www.amazon.com/xoV_/_

___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: [PATCH] cafe_nand: remove busy-wait loop

2008-12-09 Thread Daniel Drake
On Tue, Dec 9, 2008 at 5:05 PM, Mitch Bradley [EMAIL PROTECTED] wrote:
 Nice!

 One quick test, for starters, is to do some file accesses to large files (dd
 or tar or whatever) and use the time command to look at the system and
 user times.  With the patch, the system time should drop to a fraction of
 the real time.

The system time does drop considerably, but unfortunately the transfer
rate also decreases significantly. I think this is because many of the
commands execute very quickly (1-10us), therefore it is wasteful to go
and schedule a new process only to come back immediately after.
Perhaps we can come up with a scheme that does a fast busy-loop for
the quick commands, and then schedules for the longer ones.

Here are the results anyway. Both experiments running on joyride-2579,
jffs2 filesystem, and a 50mb randfile that I generated on another
computer from /dev/urandom

First experiment was echo 3  /proc/sys/vm/drop_caches; time dd
if=randfile of=/dev/null
that is, reading the file from the NAND, with caches cold. Repeated 3 times.

Before:
52428800 bytes (52 MB) copied, 5.4499 s, 9.6 MB/s
real0m7.324s
sys 0m7.080s

52428800 bytes (52 MB) copied, 5.38514 s, 9.7 MB/s
real0m5.452s
sys 0m5.190s

52428800 bytes (52 MB) copied, 5.41167 s, 9.7 MB/s
real0m5.480s
sys 0m5.210s



After:
52428800 bytes (52 MB) copied, 10.2002 s, 5.1 MB/s
real0m24.593s
sys 0m2.310s

52428800 bytes (52 MB) copied, 10.166 s, 5.2 MB/s
real0m10.266s
sys 0m0.910s

52428800 bytes (52 MB) copied, 9.72464 s, 5.4 MB/s
real0m9.840s
sys 0m1.210s

So, the transfer rate dropped from about 9.6mb/sec to about 5.3mb/sec.
However, sys time decreased by a factor of about 5. Also note that the
first run in each case took significantly longer (especially in the
after case, perhaps an anomaly) presumably because it had to load 'dd'
from nand too.



Second experiment: duplicate the file on disk, with hot caches so that
we are measuring write (and not also read).
rm copy; cat randfile  /dev/null; time cp randfile copy repeated 3 times

Before:
real0m56.459s
sys 0m51.280s

real0m57.919s
sys 0m51.300s

real0m56.016s
sys 0m50.970s

After:
real1m36.600s
sys 0m7.480s

real1m19.303s
sys 0m6.980s

real1m28.450s
sys 0m6.930s


Again, a considerable increase in real time, and a considerable
decrease in sys time.

I'll try and find some time to identify which commands are the ones
that take long, and then try a patch that only avoids the busy-wait
for those ones.

Daniel
___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel