Re: [PATCH] cafe_nand: remove busy-wait loop
So, the transfer rate dropped from about 9.6mb/sec to about 5.3mb/sec. I was afraid of this. Flash is pretty fast, and there's almost no parallelism available in it (e.g. you can't queue up a bunch of reads to occur when the data rotates under the heads, or start a seek early). Thus it's mostly worth sitting and waiting for the chip to finish, rather than making the CPU try to do something else. It all happens at chip speeds -- and OLPC designed the CaFE chip to run as fast as the flash chips and memory bus can run (because the Geode's native flash support, for some reason, didn't go that fast). [Note, however, that the Marvell spec for the chip doesn't give ANY speed specs. It does say it might be possible to run its PCI bus at 66 MHz rather than 33 -- which do we do?] I really think we need to separate the effects of filesystem compression from the effects of Flash storage. Other reports show systems being able to do many thousands of transactions per second to SSD (flash-based disks), but we don't come anywhere close to that. Why don't we? Nobody knows. There is a trivial way to make an uncompressed NAND image. Take an existing build's tar image, extract it to a directory, use the same mkfs.jffs2 command from the standard build log, but add --compression-mode none to build an uncompressed image. Burning that into NAND will let you test uncompressed reads. That will give you best-case reads, too, since all the data for a given file will be contiguous. Perhaps there is some sysfs knob that will also let you turn off compressed writes. It's unfortunate that there's no block device way to access Flash chips in Linux without using a filesystem at all. It's much easier to measure and tune I/O performance without a filesystem first, then see what (filesystem or driver) optimizations are required to make a particular filesystem fast on that device. John PS: During the long delay while the CaFe is erasing a block, it can do reads from other blocks (but not this one); do we do these in parallel? ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: [PATCH] cafe_nand: remove busy-wait loop
Hi John, It's unfortunate that there's no block device way to access Flash chips in Linux without using a filesystem at all. It's much easier to measure and tune I/O performance without a filesystem first, then see what (filesystem or driver) optimizations are required to make a particular filesystem fast on that device. Does e.g. dd if=mtd0 of=somefile bs=1M count=20 accomplish this? (When a program expects a block device argument, you can give a raw mtd0 instead.) - Chris. -- Chris Ball [EMAIL PROTECTED] ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: [PATCH] cafe_nand: remove busy-wait loop
On Wed, 2008-12-10 at 15:36 -0500, Chris Ball wrote: Hi John, It's unfortunate that there's no block device way to access Flash chips in Linux without using a filesystem at all. It's much easier to measure and tune I/O performance without a filesystem first, then see what (filesystem or driver) optimizations are required to make a particular filesystem fast on that device. What's wrong with just reading from the character device /dev/mtd0? Does e.g. dd if=mtd0 of=somefile bs=1M count=20 accomplish this? (When a program expects a block device argument, you can give a raw mtd0 instead.) No, that's only for mount. -- David WoodhouseOpen Source Technology Centre [EMAIL PROTECTED] Intel Corporation ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
[PATCH] cafe_nand: remove busy-wait loop
This patch enables interrupts for DMA and command completion events, and uses them to determine when commands and transfers have completed. This avoids a busy-wait loop which was a waste of CPU time. Signed-off-by: Daniel Drake [EMAIL PROTECTED] --- Where do we go with this? Who's adventurous enough to test it? Dare we put it in joyride? Any suggestions for how to measure performance difference? It works for me. Index: linux-2.6.27-gentoo-r4/drivers/mtd/nand/cafe_nand.c === --- linux-2.6.27-gentoo-r4.orig/drivers/mtd/nand/cafe_nand.c +++ linux-2.6.27-gentoo-r4/drivers/mtd/nand/cafe_nand.c @@ -17,6 +17,8 @@ #include linux/delay.h #include linux/interrupt.h #include linux/dma-mapping.h +#include linux/spinlock.h +#include linux/wait.h #include asm/io.h #define CAFE_NAND_CTRL10x00 @@ -51,12 +53,23 @@ /* Missing from the datasheet: bit 19 of CTRL1 sets CE0 vs. CE1 */ #define CTRL1_CHIPSELECT (119) +/* Enable command done and DMA done interrupts */ +static const uint32_t cafe_interrupt_mask = 0x6fff; + struct cafe_priv { struct nand_chip nand; struct mtd_partition *parts; struct pci_dev *pdev; void __iomem *mmio; struct rs_control *rs; + + /* waitqueue for interrupt arrival notification */ + wait_queue_head_t wq; + + /* when an interrupt arrives, the IRQ bits are set in the irqs member */ + spinlock_t irqs_lock; + uint32_t irqs; + uint32_t ctl1; uint32_t ctl2; int datalen; @@ -153,6 +166,19 @@ static uint8_t cafe_read_byte(struct mtd return d; } +static uint32_t check_irq(struct cafe_priv *cafe, uint32_t doneint) +{ + uint32_t val; + unsigned long flags; + + spin_lock_irqsave(cafe-irqs_lock, flags); + val = cafe-irqs; + spin_unlock_irqrestore(cafe-irqs_lock, flags); + + cafe_dev_dbg(cafe-pdev-dev, check_irq %x\n, val); + return val doneint; +} + static void cafe_nand_cmdfunc(struct mtd_info *mtd, unsigned command, int column, int page_addr) { @@ -160,6 +186,7 @@ static void cafe_nand_cmdfunc(struct mtd int adrbytes = 0; uint32_t ctl1; uint32_t doneint = 0x8000; + unsigned long flags; cafe_dev_dbg(cafe-pdev-dev, cmdfunc %02x, 0x%x, 0x%x\n, command, column, page_addr); @@ -243,6 +270,10 @@ static void cafe_nand_cmdfunc(struct mtd cafe_dev_dbg(cafe-pdev-dev, dlen %x, ctl1 %x, ctl2 %x\n, cafe-datalen, ctl1, cafe_readl(cafe, NAND_CTRL2)); + spin_lock_irqsave(cafe-irqs_lock, flags); + cafe-irqs = 0; + spin_unlock_irqrestore(cafe-irqs_lock, flags); + /* NB: The datasheet lies -- we really should be subtracting 1 here */ cafe_writel(cafe, cafe-datalen, NAND_DATA_LEN); cafe_writel(cafe, 0x9000, NAND_IRQ); @@ -273,21 +304,14 @@ static void cafe_nand_cmdfunc(struct mtd ndelay(100); if (1) { - int c; - uint32_t irqs; + long ret; + cafe_dev_dbg(cafe-pdev-dev, Entering wait_event\n); + ret = wait_event_timeout(cafe-wq, check_irq(cafe, doneint), HZ / 2); + if (unlikely(ret == 0)) + dev_warn(cafe-pdev-dev, Command wait IRQ timeout\n); - for (c = 50; c != 0; c--) { - irqs = cafe_readl(cafe, NAND_IRQ); - if (irqs doneint) - break; - udelay(1); - if (!(c % 10)) - cafe_dev_dbg(cafe-pdev-dev, Wait for ready, IRQ %x\n, irqs); - cpu_relax(); - } - cafe_writel(cafe, doneint, NAND_IRQ); - cafe_dev_dbg(cafe-pdev-dev, Command %x completed after %d usec, irqs %x (%x)\n, -command, 50-c, irqs, cafe_readl(cafe, NAND_IRQ)); + cafe_dev_dbg(cafe-pdev-dev, Command %x completed in %ld jiffies, irqs %x\n, +command, (HZ / 2) - ret, cafe_readl(cafe, NAND_IRQ)); } WARN_ON(cafe-ctl2 (130)); @@ -334,11 +358,19 @@ static int cafe_nand_interrupt(int irq, struct mtd_info *mtd = id; struct cafe_priv *cafe = mtd-priv; uint32_t irqs = cafe_readl(cafe, NAND_IRQ); - cafe_writel(cafe, irqs ~0x9000, NAND_IRQ); if (!irqs) return IRQ_NONE; + /* clear interrupts */ + cafe_writel(cafe, irqs, NAND_IRQ); + cafe_dev_dbg(cafe-pdev-dev, irq, bits %x (%x)\n, irqs, cafe_readl(cafe, NAND_IRQ)); + + spin_lock(cafe-irqs_lock); + cafe-irqs |= irqs; + spin_unlock(cafe-irqs_lock); + + wake_up(cafe-wq); return IRQ_HANDLED; } @@ -655,6 +687,8 @@ static int __devinit cafe_nand_probe(str
Re: [PATCH] cafe_nand: remove busy-wait loop
Nice! One quick test, for starters, is to do some file accesses to large files (dd or tar or whatever) and use the time command to look at the system and user times. With the patch, the system time should drop to a fraction of the real time. Daniel Drake wrote: This patch enables interrupts for DMA and command completion events, and uses them to determine when commands and transfers have completed. This avoids a busy-wait loop which was a waste of CPU time. Signed-off-by: Daniel Drake [EMAIL PROTECTED] --- Where do we go with this? Who's adventurous enough to test it? Dare we put it in joyride? Any suggestions for how to measure performance difference? ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: [PATCH] cafe_nand: remove busy-wait loop
On Tue, Dec 9, 2008 at 6:05 PM, Mitch Bradley [EMAIL PROTECTED] wrote: Nice! One quick test, for starters, is to do some file accesses to large files (dd or tar or whatever) and use the time command to look at the system and user times. With the patch, the system time should drop to a fraction of the real time. Wow! This should apply as well to writes? I reported some time ago that downloads from the LAN were limited by cpu time, but I think to remember that it was due to the autocompression. Regards, Tomeu Daniel Drake wrote: This patch enables interrupts for DMA and command completion events, and uses them to determine when commands and transfers have completed. This avoids a busy-wait loop which was a waste of CPU time. Signed-off-by: Daniel Drake [EMAIL PROTECTED] --- Where do we go with this? Who's adventurous enough to test it? Dare we put it in joyride? Any suggestions for how to measure performance difference? ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: [PATCH] cafe_nand: remove busy-wait loop
Tomeu Vizoso wrote: On Tue, Dec 9, 2008 at 6:05 PM, Mitch Bradley [EMAIL PROTECTED] wrote: Nice! One quick test, for starters, is to do some file accesses to large files (dd or tar or whatever) and use the time command to look at the system and user times. With the patch, the system time should drop to a fraction of the real time. Wow! This should apply as well to writes? I reported some time ago that downloads from the LAN were limited by cpu time, but I think to remember that it was due to the autocompression. In principle, for longish writes the compression of the next chunk could be overlapped with the write-back of the current chunk. It will be interesting to see if that actually occurs, or if additional work is needed to unlock the potential parallelism. At any rate, even for short writes, the userland program should be able to get the CPU back as soon as the write to NAND is initiated, thus saving 100 usec of otherwise-wasted time. Regards, Tomeu Daniel Drake wrote: This patch enables interrupts for DMA and command completion events, and uses them to determine when commands and transfers have completed. This avoids a busy-wait loop which was a waste of CPU time. Signed-off-by: Daniel Drake [EMAIL PROTECTED] --- Where do we go with this? Who's adventurous enough to test it? Dare we put it in joyride? Any suggestions for how to measure performance difference? ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: [PATCH] cafe_nand: remove busy-wait loop
On Tue, 2008-12-09 at 07:05 -1000, Mitch Bradley wrote: Where do we go with this? Who's adventurous enough to test it? Dare we put it in joyride? Any suggestions for how to measure performance difference? If Deepak likes it and it passes some tests, that would be the logical thing to do next. - Jim -- Jim Gettys [EMAIL PROTECTED] One Laptop Per Child ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: [PATCH] cafe_nand: remove busy-wait loop
And then Mitch tosses the next ball to Daniel to catch. On 12/9/08 12:22 PM, Jim Gettys [EMAIL PROTECTED] wrote: On Tue, 2008-12-09 at 07:05 -1000, Mitch Bradley wrote: Where do we go with this? Who's adventurous enough to test it? Dare we put it in joyride? Any suggestions for how to measure performance difference? If Deepak likes it and it passes some tests, that would be the logical thing to do next. - Jim ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: [PATCH] cafe_nand: remove busy-wait loop
On Dec 09 2008, at 16:05, Daniel Drake was caught saying: This patch enables interrupts for DMA and command completion events, and uses them to determine when commands and transfers have completed. This avoids a busy-wait loop which was a waste of CPU time. Signed-off-by: Daniel Drake [EMAIL PROTECTED] Looks good, but since we're just checking for one bit, I think we can we just use an atomic_t instead of needing a whole spinlock. ~Deepak -- Deepak Saxena - Kernel Developer, One Laptop Per Child _ __o (o ---\, Give One Laptop, Get One Laptop //\ - ( )/ ( ) http://www.amazon.com/xoV_/_ ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: [PATCH] cafe_nand: remove busy-wait loop
On Tue, Dec 9, 2008 at 5:05 PM, Mitch Bradley [EMAIL PROTECTED] wrote: Nice! One quick test, for starters, is to do some file accesses to large files (dd or tar or whatever) and use the time command to look at the system and user times. With the patch, the system time should drop to a fraction of the real time. The system time does drop considerably, but unfortunately the transfer rate also decreases significantly. I think this is because many of the commands execute very quickly (1-10us), therefore it is wasteful to go and schedule a new process only to come back immediately after. Perhaps we can come up with a scheme that does a fast busy-loop for the quick commands, and then schedules for the longer ones. Here are the results anyway. Both experiments running on joyride-2579, jffs2 filesystem, and a 50mb randfile that I generated on another computer from /dev/urandom First experiment was echo 3 /proc/sys/vm/drop_caches; time dd if=randfile of=/dev/null that is, reading the file from the NAND, with caches cold. Repeated 3 times. Before: 52428800 bytes (52 MB) copied, 5.4499 s, 9.6 MB/s real0m7.324s sys 0m7.080s 52428800 bytes (52 MB) copied, 5.38514 s, 9.7 MB/s real0m5.452s sys 0m5.190s 52428800 bytes (52 MB) copied, 5.41167 s, 9.7 MB/s real0m5.480s sys 0m5.210s After: 52428800 bytes (52 MB) copied, 10.2002 s, 5.1 MB/s real0m24.593s sys 0m2.310s 52428800 bytes (52 MB) copied, 10.166 s, 5.2 MB/s real0m10.266s sys 0m0.910s 52428800 bytes (52 MB) copied, 9.72464 s, 5.4 MB/s real0m9.840s sys 0m1.210s So, the transfer rate dropped from about 9.6mb/sec to about 5.3mb/sec. However, sys time decreased by a factor of about 5. Also note that the first run in each case took significantly longer (especially in the after case, perhaps an anomaly) presumably because it had to load 'dd' from nand too. Second experiment: duplicate the file on disk, with hot caches so that we are measuring write (and not also read). rm copy; cat randfile /dev/null; time cp randfile copy repeated 3 times Before: real0m56.459s sys 0m51.280s real0m57.919s sys 0m51.300s real0m56.016s sys 0m50.970s After: real1m36.600s sys 0m7.480s real1m19.303s sys 0m6.980s real1m28.450s sys 0m6.930s Again, a considerable increase in real time, and a considerable decrease in sys time. I'll try and find some time to identify which commands are the ones that take long, and then try a patch that only avoids the busy-wait for those ones. Daniel ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel