Re: jffs zlib tuning
On 08/01/08 17:09 -0800, William Fisher wrote: Jordan Crouse wrote: On 08/01/08 12:06 -0500, Bernardo Innocenti wrote: (cc CP, aleph) David Woodhouse wrote: 1. Did anybody profile the kernel while reading files? Last thing I red on this list is that the profiler does not work on the XO in kernel mode. Did anybody fix that I believe that standard kernel profiling (on timer ticks) has always worked, and even continues to work even though we use a tickless kernel now. I think oprofile also works. oprofile works, but for some reason it cannot generate call graphs. It vaguely remember that the problem was that on the Geode we were using sw timers rather than NMI as a timing source. Right - but that should only prevent us from benchmarking within interrupts in the kernel - it shouldn't have any effect on our userland benchmarking. I'm no oprofile expert (I couldn't get it working at all when I tried it the other day), but do you have the debug version of libc loaded too? Maybe it can't find the symbols. Jordan If you have NMI interrupts selected for Oprofile, you can also get samples from the other lower level interrupt handlers. Since OProfile can be run in either NMI interrupts or normal timer based interrupts. We can't use NMI, because we have no mechanism for causing the NMIs. Modern processors such as the k8 use registers called event counters to count a number of events between sample periods (events being some processor quality like number of instructions executed or number of cache misses). These event counters can be set up in such a way that they cause a NMI when the counter rolls over. This is how oprofile takes advantage of the silicon. The kicker is that even though the Geode has event counters, we cannot set them up to cause a NMI. So, with that mechanism lost, we're stuck with the timer tick. This has been discussed several times over the lifetime of the project - you can go back over the archives of the list and see our past conclusions on this matter. Jordan ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: jffs zlib tuning
On Jan 9, 2008 10:51 AM, Jordan Crouse [EMAIL PROTECTED] wrote: We can't use NMI, because we have no mechanism for causing the NMIs. Modern processors such as the k8 use registers called event counters to count a number of events between sample periods (events being some processor quality like number of instructions executed or number of cache misses). These event counters can be set up in such a way that they cause a NMI when the counter rolls over. This is how oprofile takes advantage of the silicon. The kicker is that even though the Geode has event counters, we cannot set them up to cause a NMI. So, with that mechanism lost, we're stuck with the timer tick. This has been discussed several times over the lifetime of the project - you can go back over the archives of the list and see our past conclusions on this matter. There is still the possibility of using the MFGPT block in 5536 to cause periodic NMIs. That would be similar to oprofile's RTC or timer interrupt mode, but with support to profile interrupt handler code, and pretty-good granularity (14MHz-based) The perfmon-rollover style (not available on LX) is certainly the best way to do something like oprofile, but if you can make periodic NMIs fast enough, you should be able to get pretty close to similar quality. This completely new oprofile mode would be very specific to GX/LX/5536, but it would improve oprofile usefulness if someone wants to make the (pretty large) effort. ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: jffs zlib tuning
On Mon, 2008-01-07 at 18:15 +0100, NoiseEHC wrote: This message is primarily written for Bernardo Innocenti but everybody with relevant knowledge is welcomed to give some insight. I have decided two months ago that will write an asm implementation for zlib inflate (decompression) since Mitch Bradley said that the read speed is 3MB/sec which is dominated by the decompression code. http://lists.laptop.org/pipermail/devel/2007-November/007527.html Since then I went through the pain of installing linux in VirtualPC, compiling the code in linux and ended up with a kernel module which can test zlib code finally (took a month of my spare time, if I would have known this in advance I would not have started...). Now I understand the zlib code but need some info before acting on wrong assumptions: 1. Did anybody profile the kernel while reading files? Last thing I red on this list is that the profiler does not work on the XO in kernel mode. Did anybody fix that I believe that standard kernel profiling (on timer ticks) has always worked, and even continues to work even though we use a tickless kernel now. I think oprofile also works. 2. How does the file reading work? As I imagine the flash is read by DMA and the resulting data is uncompressed to a buffer. Is it correct?. Yes, that's right. Is the decompressed data gets copied to the target location or does it gets decompressed to their final place? http://dev.laptop.org/git?p=olpc-2.6;a=blob;f=fs/jffs2/read.c#l81 If it is copied, did somebody profile how much time it takes? These questions are important to know how much L2 cache is trashed in the process and which data needs prefetching. That hasn't been profiled specifically, no. 3. How long is the average data length which jffs2 uses for calling inflate? The maximum length of uncompressed data is 4KiB. The mean is probably slightly less than that. You could instrument the jffs2dump program from mtd-utils to give you more accurate answers. If you want to get involved with compression, see http://www.inf.u-szeged.hu/jffs2/bbc.php -- dwmw2 ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: jffs zlib tuning
(cc CP, aleph) David Woodhouse wrote: 1. Did anybody profile the kernel while reading files? Last thing I red on this list is that the profiler does not work on the XO in kernel mode. Did anybody fix that I believe that standard kernel profiling (on timer ticks) has always worked, and even continues to work even though we use a tickless kernel now. I think oprofile also works. oprofile works, but for some reason it cannot generate call graphs. It vaguely remember that the problem was that on the Geode we were using sw timers rather than NMI as a timing source. Without call graphs, benchmarking our rendering stack has been somewhat harder. You'd find lots of time spent in generic functions such as memcpy(), without a clue why. -- \___/ |___| Bernardo Innocenti - http://www.codewiz.org/ \___\ One Laptop Per Child - http://www.laptop.org/ ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: jffs zlib tuning
On 08/01/08 12:06 -0500, Bernardo Innocenti wrote: (cc CP, aleph) David Woodhouse wrote: 1. Did anybody profile the kernel while reading files? Last thing I red on this list is that the profiler does not work on the XO in kernel mode. Did anybody fix that I believe that standard kernel profiling (on timer ticks) has always worked, and even continues to work even though we use a tickless kernel now. I think oprofile also works. oprofile works, but for some reason it cannot generate call graphs. It vaguely remember that the problem was that on the Geode we were using sw timers rather than NMI as a timing source. Right - but that should only prevent us from benchmarking within interrupts in the kernel - it shouldn't have any effect on our userland benchmarking. I'm no oprofile expert (I couldn't get it working at all when I tried it the other day), but do you have the debug version of libc loaded too? Maybe it can't find the symbols. Jordan -- Jordan Crouse Systems Software Development Engineer Advanced Micro Devices, Inc. ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: jffs zlib tuning
Jordan Crouse wrote: On 08/01/08 12:06 -0500, Bernardo Innocenti wrote: (cc CP, aleph) David Woodhouse wrote: 1. Did anybody profile the kernel while reading files? Last thing I red on this list is that the profiler does not work on the XO in kernel mode. Did anybody fix that I believe that standard kernel profiling (on timer ticks) has always worked, and even continues to work even though we use a tickless kernel now. I think oprofile also works. oprofile works, but for some reason it cannot generate call graphs. It vaguely remember that the problem was that on the Geode we were using sw timers rather than NMI as a timing source. Right - but that should only prevent us from benchmarking within interrupts in the kernel - it shouldn't have any effect on our userland benchmarking. I'm no oprofile expert (I couldn't get it working at all when I tried it the other day), but do you have the debug version of libc loaded too? Maybe it can't find the symbols. Jordan If you have NMI interrupts selected for Oprofile, you can also get samples from the other lower level interrupt handlers. Since OProfile can be run in either NMI interrupts or normal timer based interrupts. Check out the oprofile driver for your kernel and check which .config option you have selected for your build. I use NMI based sampling with the AMD use it to look at the profiles in the various device driver interrupt handlers for both disk and ethernet controllers. Check out: ../kernel/linux-2.6.20/drivers/oprofile/timer_int.c This contains the ops vector initialization code, from which the timer_notify() procedure is called. The other code to look at is ../kernel/profile.c which contains the general profile support. -- Bill ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: jffs zlib tuning
NoiseEHC wrote: It is unfortunate that it was not discussed on this list. Anyway, at least I have learnt writing Linux drivers in the process... What compression can be achieved with LZO? Last time I used it only for real time compression (backup), and not because of the compression ratio. (I suppose that you have tested it.) Note that we compress 4KB blocks in jffs2. With small block sizes, algorithms with better compression tend to loose most of their advantage. We can't say much until we benchmark lzo vs gzip with small files. -- \___/ |___| Bernardo Innocenti - http://www.codewiz.org/ \___\ One Laptop Per Child - http://www.laptop.org/ ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: jffs zlib tuning
NoiseEHC wrote: 4. Bernardo, as I imagine you are compiling kernels every day. If I send you a patch (normally a rewritten inffast.c), would you test it on a real XO machine, please? I don't build kernels every day, but I'd be delighted to test such an improvement. And if you make it clean enough, I'm sure it would have an easy way into the mainline kernel. However, I recommend hacking in libz first, making it work with gzip, and staert porting it to the kernel next step. Debugging and benchmarking in userspace is *so* much easier. Also, I'd be very surprised if such an obvious optimization hadn't been tried already in 20+ years of gzip. Try digging around: you may find that it's not worth it. -- \___/ |___| Bernardo Innocenti - http://www.codewiz.org/ \___\ One Laptop Per Child - http://www.laptop.org/ ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: jffs zlib tuning
However, I recommend hacking in libz first, making it work with gzip, and staert porting it to the kernel next step. Debugging and benchmarking in userspace is *so* much easier. Ah, I tried it. Unfortunately the zlib in the kernel is a heavily modified zlib and I was not able to compile it in user space. So I have learnt writing kernel modules (took just 2 days, it is so simple compared to windows drivers that I still cannot believe it). And there is a comment in the kernel zlib that user-space support was removed... Also, I'd be very surprised if such an obvious optimization hadn't been tried already in 20+ years of gzip. Try digging around: you may find that it's not worth it. The optimization that I thought of is absolutely Geode specific. First it needs some prefetch, and secondly it has a lot of branches. The Geode has a very simple 1 bit branch predictor (it seems like that but not documented) so it can waste 20-40 cycles for every run (every length/distance code). I know that it is hard to create better code than a C compiler nowadays so I am sure that simply rewriting the code in asm would not speed things up (5 years ago LZO had several asm implementations for 486/586/686 but ironically all were slower than the compiler generated one.) Now that you have mentioned that jffs2 uses only 4K blocks, it can be possible that the bottleneck is not in inffast.c. Do you have ANY perf/profile data, please? All I would like to know whether the bottleneck lies in inffast or not: /* When large enough input and output buffers are supplied to inflate(), for example, a 16K input buffer and a 64K output buffer, more than 95% of the inflate execution time is spent in this routine. */ ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: jffs zlib tuning
On 7 Jan 2008, at 19:28, Benjamin M. Schwartz wrote: Nonetheless, I decided to benchmark the compressors on http://xs-dev.laptop.org/~cscott/olpc/streams/joyride/build1513/ devel_jffs2/xo-1-olpc-stream-joyride-build-1513-20080105_1602- devel_jffs2-tree.tar.bz2 Results: Uncompressed 732M LZO -1 (fast)345M LZO -9 (slow)298M Gzip (defaults) 269M Bzip2 (as downloaded)236M Note that fast vs. slow refers to compression speed. Decompression is the same speed in either case. These tests were run on the whole tar-file. As noted by Bernie, the differences would likely be smaller on 4KB blocks. I had some notes, which I can't now find, for a real-time sample logger we built... Anyway, we did some tests compressing small blocks, and I am sure we saw much less of a difference than Benjamin saw on his big block. Also, I'm pretty sure bzip2 fared very badly in our test, the small blocks didn't suit it at all (and it was slower, of course.) Of course, the statistics of our sample data was probably not representative, so YMMV... ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel
Re: jffs zlib tuning
Then it is the last nail to the coffin of asm-izing zlib. As I see the LZO code is cache friendly so there is no way to speed it up. I am sorry wasting you time. :) imm wrote: On 7 Jan 2008, at 19:28, Benjamin M. Schwartz wrote: Nonetheless, I decided to benchmark the compressors on http://xs-dev.laptop.org/~cscott/olpc/streams/joyride/build1513/ devel_jffs2/xo-1-olpc-stream-joyride-build-1513-20080105_1602- devel_jffs2-tree.tar.bz2 Results: Uncompressed 732M LZO -1 (fast)345M LZO -9 (slow)298M Gzip (defaults) 269M Bzip2 (as downloaded)236M Note that fast vs. slow refers to compression speed. Decompression is the same speed in either case. These tests were run on the whole tar-file. As noted by Bernie, the differences would likely be smaller on 4KB blocks. I had some notes, which I can't now find, for a real-time sample logger we built... Anyway, we did some tests compressing small blocks, and I am sure we saw much less of a difference than Benjamin saw on his big block. Also, I'm pretty sure bzip2 fared very badly in our test, the small blocks didn't suit it at all (and it was slower, of course.) Of course, the statistics of our sample data was probably not representative, so YMMV... ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel ___ Devel mailing list Devel@lists.laptop.org http://lists.laptop.org/listinfo/devel