Re: Looking for advice on profiling
On Wed, Nov 10, 2004 at 10:07:46AM +, Malcolm Wallace wrote:
> Simon Marlow [EMAIL PROTECTED] writes:
> > On 09 November 2004 17:04, Duncan Coutts wrote:
> > > > Are you using BinMem, or BinIO?
> > >
> > > BinIO
> >
> > Ah. BinIO is going to be a lot slower than BinMem, because it does an
> > hPutChar for each character, whereas BinMem just writes into an array.
> > I never really optimised the BinIO path, because we use BinMem
> > exclusively in GHC.

I have also ported the Binary library to GHC 6 as part of my ginsu
project and done some work on improving its efficiency. In addition, I
have updated DrIFT so that it can derive both the old bitwise nhc98-style
Binary instances and the new GHC-style byte-based ones (the byte-based
version, which I use in ginsu, is significantly faster). The code can be
got from ginsu at http://repetae.net/computer/ginsu/

I also have a much improved PackedString, based on raw UTF-8 in memory
with optimised unboxed folding routines, which is designed to be very
fast to serialise with Binary. In ginsu, switching from String to
PackedString cut my memory footprint from 200 megabytes to 10 - quite a
nice improvement.

John
--
John Meacham - repetae.net - john
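P.S. A sketch of what the byte-based instances look like, in case it
helps. The type is invented and real DrIFT output is more involved, but
the shape is this (BinHandle, put_, putByte and getByte as in the
GHC-style Binary module; the module name is an assumption):

import Binary (Binary(..), BinHandle, putByte, getByte)

data CType = CVoid | CInt | CPtr CType   -- invented example type

instance Binary CType where
  -- one whole byte per constructor tag, instead of packing tags into
  -- the minimum number of bits as the old nhc98-style library does
  put_ bh CVoid    = putByte bh 0
  put_ bh CInt     = putByte bh 1
  put_ bh (CPtr t) = do putByte bh 2; put_ bh t
  get bh = do
    tag <- getByte bh
    case tag of
      0 -> return CVoid
      1 -> return CInt
      2 -> do t <- get bh; return (CPtr t)
      _ -> fail "corrupt CType tag"

The byte-aligned tags cost a little space, but reading and writing them
is much cheaper, which is where most of the speed difference comes from.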
RE: Looking for advice on profiling
On 09 November 2004 17:04, Duncan Coutts wrote:
> > Are you using BinMem, or BinIO?
>
> BinIO

Ah. BinIO is going to be a lot slower than BinMem, because it does an
hPutChar for each character, whereas BinMem just writes into an array. I
never really optimised the BinIO path, because we use BinMem exclusively
in GHC.

Cheers,
Simon
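P.S. A toy illustration of the difference, not the library's actual
code: the BinIO path pays a Handle operation for every byte, while the
BinMem path is a plain store into a mutable array (the real code also
checks bounds and grows the array, which I've omitted here):

import System.IO     (Handle, hPutChar)
import Data.Char     (chr)
import Data.Word     (Word8)
import Data.IORef    (IORef, readIORef, writeIORef)
import Data.Array.IO (IOUArray, writeArray)

-- BinIO-style: one buffered Handle operation per byte written
putByteIO :: Handle -> Word8 -> IO ()
putByteIO h w = hPutChar h (chr (fromIntegral w))

-- BinMem-style: a raw array store and a write-pointer bump
putByteMem :: IOUArray Int Word8 -> IORef Int -> Word8 -> IO ()
putByteMem arr ixRef w = do
  ix <- readIORef ixRef
  writeArray arr ix w
  writeIORef ixRef (ix + 1)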
Looking for advice on profiling
Hi all,

I'm looking for some advice on profiling, and any suggestions on what
might be going on with this program. I'm profiling c2hs, Manuel
Chakravarty's FFI preprocessor, with an eye to reducing its time and
space usage. I'm also trying to add binary serialisation.

c2hs basically works by parsing a load of C header files and building up
maps of information on function and type names. It then scans a .chs
file and produces a .hs file using the information it gained from the C
header files. Producing the .hs file is fairly quick once all the symbol
maps have been constructed; parsing the C headers and gathering the
information is fairly slow. So building a project that uses lots of .chs
files is quite slow, because the header files are re-parsed each time.
We originally modified c2hs to process multiple .chs files in one go so
that the work could be shared, but this style is not compatible with
most build tools (e.g. automake). Another approach is to dump a
representation of the symbol maps after parsing the header files and
then read it back for each .chs file.

So, enough background; here are the issues: the binary serialisation
seems to be incredibly slow and I can't figure out why. I'm now using
the Binary module pinched from the GHC sources, with instances derived
by DrIFT for all the c2hs data structures. I've also derived NFData
instances (DeepSeq) so that I can make sure that the data to be
serialised is already fully evaluated. I inserted timing points in the
program to see what's taking so long. Here's a sample output:

c2hs/c2hs +RTS -M400m -H300m -RTS -C-I/usr/include/glib-2.0 \
    -C-I/usr/lib/glib-2.0/include --precomp=glib.precomp glib.h
starting c2hs
elapsed time: 0.00 (start)
elapsed time: 0.00 (about to run cpp)
elapsed time: 0.01 (about to parse header)
elapsed time: 1.90 (about to emit warnings)
elapsed time: 1.90 (about to deepSeq header)
elapsed time: 5.24 (about to serialise header)
elapsed time: 52.47 (finished serialising header)
elapsed time: 52.47 (finish)

glib.h (and its includes, after cpp'ing) is about 3k lines, so it takes
a few seconds to parse the file and a few more to deepSeq the whole
structure. After that, however, it takes a very long time to serialise
the structure. glib.h is quite small in comparison to gtk/gtk.h, which
is about 25k lines, so gtk.h takes ages. In fact it uses so much memory
while serialising that it cannot complete on my machine with -H450m (the
most I can set on my 512Mb machine without swap thrashing). When I run
it in the original mode that does no serialisation, it completes fine in
about a minute.

The binary file produced for glib.h is about 4Mb in the end, which I
wouldn't call huge. We're using String rather than FastString, so that
slows things down a bit and takes more space, but even so. (I modified
the Binary module to store lists in length + elements format rather than
the 0 for [], 1 for (:) format, and chars to use just one byte, so
strings are not stored too inefficiently.) The file size for gtk.h is
about 20Mb (tested on a machine with more memory).

When I do time profiling, the big cost centres come up as putByte and
putWord. When I profile for space, it shows the large FiniteMaps
dominating almost everything else. I originally guessed from this that
the serialisation must be forcing loads of thunks, which is why it shows
up so highly on the profile. However, even after doing the deepSeq
before serialisation it takes a great deal of time, so I'm not sure
what's going on. The retainer profiling again shows that the FiniteMaps
are holding on to most stuff.
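For reference, the modified list encoding is essentially this
(simplified; BinHandle, put_ and get as in the Binary module pinched
from GHC):

import Binary (Binary(..), BinHandle)

putList :: Binary a => BinHandle -> [a] -> IO ()
putList bh xs = do
  put_ bh (length xs)        -- a single length prefix ...
  mapM_ (put_ bh) xs         -- ... then the elements, no per-cons tag bytes

getList :: Binary a => BinHandle -> IO [a]
getList bh = do
  n <- get bh :: IO Int
  sequence (replicate n (get bh))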
A major problem, no doubt, is space use. For the large gtk/gtk.h, when I
run with +RTS -B to get a beep on every major garbage collection, the
serialisation phase beeps continuously while the file grows.
Occasionally it seems to freeze for tens of seconds, not doing any
garbage collection and not doing any file output but using 100% CPU;
then it carries on outputting and garbage collecting furiously. I don't
know how to work out what's going on when it does that.

I don't understand how it can be generating so much garbage when it is
doing the serialisation on a structure that has already been fully
deepSeq'ed.

Any ideas? Where should I direct my investigation? I can supply
tarballs, profiling results and graphs if that would be helpful.
Unfortunately the code base is quite big, which makes things difficult.

Duncan
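P.S. In case the order of operations matters, the dump is essentially
this (simplified; I'm assuming deepSeq has the seq-like shape
deepSeq :: DeepSeq a => a -> b -> b, and openBinIO/put_ are from the
Binary module, so the module names here are assumptions too):

import System.IO (IOMode(WriteMode), openFile, hClose)
import Binary    (Binary(..), openBinIO)
import DeepSeq   (DeepSeq(..))

writePrecomp :: (DeepSeq a, Binary a) => FilePath -> a -> IO ()
writePrecomp path header =
  header `deepSeq` do        -- fully evaluate the structure first
    h  <- openFile path WriteMode
    bh <- openBinIO h
    put_ bh header           -- so this should be pure output, no thunks left
    hClose h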
Re: Looking for advice on profiling
Duncan Coutts [EMAIL PROTECTED] writes:

> I'm looking for some advice on profiling, and any suggestions on what
> might be going on with this program.

One suggestion might be to serialise (key,value) pairs to file as they
are first encountered, rather than waiting until they are all inside
FiniteMaps. That would eliminate the time you are currently spending on
lookups. (A subsequent run would then need to do the insertion of the
binary (key,value) pairs, rather than having them already ordered, but
at least you save the textual parsing cost there.)

> A major problem, no doubt, is space use. For the large gtk/gtk.h, when
> I run with +RTS -B to get a beep on every major garbage collection,
> the serialisation phase beeps continuously while the file grows.
> Occasionally it seems to freeze for tens of seconds, not doing any
> garbage collection and not doing any file output but using 100% CPU;
> then it carries on outputting and garbage collecting furiously. I
> don't know how to work out what's going on when it does that.

One guess might be generational collection: the fast beeps are for the
current generation, the pauses for the older generations?

Regards,
Malcolm
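P.S. To make the first suggestion concrete, something along these lines
(the names are invented apart from emptyFM/addToFM from the FiniteMap
library and put_/get from the Binary module, and the real c2hs key and
value types would replace the String placeholders):

import Binary    (Binary(..), BinHandle)
import FiniteMap (FiniteMap, emptyFM, addToFM)

type Key   = String   -- placeholders for the real c2hs types
type Value = String

-- writing: stream each pair out as it is produced; no map, no lookups
dumpPairs :: BinHandle -> [(Key, Value)] -> IO ()
dumpPairs bh pairs = do
  put_ bh (length pairs)
  mapM_ (put_ bh) pairs

-- reading: rebuild the map by inserting the (unordered) pairs
loadPairs :: BinHandle -> IO (FiniteMap Key Value)
loadPairs bh = do
  n     <- get bh :: IO Int
  pairs <- sequence (replicate n (get bh))
  return (foldl (\fm (k, v) -> addToFM fm k v) emptyFM pairs)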
RE: Looking for advice on profiling
On 09 November 2004 12:54, Duncan Coutts wrote:
[snip]
> When I do time profiling, the big cost centres come up as putByte and
> putWord. When I profile for space, it shows the large FiniteMaps
> dominating almost everything else. I originally guessed from this that
> the serialisation must be forcing loads of thunks, which is why it
> shows up so highly on the profile. However, even after doing the
> deepSeq before serialisation it takes a great deal of time, so I'm not
> sure what's going on.

Let's get the simple things out of the way first: make sure you're
compiling Binary with -O -funbox-strict-fields (very important). When
compiling for profiling, don't compile Binary with -auto-all, because
that will add cost centres to all the small functions and really skew
the profile. I find this is a good rule of thumb when profiling: avoid
-auto-all on the low-level libraries that you hope will be inlined a
lot.

You say your instances are created using DrIFT. I don't think we ever
modified DrIFT to generate the right kind of instances for the Binary
library in GHC, so are you using the instances designed for the nhc98
binary library? If so, make sure your instances are using put_ rather
than put, because the former will allow binary output to run in constant
stack space.

Are you using BinMem, or BinIO?

> The retainer profiling again shows that the FiniteMaps are holding on
> to most stuff.
>
> A major problem, no doubt, is space use. For the large gtk/gtk.h, when
> I run with +RTS -B to get a beep on every major garbage collection,
> the serialisation phase beeps continuously while the file grows.
> Occasionally it seems to freeze for tens of seconds, not doing any
> garbage collection and not doing any file output but using 100% CPU;
> then it carries on outputting and garbage collecting furiously. I
> don't know how to work out what's going on when it does that.

I agree with Malcolm's conjecture: it sounds like a very long major GC
pause.

> I don't understand how it can be generating so much garbage when it is
> doing the serialisation on a structure that has already been fully
> deepSeq'ed.

Yes, binary output *should* do zero allocation, and binary input should
only allocate the structure being created. The Binary library is quite
heavily tuned so that this is the case (if you compile with profiling
and -auto-all, it will almost certainly break this property, though).

Cheers,
Simon
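P.S. For the record, the kind of instance I mean, on an invented type:
strict fields so that -O -funbox-strict-fields can unpack them (the Int
here; a String field stays boxed), and put_ throughout, which returns no
result and so keeps output in constant stack space:

import Binary (Binary(..), BinHandle)

data SymInfo = SymInfo !Int !String   -- the strict Int unboxes under -funbox-strict-fields

instance Binary SymInfo where
  put_ bh (SymInfo tag name) = do
    put_ bh tag
    put_ bh name
  get bh = do
    tag  <- get bh
    name <- get bh
    return (SymInfo tag name)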
RE: Looking for advice on profiling
On Tue, 2004-11-09 at 14:45, Simon Marlow wrote:
> Let's get the simple things out of the way first: make sure you're
> compiling Binary with -O -funbox-strict-fields (very important). When
> compiling for profiling, don't compile Binary with -auto-all, because
> that will add cost centres to all the small functions and really skew
> the profile. I find this is a good rule of thumb when profiling: avoid
> -auto-all on the low-level libraries that you hope will be inlined a
> lot.

Ok, I was missing -funbox-strict-fields. I'll try that.

> You say your instances are created using DrIFT. I don't think we ever
> modified DrIFT to generate the right kind of instances for the Binary
> library in GHC, so are you using the instances designed for the nhc98
> binary library? If so, make sure your instances are using put_ rather
> than put, because the former will allow binary output to run in
> constant stack space.

It's using put_.

> Are you using BinMem, or BinIO?

BinIO.

> I agree with Malcolm's conjecture: it sounds like a very long major GC
> pause.

Right, ok.

> Yes, binary output *should* do zero allocation, and binary input
> should only allocate the structure being created. The Binary library
> is quite heavily tuned so that this is the case (if you compile with
> profiling and -auto-all, it will almost certainly break this property,
> though).

Yes, it's much better with optimisations. I'll try
-funbox-strict-fields and report back.

Duncan