yeah I think putting gzip or something in the loop is A LOT easier :-)
On Thu, Mar 26, 2015 at 6:29 PM, John Camara <john.m.cam...@gmail.com> wrote:
> Hi Fijal,
>
> To recap and continue the discussion from IRC.
>
> We already discussed that the stack ids are based on a counter, which is
> good, but I also want to confirm that the ids have locality associated
> with the code, i.e. that similar areas of the code get similar ids. Just
> to make sure they are not random with respect to the code, otherwise
> compression will not be helpful. If the ids are random, that would need
> to be corrected first.
>
> Right now the stack traces are written to the file by repeating the
> following sequence:
>
> MARKER_STACKTRACE
> count
> depth
> stack
> stack
> ...
> stack
>
> In order to get a high compression ratio it would be better to combine
> multiple stack traces and rearrange the data as follows:
>
> MARKER_COMPRESSED_STACKTRACES
> counts_compressed_length
> counts_compressed
> depths_compressed_length
> depths_compressed
> stacks_compressed_length
> stacks_compressed
>
> In order to build the compressed data you will want 3 pairs of buffers:
> one pair each for counts, depths, and stacks. Your profiler would write
> to one set of buffers while another thread is responsible for compressing
> buffers that are full and writing them to the file. Once a set of buffers
> is full, the profiler starts filling up the other set.
>
> For each set of buffers you need a variable to hold the previous count,
> depth, and stack id, each initialized to 0 before any data is written to
> an empty buffer. Instead of writing the actual count value into the
> counts buffer, you write the difference between the current count and the
> previous count. The reason for doing this is that the delta values will
> mostly be near 0, which significantly improves the compression ratio
> without adding much overhead. Of course you would do the same for depths
> and stack ids.
>
> When you compress the data, you compress each buffer individually so that
> like data is compressed together. Like data compresses better than unlike
> data, and with deltas very few bits are needed to represent the data, so
> you are likely to get long runs of 0s and 1s.
>
> I'm sure now you can see why I don't want the stack ids to be random: if
> they are random, the deltas will be all over the place, you won't end up
> with long runs of 0s and 1s, and random data itself does not compress.
>
> To test this out I wouldn't bother modifying the C code; instead, try it
> out in Python first, to make sure the compression provides big gains and
> to figure out how to tune the algorithm, without having to mess with the
> signal handlers, write the code for the separate thread, and deal with
> issues such as making sure you don't start writing to a buffer before the
> thread has finished writing its data to the file, etc. I would just read
> an existing profile file and rewrite it to a different file, rearranging
> the data and compressing the deltas as described. You can get away with
> one set of buffers since you wouldn't be profiling at the same time.
>
> To tune this process you will need to determine a number of stack traces
> per block that is small enough to keep memory down but large enough that
> the overhead associated with compression is small. Maybe start off with
> about 8000 stack traces. I would try gzip, bz2, and lzma and look at
> their compression ratios and times. Gzip is generally faster than bz2,
> and lzma is the slowest.
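
For what it's worth, here is a rough, untested sketch of that offline
experiment in Python. It assumes the (count, depth, stack ids) records have
already been parsed out of an existing profile (the parsing and the real
vmprof file format are not shown), and the function names are made up:

    import bz2, gzip, lzma, struct, time

    def delta_encode(values):
        # Store each value as the difference from the previous one; with
        # good id locality the deltas cluster near 0 and compress well.
        prev, deltas = 0, []
        for v in values:
            deltas.append(v - prev)
            prev = v
        return deltas

    def pack(values):
        # Deltas can be negative, so pack signed 32-bit little-endian ints.
        return struct.pack('<%di' % len(values), *values)

    def compare(records):
        # records: iterable of (count, depth, stack_ids) tuples, one per
        # MARKER_STACKTRACE entry read from an existing profile file.
        counts = [c for c, d, s in records]
        depths = [d for c, d, s in records]
        stacks = [sid for c, d, s in records for sid in s]
        for name, stream in [('counts', counts), ('depths', depths),
                             ('stacks', stacks)]:
            raw = pack(delta_encode(stream))
            for cname, compress in [('gzip', gzip.compress),
                                    ('bz2', bz2.compress),
                                    ('lzma', lzma.compress)]:
                t0 = time.time()
                blob = compress(raw)
                print('%6s / %4s: %6.1fx in %.3fs' % (
                    name, cname, len(raw) / float(len(blob)),
                    time.time() - t0))

Keeping the three streams separate is the point: each one gets its own
ratio/time numbers, so you can pick a different compressor per stream, as
John suggests further down.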
> On the other hand, lzma provides the best compression and gzip the
> worst. Since you will be compressing deltas, you can most likely get
> away with using the fastest compression options of each compressor
> without hurting the compression ratio. But I would test this to verify
> it, as whether it holds depends on the data being compressed. Also, one
> option available in lzma is the ability to set the width of the data to
> look at when searching for patterns. Since you are saving 32- or 64-bit
> ints depending on the platform, you can set that option to either 4 or 8
> bytes accordingly. I don't believe gzip or bz2 have this option. Setting
> it in lzma will likely improve the compression ratio.
>
> You may find that counts and depths compress similarly under all 3
> compressors, in which case just use the fastest, which will likely be
> gzip. On the other hand, maybe the stack ids will be better off with
> lzma. This is another reason to separate out like data: it gives you the
> option of using the fastest compressor for some data types while using
> others where they provide better compression.
>
> I would not be surprised if this approach achieves a compression ratio
> better than 100x, but that will be heavily dependent on how local the
> stack ids are. Also don't forget simple things like not using 64-bit
> ints when you can get away with smaller ones.
>
> Also, a slight variation on the above: if you find most of your deltas
> are < 127 you could write them out as 1 byte, and when greater than 127
> write them out as a 4-byte int with the high bit set. If you do this,
> don't set the lzma width option to 4- or 8-byte boundaries, since your
> data is now a mixture of 1- and 4-byte values. This can sometimes provide
> a huge reduction in compression time with little effect on the overall
> compression ratio.
>
> Hope you find this helpful.
>
> John
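
That last byte-oriented variant might look something like the sketch below
(again untested). The zigzag step is my own addition, since the description
above only covers non-negative deltas; with it, small negative deltas also
fit in one byte:

    import struct

    def zigzag(v):
        # Map signed 32-bit deltas to non-negative ints:
        # 0, -1, 1, -2 -> 0, 1, 2, 3.
        return (v << 1) ^ (v >> 31)

    def encode(deltas):
        out = bytearray()
        for v in deltas:
            u = zigzag(v)
            if u < 0x80:
                out.append(u)                 # 1 byte, high bit clear
            else:
                assert u < 1 << 31            # the tag bit costs one bit
                out += struct.pack('>I', u | 0x80000000)  # 4 bytes, tagged
        return bytes(out)

    def decode(data):
        values, i = [], 0
        while i < len(data):
            if data[i] < 0x80:
                u, i = data[i], i + 1
            else:
                u = struct.unpack_from('>I', data, i)[0] & 0x7fffffff
                i += 4
            values.append((u >> 1) ^ -(u & 1))  # undo the zigzag mapping
        return values

You would run encode() over each delta stream in place of the fixed-width
pack() in the earlier sketch, before handing the bytes to the compressor.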