Snappy and LZ4 are good algorithms to try too.

On Fri, Mar 27, 2015 at 8:55 AM, Maciej Fijalkowski <fij...@gmail.com> wrote:
> yeah I think putting gzip or something in the loop is A LOT easier :-)
>
> On Thu, Mar 26, 2015 at 6:29 PM, John Camara <john.m.cam...@gmail.com> wrote:
> > Hi Fijal,
> >
> > To recap and continue the discussion from IRC.
> >
> > We already discussed that the stack ids are based on a counter, which is good, but I also want to confirm that the ids have locality with respect to the code; that is, similar areas of the code should have similar ids. Just make sure they are not random with respect to the code, otherwise compression will not be helpful. If the ids are random, that would need to be corrected first.
> >
> > Right now the stack traces are written to the file as a repeating sequence:
> >
> > MARKER_STACKTRACE
> > count
> > depth
> > stack
> > stack
> > ...
> > stack
> >
> > In order to get a high compression ratio it would be better to combine multiple stack traces and rearrange the data as follows:
> >
> > MARKER_COMPRESSED_STACKTRACES
> > counts_compressed_length
> > counts_compressed
> > depths_compressed_length
> > depths_compressed
> > stacks_compressed_length
> > stacks_compressed
> >
> > To build the compressed data you will want 3 pairs of buffers: a pair each for counts, depths, and stacks. Your profiler would write to one set of buffers while another thread is responsible for compressing buffers that are full and writing them to the file. Once a set of buffers is full, the profiler starts filling up the other set.
> >
> > For each set of buffers you need a variable to hold the previous count, depth, and stack id, each initialized to 0 before any data is written to an empty buffer. Instead of writing the actual count value into the counts buffer, you write the difference between the current count and the previous count.
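The columnar, delta-encoded layout John describes can be sketched in Python. The record values and helper names here are hypothetical, chosen only to illustrate the rearrangement; the real profiler's on-disk format differs.

```python
import struct

def delta_encode(values):
    # Store each value as the difference from its predecessor,
    # with the "previous" value starting at 0.
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

# Hypothetical stack-trace records as (count, depth, stack_ids).
records = [
    (1, 3, [100, 101, 102]),
    (2, 3, [100, 101, 103]),
    (1, 4, [100, 101, 103, 104]),
]

# Columnar buffers: all counts together, all depths, all stack ids.
counts = delta_encode([c for c, _, _ in records])
depths = delta_encode([d for _, d, _ in records])
stacks = delta_encode([s for _, _, ids in records for s in ids])

# Pack one buffer as 32-bit ints, ready to hand to a compressor.
counts_buf = struct.pack("<%di" % len(counts), *counts)

print(counts)  # [1, 1, -1]
print(stacks)  # [100, 1, 1, -2, 1, 2, -3, 1, 2, 1]
```

Note how local stack ids turn the flattened id stream into small deltas that repeat across traces.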
> > The reason for doing this is that the delta values will mostly be near 0, which significantly improves the compression ratio without adding much overhead. Of course you would do the same for depths and stack ids.
> >
> > When you compress the data, compress each buffer individually so that like data is compressed together. Like data compresses better than unlike data, and with deltas very few bits are required to represent the data, so you are likely to get long runs of 0s and 1s.
> >
> > I'm sure now you can see why I don't want the stack ids to be random: if they are random, the deltas will be all over the place, you won't end up with long runs of 0s and 1s, and random data itself does not compress.
> >
> > To test this out I wouldn't bother modifying the C code. Instead, try it out in Python first, to make sure the compression provides huge gains and to figure out how to tune the algorithm, without having to mess with the signal handlers, write the code for the separate thread, and deal with issues such as making sure you don't start writing to a buffer before the thread has finished writing its data to the file, etc. I would just read an existing profile file and rewrite it to a different file, rearranging the data and compressing the deltas as I described. You can get away with one set of buffers since you wouldn't be profiling at the same time.
> >
> > To tune this process you will need to determine the appropriate number of stack traces per batch: small enough to keep memory use down but large enough that the overhead associated with compression stays small. Maybe start off with about 8000 stack traces. I would try gzip, bz2, and lzma and look at their compression ratios and times. gzip is generally faster than bz2, and lzma is the slowest; on the other hand, lzma provides the best compression and gzip the worst.
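A minimal harness for that tuning experiment, using Python's stdlib compressors on synthetic delta data. The sample deltas and batch size are made up, and the delta-filter line at the end is my guess at the lzma "width" option mentioned in the thread, not a confirmed mapping.

```python
import bz2
import gzip
import lzma
import struct
import time

def report(name, compress, data):
    # Time one compressor and print its ratio.
    t0 = time.perf_counter()
    out = compress(data)
    print("%-10s %6d -> %5d bytes (%5.1fx) in %.4fs"
          % (name, len(data), len(out), len(data) / len(out),
             time.perf_counter() - t0))

# Synthetic delta-encoded ids: mostly small values near 0,
# packed as 32-bit ints (~8000 traces per batch is one starting point).
deltas = [1, 0, -1, 2, 0, 1, -2, 0] * 1000
data = struct.pack("<%di" % len(deltas), *deltas)

report("gzip", gzip.compress, data)
report("bz2", bz2.compress, data)
report("lzma", lzma.compress, data)

# Possible analogue of the "width" knob: lzma's delta filter with
# dist=4 for 4-byte ints (an assumption on my part).
filters = [{"id": lzma.FILTER_DELTA, "dist": 4},
           {"id": lzma.FILTER_LZMA2, "preset": 0}]
report("lzma+delta", lambda d: lzma.compress(d, filters=filters), data)
```

Running this on real profile data, rather than a synthetic pattern, is what would actually settle the gzip/bz2/lzma trade-off.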
> > Since you will be compressing deltas, you can most likely get away with using the fastest compression settings of each compressor without hurting the compression ratio. But I would test this to verify, as whether it holds depends on the data being compressed. Also, one option available in lzma is the ability to set the width of the data it examines when looking for patterns. Since you are saving 32 or 64 bit ints depending on the platform, you can set that option to either 4 or 8 bytes accordingly. I don't believe gzip or bz2 have this option. Setting it in lzma will likely improve the compression ratio.
> >
> > You may find that counts and depths compress similarly under all 3 compressors, in which case just use the fastest, which will likely be gzip. On the other hand, the stack ids may be better off with lzma. This is another reason to separate out like data: it gives you the option of using the fastest compressor for some data types while using others where they provide better compression.
> >
> > I would not be surprised if this approach achieves a compression ratio better than 100x, but that will be heavily dependent on how local the stack ids are. Also, don't forget simple things like not using 64 bit ints when you can get away with smaller ones.
> >
> > A slight variation on the above: if you find most of your deltas are < 127, you could write them out as 1 byte, and when greater than 127 write them out as a 4 byte int with the high bit set. If you do this, don't set the lzma option to 4 or 8 byte boundaries, as your data is now a mixture of 1 and 4 byte values. This can sometimes provide huge reductions in compression time without much effect on the overall compression ratio.
> >
> > Hope you find this helpful.
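The 1-byte/4-byte variation might look like the sketch below. It assumes non-negative deltas (signed deltas would need something like zig-zag encoding first, which the thread does not cover), and the high bit of the leading byte serves as the escape marker, capping values at 31 bits.

```python
import struct

def encode_delta(value):
    # Deltas that fit in 7 bits go out as a single byte (high bit
    # clear); anything bigger becomes a 4-byte big-endian int with
    # the high bit set as the escape marker.
    if 0 <= value <= 127:
        return bytes([value])
    return struct.pack(">I", value | 0x80000000)

def decode_deltas(buf):
    out, i = [], 0
    while i < len(buf):
        if buf[i] & 0x80:
            # 4-byte escaped value: mask off the marker bit.
            out.append(struct.unpack_from(">I", buf, i)[0] & 0x7FFFFFFF)
            i += 4
        else:
            out.append(buf[i])
            i += 1
    return out

encoded = b"".join(encode_delta(v) for v in [1, 0, 5, 300, 2])
print(len(encoded))            # 8: four 1-byte values + one 4-byte value
print(decode_deltas(encoded))  # [1, 0, 5, 300, 2]
```

As the email notes, feeding this mixed-width stream to lzma means the fixed 4/8-byte alignment option no longer applies.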
> >
> > John
> >
> > _______________________________________________
> > pypy-dev mailing list
> > pypy-dev@python.org
> > https://mail.python.org/mailman/listinfo/pypy-dev

--
Leonardo Santagada