yeah I think putting gzip or something in the loop is A LOT easier :-)

On Thu, Mar 26, 2015 at 6:29 PM, John Camara <john.m.cam...@gmail.com> wrote:
> Hi Fijal,
>
> To recap and continue the discussion from irc.
>
> We already discussed that the stack ids are based on a counter, which is
> good, but I also want to confirm that the ids have locality associated with
> the code.  That is, similar areas of the code will have similar ids.  Just
> to make sure they are not random with respect to the code, otherwise
> compression will not be helpful.  If the ids are random, that would need to
> be corrected first.
>
> Right now the stack traces are written to the file by repeating the
> following sequence:
>
> MARKER_STACKTRACE
> count
> depth
> stack
> stack
> ...
> stack
>
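> Something like this could read that layout back (untested sketch; I'm
> assuming the marker is a single byte and that count, depth, and the
> stack ids are all machine words, so check that against the real format):
>
>     import struct
>
>     MARKER_STACKTRACE = b'\x01'  # placeholder value, use the real one
>     WORD = struct.calcsize('l')  # 4 or 8 bytes depending on the platform
>
>     def parse_stacktraces(f):
>         # Yields (count, depth, stack_ids) for each record in the file.
>         marker = f.read(1)
>         while marker == MARKER_STACKTRACE:
>             count, depth = struct.unpack('2l', f.read(2 * WORD))
>             stack_ids = struct.unpack('%dl' % depth, f.read(depth * WORD))
>             yield count, depth, list(stack_ids)
>             marker = f.read(1)
>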
> In order to get a high compression ratio it would be better to combine
> multiple stacktraces and rearrange the data as follows:
>
> MARKER_COMPRESSED_STACKTRACES
> counts_compressed_length
> counts_compressed
> depths_compressed_length
> depths_compressed
> stacks_compressed_length
> stacks_compressed
>
> In order to build the compressed data you will want 3 pairs of buffers: a
> pair each for counts, depths, and stacks.  Your profiler would be writing
> to one set of buffers while another thread is responsible for compressing
> buffers that are full and writing them to the file.  Once a set of buffers
> is full the profiler would start filling up the other set.
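>
> Roughly, the handoff could look like this (just a sketch; the names are
> made up rather than anything in the existing code, and a queue stands in
> for the two-set swap so the writer thread owns a set until it's done):
>
>     import queue
>
>     BUFFERED_TRACES = 8000  # traces per buffer set, a starting guess
>
>     class BufferSet(object):
>         def __init__(self):
>             self.counts, self.depths, self.stacks = [], [], []
>
>         def full(self):
>             return len(self.counts) >= BUFFERED_TRACES
>
>     full_sets = queue.Queue()
>     active = BufferSet()
>
>     def record(count, depth, stack_ids):
>         # Called by the profiler for each stack trace.
>         global active
>         active.counts.append(count)
>         active.depths.append(depth)
>         active.stacks.extend(stack_ids)
>         if active.full():
>             full_sets.put(active)
>             active = BufferSet()
>
>     # started with threading.Thread(target=writer, args=(f,)).start()
>     def writer(out):
>         # Separate thread: compresses and writes each full buffer set.
>         while True:
>             bufs = full_sets.get()
>             if bufs is None:  # sentinel posted at shutdown
>                 break
>             write_compressed(out, bufs)  # see the rewrite sketch below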
>
> For each set of buffers you need a variable to hold the previous count,
> depth, and stack id.  They will be initialized to 0 before any data is
> written to an empty buffer.  Instead of writing the actual count value into
> the counts buffer you will write the difference between the current count
> and the previous count.  The reason for doing this is that the delta values
> will mostly be around 0, which will significantly improve the compression
> ratio without adding much overhead.  Of course you would do the same for
> depths and stack ids.
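>
> The delta step itself is tiny, something like:
>
>     def deltas(values):
>         # Replace each value with its difference from the previous one;
>         # prev starts at 0 for every fresh buffer, as described above.
>         prev, out = 0, []
>         for v in values:
>             out.append(v - prev)
>             prev = v
>         return out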
>
> When you compress the data, compress each buffer individually to make sure
> like data is being compressed together.  Like data compresses better than
> unlike data, and by saving deltas very few bits will be required to
> represent the data, so you are likely to have long strings of 0s and 1s.
>
> I'm sure now you can see why I don't want the stack ids to be random.  If
> they are random then the deltas will be all over the place, so you won't
> end up with long strings of 0s and 1s, and random data itself does not
> compress.
>
> To test this out I wouldn't bother modifying the C code, but instead try it
> out in Python first to make sure the compression is providing huge gains
> and to figure out how to tune the algorithm, without having to mess with
> the signal handlers, write the code for the separate thread, and deal with
> issues such as making sure you don't start writing to a buffer before the
> thread has finished writing its data to the file, etc.  I would just read
> an existing profile file and rewrite it to a different file by rearranging
> the data and compressing the deltas as I described.  You can get away with
> one set of buffers as you wouldn't be profiling at the same time.
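>
> As a rough shape for that prototype (untested; it leans on the
> parse_stacktraces(), deltas(), and BufferSet sketches above, picks lzma
> arbitrarily, and the marker byte and length fields are made up):
>
>     import lzma
>     import struct
>
>     MARKER_COMPRESSED_STACKTRACES = b'\x02'  # placeholder value
>
>     def write_compressed(out, bufs):
>         out.write(MARKER_COMPRESSED_STACKTRACES)
>         for column in (bufs.counts, bufs.depths, bufs.stacks):
>             raw = struct.pack('%dl' % len(column), *deltas(column))
>             blob = lzma.compress(raw)
>             out.write(struct.pack('I', len(blob)))  # compressed_length
>             out.write(blob)
>
>     def rewrite(infile, outfile):
>         with open(infile, 'rb') as f, open(outfile, 'wb') as out:
>             bufs = BufferSet()
>             for count, depth, stack_ids in parse_stacktraces(f):
>                 bufs.counts.append(count)
>                 bufs.depths.append(depth)
>                 bufs.stacks.extend(stack_ids)
>                 if bufs.full():
>                     write_compressed(out, bufs)
>                     bufs = BufferSet()
>             if bufs.counts:  # flush the partial final set
>                 write_compressed(out, bufs)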
>
> To tune this process you will need to determine the appropriate number of
> stack traces: small enough to keep memory use down but large enough that
> the overhead associated with compression stays small.  Maybe start off with
> about 8000 stack traces.  I would try gzip, bz2, and lzma and look at their
> compression ratios and times.  Gzip is generally faster than bz2, and lzma
> is the slowest.  On the other hand lzma provides the best compression and
> gzip the worst.  Since you will be compressing deltas you most likely can
> get away with using the fastest compression options under each compressor
> without affecting the compression ratio.  But I would test to verify this,
> as whether it holds depends on the data being compressed.  Also, one option
> that is available in lzma is the ability to set the width of the data to
> look at when looking for patterns.  Since you are saving 32 or 64 bit ints
> depending on the platform, you can set the option to either 4 or 8 bytes
> accordingly.  I don't believe gzip or bz2 have this option.  By setting
> this option in lzma you will likely improve the compression ratio.
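>
> For the comparison itself, all three compressors are in the stdlib, and
> the closest thing the lzma module has to that width option is its delta
> filter (a sketch; tune the presets/levels as needed):
>
>     import bz2, gzip, lzma, struct, time
>
>     def compare(delta_values, width):
>         # width should be 4 or 8 to match the platform's int size.
>         raw = struct.pack('%dl' % len(delta_values), *delta_values)
>         compressors = {
>             'gzip': lambda d: gzip.compress(d, compresslevel=1),
>             'bz2':  lambda d: bz2.compress(d, compresslevel=1),
>             'lzma': lambda d: lzma.compress(d, filters=[
>                 {'id': lzma.FILTER_DELTA, 'dist': width},
>                 {'id': lzma.FILTER_LZMA2, 'preset': 0},
>             ]),
>         }
>         for name, fn in compressors.items():
>             t0 = time.time()
>             blob = fn(raw)
>             print('%s: %.1fx in %.3fs'
>                   % (name, len(raw) / float(len(blob)), time.time() - t0))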
>
> You may find that counts and depths give similar compression across the 3
> compression types, in which case just use the fastest, which will likely be
> gzip.  On the other hand maybe the stack ids will be better off using lzma.
> This is another reason to separate out like data: it gives you the option
> to use the fastest compressor for some data types while using others where
> they provide better compression.
>
> I would not be surprised if this approach achieves a compression ratio
> better than 100x, but that will be heavily dependent on how local the
> stack ids are.  Also don't forget about simple things like not using 64
> bit ints when you can get away with smaller ones.
>
> Also, as a slight variation on the above: if you find most of your deltas
> are less than 128 you could write them out as 1 byte, and when they are 128
> or greater write them out as a 4 byte int with the high bit set.  If you do
> this then don't set the lzma option to 4 or 8 byte boundaries, as now your
> data is a mixture of 1 and 4 byte values.  This sometimes can provide huge
> reductions in compression time without much effect on the overall
> compression ratio.
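>
> Encoded and decoded, that variant might look like this (sketch; it
> assumes the deltas are non-negative and fit in 31 bits -- negative
> deltas would need something like zigzag mapping first, not shown):
>
>     import struct
>
>     def encode_value(v):
>         if v < 128:
>             return struct.pack('B', v)            # 1 byte, high bit clear
>         return struct.pack('>I', v | 0x80000000)  # 4 bytes, high bit set
>
>     def decode_values(buf):
>         out, i = [], 0
>         while i < len(buf):
>             if buf[i] & 0x80:  # high bit set: 4-byte value follows
>                 out.append(struct.unpack_from('>I', buf, i)[0] & 0x7fffffff)
>                 i += 4
>             else:
>                 out.append(buf[i])
>                 i += 1
>         return out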
>
> Hope you find this helpful.
>
> John
>
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev
