I have some ideas as well:
- Calculating size prior to sending:
- We could write zeroes, write to the buffer... come back to the previous
position.. write the size instead of calculating it.
I have read this code a lot.. and I wouldn't rewrite the code.. just optimize
these cases... I wouldn't optimize it for every possible case TBH.. just on
message Delivery and Settling unless you want to also optimize other cases for
use cases that I'm not aware at the moment.
other things that could boost performance based on the micro benchmark I wrote:
- Using Integer, Long.. etc..inside of UnsignedInt, UnsignedLong would give you
a good boost in performance. The JDK is already optimized to box these types...
while UnsignedInt, UnsignedLong.. etc.. its not well optimized.
- Reusing buffers.. maybe adding a framework where we could reuse buffers.. or
delegate into other frameworks (e.g. Netty).
On May 5, 2014, at 5:40 PM, Clebert Suconic <csuco...@redhat.com> wrote:
> On May 5, 2014, at 5:02 PM, Rafael Schloming <r...@alum.mit.edu> wrote:
>> Hi Clebert,
>> Sorry for taking so long to follow up with the benchmark. I've been tweaking
>> it and using it to do some memory and CPU profiling.
>> The good news is I was able to significantly reduce memory utilization by
>> making some simple changes to CollectorImpl. The benchmark triggers over 42
>> million events. The way CollectorImpl was coded initially this would create
>> two throwaway objects for each event. This was ending up somewhere north of
>> 84 million throw away objects over the course of the benchmark. One being
>> the event itself, and the other being the linked list node. I changed
>> CollectorImpl to use a simple chain instead of java.util's linked list and
>> also to pool/reuse popped events. The same benchmark now only results in
>> about 250 actual event objects being created in total in order to handle the
>> same 42 million events.
>> While this reduces memory pressure a lot, surprisingly enough, the event
>> related objects were not the biggest source of garbage. At the top is
>> java.nio.HeapByteBuffer (I haven't looked into this, but I suspect we're
>> throwing away some of these that we could be reusing), and the second
>> biggest source of garbage is org.apache.qpid.proton.amqp.UnsignedInteger.
>> The latter one I find a bit more concerning as its use is fairly
>> interwingled into the design of the current codec and as a result it is
>> likely less straightforward to address.
>> On the CPU profiling front I've noticed a couple of interesting things.
>> First off it does appear that much of the processing time is spent in codec,
>> roughly 63%, but the cost is not symmetrically divided between encode and
>> decode. Encoding accounts for about 46% of the total, and decoding about
>> 17%. I think this may be why its hard to measure the effect of your patch on
>> this benchmark. The decoding cost just isn't all that high compared to
>> encoding. I did the same profiling again with your patch applied and
>> decoding dropped to about 10% of the total while encoding increased to about
>> 50% of the total.
>> Digging into the encoding side a bit, it appears that a significant amount
>> of time is being spent calculating the encoded size of a value prior to
>> writing its encoded representation to the wire. One of the optimizations
>> I've had success with in the past (both on previous Java codecs and in the C
>> codec) is to avoid calculating the size up front and instead simply reserve
>> the necessary space for it and fill it in after the encoded representation
>> has been written. In the past this has close to doubled the performance of
>> encode since calculating the encoded size is often as expensive as simply
>> doing the encoding. Unfortunately I'm guessing this kind of thing would
>> probably require a major rework of the codec.
>> To summarize I think there are really two design related issues we will need
>> to address in order to achieve optimum performance. On the memory front, I
>> think the fact that every described type is rendered into a tree of generic
>> objects on both decode/encode is going to be problematic. The strategy
>> you've taken in your patch to special case certain frames and eliminate the
>> intermediate list objects helps with this, but I think we could do a whole
>> lot better if we were to adopt a design that did not require any
>> intermediate objects at all. On the CPU front, I think we'll get the biggest
>> bang for our buck if we look into a design that doesn't require calculating
>> the size up front.
>> I have some ideas in mind for a new design that I hope will address both of
>> these issues. I'm going to write them up in a separate post.
>> Regarding your patch, I'm happy to apply it, but I suspect that much of the
>> current codec layer would need to be modified and/or replaced to address the
>> above findings. Let me know how you would like to proceed.
> It's all consistent with what I have seen... I have also realized the
> calculating encode size prior to sending.
> I would prefer if we could evolve in top of what did. I think the patch that
> I did is one step further on avoiding intermediate objects... I have seen
> already other cases where we could avoid it... It's all an evolution... and
> I'm still working into other cases... if you take this patch now we would
> move a step further.
> the patch is not only addressing the codec optimizations, but it's also
> avoiding multiple instances of the codec itself.. what makes it lighter.
> I'm now working on a benchmark based on yours that will be closer to what I
> will be using.