On May 5, 2014, at 5:02 PM, Rafael Schloming <r...@alum.mit.edu> wrote:

> Hi Clebert,
> 
> Sorry for taking so long to follow up with the benchmark. I've been tweaking 
> it and using it to do some memory and CPU profiling.
> 
> The good news is I was able to significantly reduce memory utilization by 
> making some simple changes to CollectorImpl. The benchmark triggers over 42 
> million events. The way CollectorImpl was initially coded, this would create 
> two throwaway objects for each event: the event itself and the linked list 
> node. That added up to somewhere north of 84 million throwaway objects over 
> the course of the benchmark. I changed 
> CollectorImpl to use a simple chain instead of java.util's linked list and 
> also to pool/reuse popped events. The same benchmark now only results in 
> about 250 actual event objects being created in total in order to handle the 
> same 42 million events.
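> 
> For illustration, roughly the shape of that change (a minimal sketch with 
> made-up names, not the actual CollectorImpl code):
> 
>     // Chains events directly and recycles popped nodes instead of
>     // allocating a new wrapper object per event.
>     class PooledCollector {
>         static class EventNode {
>             Object context;   // the event's type/context would live here
>             EventNode next;
>         }
> 
>         private EventNode head, tail;  // live events, oldest first
>         private EventNode free;        // popped nodes available for reuse
> 
>         EventNode put(Object context) {
>             EventNode node = free;
>             if (node != null) {
>                 free = node.next;          // reuse a recycled node
>             } else {
>                 node = new EventNode();    // allocate only when the pool is empty
>             }
>             node.context = context;
>             node.next = null;
>             if (tail == null) { head = node; } else { tail.next = node; }
>             tail = node;
>             return node;
>         }
> 
>         EventNode peek() { return head; }
> 
>         void pop() {
>             EventNode node = head;
>             if (node == null) return;
>             head = node.next;
>             if (head == null) tail = null;
>             node.context = null;  // drop the reference so the node is safe to reuse
>             node.next = free;     // return the node to the pool
>             free = node;
>         }
>     }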
> 
> While this reduces memory pressure a lot, surprisingly enough, the event 
> related objects were not the biggest source of garbage. At the top is 
> java.nio.HeapByteBuffer (I haven't looked into this, but I suspect we're 
> throwing away some of these that we could be reusing), and the second biggest 
> source of garbage is org.apache.qpid.proton.amqp.UnsignedInteger. The latter 
> I find a bit more concerning, as its use is fairly intertwined with the 
> design of the current codec and is therefore likely less straightforward to 
> address.
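> 
> To illustrate the kind of reuse that might help, here is a hedged sketch of a 
> small-value cache for an immutable unsigned-integer wrapper (a hypothetical 
> class, not proton's actual UnsignedInteger):
> 
>     // Shares instances for the most common small values so repeated decodes
>     // of the same numbers do not allocate a fresh wrapper every time.
>     final class CachedUInt {
>         private static final CachedUInt[] CACHE = new CachedUInt[256];
>         static {
>             for (int i = 0; i < CACHE.length; i++) {
>                 CACHE[i] = new CachedUInt(i);
>             }
>         }
> 
>         private final int value;  // the unsigned value, stored in an int
> 
>         private CachedUInt(int value) { this.value = value; }
> 
>         static CachedUInt valueOf(int value) {
>             if (value >= 0 && value < CACHE.length) {
>                 return CACHE[value];       // shared instance, no allocation
>             }
>             return new CachedUInt(value);  // fall back to allocating
>         }
> 
>         long longValue() { return value & 0xFFFFFFFFL; }
>     }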
> 
> On the CPU profiling front I've noticed a couple of interesting things. First 
> off it does appear that much of the processing time is spent in codec, 
> roughly 63%, but the cost is not symmetrically divided between encode and 
> decode. Encoding accounts for about 46% of the total, and decoding about 17%. 
> I think this may be why it's hard to measure the effect of your patch on this 
> benchmark. The decoding cost just isn't all that high compared to encoding. I 
> did the same profiling again with your patch applied and decoding dropped to 
> about 10% of the total while encoding increased to about 50% of the total.
> 
> Digging into the encoding side a bit, it appears that a significant amount of 
> time is being spent calculating the encoded size of a value prior to writing 
> its encoded representation to the wire. One of the optimizations I've had 
> success with in the past (both on previous Java codecs and in the C codec) is 
> to avoid calculating the size up front and instead simply reserve the 
> necessary space for it and fill it in after the encoded representation has 
> been written. In the past this has close to doubled the performance of encode 
> since calculating the encoded size is often as expensive as simply doing the 
> encoding. Unfortunately I'm guessing this kind of thing would probably 
> require a major rework of the codec.
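> 
> The idea, as a minimal sketch (the names below are stand-ins, not the current 
> codec's API):
> 
>     import java.nio.ByteBuffer;
> 
>     // Reserve space for the size field, encode the body directly, then patch
>     // the size in afterwards; no separate size-calculation pass is needed.
>     final class DeferredSizeEncoder {
>         interface Encodable {
>             void writeBody(ByteBuffer buf);  // hypothetical hook for a value's encoding
>         }
> 
>         static void encode(ByteBuffer buf, Encodable value) {
>             int sizePos = buf.position();    // where the 4-byte size field will go
>             buf.position(sizePos + 4);       // reserve it and skip past it
>             value.writeBody(buf);            // write the encoded representation
>             int end = buf.position();
>             buf.putInt(sizePos, end - sizePos - 4);  // fill in the size afterwards
>         }
>     }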
> 
> To summarize I think there are really two design related issues we will need 
> to address in order to achieve optimum performance. On the memory front, I 
> think the fact that every described type is rendered into a tree of generic 
> objects on both decode/encode is going to be problematic. The strategy you've 
> taken in your patch to special case certain frames and eliminate the 
> intermediate list objects helps with this, but I think we could do a whole 
> lot better if we were to adopt a design that did not require any intermediate 
> objects at all. On the CPU front, I think we'll get the biggest bang for our 
> buck if we look into a design that doesn't require calculating the size up 
> front.
> 
> I have some ideas in mind for a new design that I hope will address both of 
> these issues. I'm going to write them up in a separate post.
> 
> Regarding your patch, I'm happy to apply it, but I suspect that much of the 
> current codec layer would need to be modified and/or replaced to address the 
> above findings. Let me know how you would like to proceed.



It's all consistent with what I have seen... I had also noticed the cost of 
calculating the encoded size prior to sending.


I would prefer if we could evolve on top of what I did. I think the patch I 
did is one step further on avoiding intermediate objects... I have already 
seen other cases where we could avoid them... It's all an evolution, and I'm 
still working on other cases... if you take this patch now we would move a 
step further.

The patch is not only addressing the codec optimizations; it also avoids 
creating multiple instances of the codec itself, which makes it lighter.

I'm now working on a benchmark based on yours that will be closer to what I 
will be using.
