On May 5, 2014, at 5:02 PM, Rafael Schloming <r...@alum.mit.edu> wrote:
> Hi Clebert,
>
> Sorry for taking so long to follow up with the benchmark. I've been tweaking it and using it to do some memory and CPU profiling.
>
> The good news is I was able to significantly reduce memory utilization by making some simple changes to CollectorImpl. The benchmark triggers over 42 million events. The way CollectorImpl was coded initially, this would create two throwaway objects for each event: one being the event itself, and the other being the linked list node. This was ending up somewhere north of 84 million throwaway objects over the course of the benchmark. I changed CollectorImpl to use a simple chain instead of java.util's linked list, and also to pool/reuse popped events. The same benchmark now results in only about 250 actual event objects being created in total to handle the same 42 million events.
>
> While this reduces memory pressure a lot, surprisingly enough, the event-related objects were not the biggest source of garbage. At the top is java.nio.HeapByteBuffer (I haven't looked into this, but I suspect we're throwing away some of these that we could be reusing), and the second biggest source of garbage is org.apache.qpid.proton.amqp.UnsignedInteger. The latter I find a bit more concerning, as its use is fairly intertwined with the design of the current codec and as a result is likely less straightforward to address.
>
> On the CPU profiling front I've noticed a couple of interesting things. First off, it does appear that much of the processing time is spent in codec, roughly 63%, but the cost is not symmetrically divided between encode and decode. Encoding accounts for about 46% of the total, and decoding about 17%. I think this may be why it's hard to measure the effect of your patch on this benchmark: the decoding cost just isn't all that high compared to encoding. I did the same profiling again with your patch applied, and decoding dropped to about 10% of the total while encoding increased to about 50%.
>
> Digging into the encoding side a bit, it appears that a significant amount of time is spent calculating the encoded size of a value prior to writing its encoded representation to the wire. One of the optimizations I've had success with in the past (both on previous Java codecs and in the C codec) is to avoid calculating the size up front and instead simply reserve the necessary space for it and fill it in after the encoded representation has been written. In the past this has close to doubled the performance of encode, since calculating the encoded size is often as expensive as simply doing the encoding. Unfortunately, I'm guessing this kind of thing would probably require a major rework of the codec.
>
> To summarize, I think there are really two design-related issues we will need to address in order to achieve optimum performance. On the memory front, the fact that every described type is rendered into a tree of generic objects on both decode and encode is going to be problematic. The strategy you've taken in your patch, special-casing certain frames and eliminating the intermediate list objects, helps with this, but I think we could do a whole lot better if we adopted a design that did not require any intermediate objects at all. On the CPU front, I think we'll get the biggest bang for our buck if we look into a design that doesn't require calculating the size up front.
> I have some ideas in mind for a new design that I hope will address both of these issues. I'm going to write them up in a separate post.
>
> Regarding your patch, I'm happy to apply it, but I suspect that much of the current codec layer would need to be modified and/or replaced to address the above findings. Let me know how you would like to proceed.

It's all consistent with what I have seen... I had also noticed the cost of calculating the encoded size prior to sending. I would prefer if we could evolve on top of what I did. I think my patch is one step further toward avoiding intermediate objects, and I have already seen other cases where we could avoid them. It's all an evolution, and I'm still working on those other cases... if you take this patch now we would move a step further. The patch is not only addressing the codec optimizations; it also avoids creating multiple instances of the codec itself, which makes it lighter. I'm now working on a benchmark based on yours that will be closer to what I will actually be using.
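
To make sure we're talking about the same thing, here's a rough sketch of how I read the CollectorImpl change described above: events carry their own "next" link (so there is no separate list-node object), and popped events are recycled through a free list instead of being discarded. The names here (EventImpl, Collector, put/peek/pop) are just illustrative, not the actual proton-j code:

    // Illustrative sketch only -- not the real CollectorImpl. Events form
    // an intrusive singly linked chain (no java.util list nodes), and
    // popped events go onto a free list for reuse instead of becoming
    // garbage.
    final class EventImpl {
        Object context;
        Object type;
        EventImpl next;   // intrusive link; also serves as the free-list link
    }

    final class Collector {
        private EventImpl head;  // oldest pending event
        private EventImpl tail;  // newest pending event
        private EventImpl free;  // pool of recycled events

        public void put(Object context, Object type) {
            EventImpl ev = free;
            if (ev == null) {
                ev = new EventImpl();  // allocate only when the pool is empty
            } else {
                free = ev.next;
            }
            ev.context = context;
            ev.type = type;
            ev.next = null;
            if (tail == null) {
                head = tail = ev;
            } else {
                tail.next = ev;
                tail = ev;
            }
        }

        public EventImpl peek() {
            return head;
        }

        public void pop() {
            EventImpl ev = head;
            if (ev == null) {
                return;
            }
            head = ev.next;
            if (head == null) {
                tail = null;
            }
            ev.context = null;  // clear references so pooled events don't pin them
            ev.type = null;
            ev.next = free;     // recycle onto the free list
            free = ev;
        }
    }

With a scheme like this, the total number of event objects ever allocated is bounded by the maximum number of events pending at once rather than by the event count, which lines up with the ~250 objects reported for 42 million events.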
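Similarly, the "reserve the space and fill it in afterwards" encode optimization mentioned above could look something like the following for an AMQP list32. This is just a sketch against java.nio.ByteBuffer under my own assumptions about the buffer handling; it is not the real codec API:

    import java.nio.ByteBuffer;

    // Illustrative sketch of encoding without a separate size-calculation
    // pass: reserve the size field, write the body, then patch the size in
    // once the end position is known.
    final class BackfillEncoder {
        interface Body {
            int write(ByteBuffer buf);  // writes the elements, returns their count
        }

        static void writeList32(ByteBuffer buf, Body body) {
            buf.put((byte) 0xd0);         // list32 format code
            int sizePos = buf.position();
            buf.position(sizePos + 8);    // reserve size (4 bytes) + count (4 bytes)
            int count = body.write(buf);  // single pass over the value
            int end = buf.position();
            buf.putInt(sizePos, end - sizePos - 4);  // size excludes the size field itself
            buf.putInt(sizePos + 4, count);
        }
    }

The point is that the value is walked only once: the size field is skipped over during the walk and patched in at the end, instead of being computed by a separate up-front sizing pass that costs about as much as the encode itself.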