Interesting info. The more-than-an-order-of-magnitude difference in time between BaseList::remove and BaseList::remove_nth suggests that the for loop in BaseList::remove is falling off the end in many cases (i.e. attempting to remove an item that doesn't exist). Maybe that's what's broken.
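To illustrate the suspicion, here is a minimal sketch of a pointer list whose remove() is a linear search followed by remove_nth(). It is a hypothetical stand-in (TinyList, its fixed capacity, and the comparisons counter are made up for illustration), not Bro's actual BaseList code. It shows that a remove() of an absent item has to compare against every entry before giving up, while remove_nth() only pays for closing the gap, done here with one memmove.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a list of pointers; NOT Bro's BaseList.
struct TinyList {
    static const int kCapacity = 64;  // assumed fixed capacity, for brevity
    void* entry[kCapacity];
    int num_entries = 0;
    long comparisons = 0;             // instrumentation only

    void append(void* p) {
        assert(num_entries < kCapacity);
        entry[num_entries++] = p;
    }

    // Remove the element in slot n and close the gap with one memmove.
    // Returns the removed pointer, or nullptr if n is out of range.
    void* remove_nth(int n) {
        if (n < 0 || n >= num_entries)
            return nullptr;
        void* old = entry[n];
        --num_entries;
        std::memmove(entry + n, entry + n + 1,
                     static_cast<std::size_t>(num_entries - n) * sizeof(void*));
        return old;
    }

    // Linear search, then remove_nth. When p is absent the loop runs over
    // every entry, i ends up equal to num_entries, and remove_nth rejects it.
    void* remove(void* p) {
        int i = 0;
        for (; i < num_entries; ++i) {
            ++comparisons;
            if (entry[i] == p)
                break;
        }
        return remove_nth(i);
    }
};
```

With three entries, removing a pointer that was never appended costs three comparisons and returns nullptr, while removing the first entry costs one. If most calls are misses on a long list, the search would dominate the profile in the way the perf output suggests.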
On Fri, Oct 6, 2017 at 3:49 PM, Azoff, Justin S <jaz...@illinois.edu> wrote:

>
> > On Oct 6, 2017, at 5:59 PM, Jim Mellander <jmellan...@lbl.gov> wrote:
> >
> > I particularly like the idea of an allocation pool that per-packet
> > information can be stored, and reused by the next packet.
> >
> > There also are probably some optimizations of frequent operations now
> > that we're in a 64-bit world that could prove useful - the one's complement
> > checksum calculation in net_util.cc is one that comes to mind, especially
> > since it works effectively a byte at a time (and works with even byte
> > counts only). Seeing as this is done per-packet on all tcp payload,
> > optimizing this seems reasonable. Here's a discussion of do the checksum
> > calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/
> > - this website also has an x64 allocator that is claimed to be faster than
> > tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml (note:
> > I haven't tried anything from this source, but find it interesting).
> >
> > I'm guessing there are a number of such "small" optimizations that could
> > provide significant performance gains.
> >
> > Take care,
> >
> > Jim
>
> I've been messing around with 'perf top', the one's complement function
> often shows up fairly high up.. that, PriorityQueue::BubbleDown, and
> BaseList::remove
>
> Something (on our configuration?) is doing a lot of
> PQ_TimerMgr::~PQ_TimerMgr... I don't think I've come across that class
> before in bro.. I think a script may be triggering something that is
> hurting performance. I can't think of what it would be though.
>
> Running perf top on a random worker right now with -F 19999 shows:
>
> Samples: 485K of event 'cycles', Event count (approx.): 26046568975
> Overhead  Shared Object         Symbol
>   34.64%  bro                   [.] BaseList::remove
>    3.32%  libtcmalloc.so.4.2.6  [.] operator delete
>    3.25%  bro                   [.] PriorityQueue::BubbleDown
>    2.31%  bro                   [.] BaseList::remove_nth
>    2.05%  libtcmalloc.so.4.2.6  [.] operator new
>    1.90%  bro                   [.] Attributes::FindAttr
>    1.41%  bro                   [.] Dictionary::NextEntry
>    1.27%  libc-2.17.so          [.] __memcpy_ssse3_back
>    0.97%  bro                   [.] StmtList::Exec
>    0.87%  bro                   [.] Dictionary::Lookup
>    0.85%  bro                   [.] NameExpr::Eval
>    0.84%  bro                   [.] BroFunc::Call
>    0.80%  libtcmalloc.so.4.2.6  [.] tc_free
>    0.77%  libtcmalloc.so.4.2.6  [.] operator delete[]
>    0.70%  bro                   [.] ones_complement_checksum
>    0.60%  libtcmalloc.so.4.2.6  [.] tcmalloc::ThreadCache::ReleaseToCentralCache
>    0.60%  bro                   [.] RecordVal::RecordVal
>    0.53%  bro                   [.] UnaryExpr::Eval
>    0.51%  bro                   [.] ExprStmt::Exec
>    0.51%  bro                   [.] iosource::Manager::FindSoonest
>    0.50%  libtcmalloc.so.4.2.6  [.] operator new[]
>
> Which sums up to 59.2%
>
> BaseList::remove/BaseList::remove_nth seems particularly easy to
> optimize. Can't that loop be replaced by a memmove?
> I think something may be broken if it's being called that much though.
>
> --
> Justin Azoff
>
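For reference, the 64-bit checksum idea from the quoted message could look roughly like the sketch below. This is not the code in net_util.cc and not the code from the linked article; the function names checksum16 and checksum64 are made up for illustration. It assumes the usual Internet checksum rules: 16-bit words are summed with end-around carry, and an odd trailing byte is padded with a zero byte. Both functions return the folded sum without the final bitwise complement, in the machine's native byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fold a wide one's complement accumulator down to 16 bits.
inline uint16_t fold16(uint64_t sum) {
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return static_cast<uint16_t>(sum);
}

// Reference version (hypothetical name): one 16-bit word per iteration.
inline uint16_t checksum16(const uint8_t* data, std::size_t len) {
    uint64_t sum = 0;
    while (len >= 2) {
        uint16_t w;
        std::memcpy(&w, data, 2);   // memcpy avoids unaligned loads
        sum += w;
        data += 2;
        len -= 2;
    }
    if (len) {                      // odd trailing byte, zero padded
        uint16_t w = 0;
        std::memcpy(&w, data, 1);
        sum += w;
    }
    return fold16(sum);
}

// 64-bit version (hypothetical name): eight bytes per iteration,
// adding the carry out of bit 63 back in (end-around carry).
inline uint16_t checksum64(const uint8_t* data, std::size_t len) {
    uint64_t sum = 0;
    while (len >= 8) {
        uint64_t w;
        std::memcpy(&w, data, 8);
        sum += w;
        if (sum < w)                // unsigned wraparound means a carry
            ++sum;
        data += 8;
        len -= 8;
    }
    if (len) {                      // 1..7 leftover bytes, zero padded
        uint64_t w = 0;
        std::memcpy(&w, data, len);
        sum += w;
        if (sum < w)
            ++sum;
    }
    return fold16(sum);
}
```

Because the tail is zero padded, the sketch handles odd lengths as well as even ones. Whether it is actually faster than the existing function on real traffic would need to be measured, for example with the same perf top run.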
_______________________________________________
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev