> On Oct 6, 2017, at 5:59 PM, Jim Mellander <jmellan...@lbl.gov> wrote:
> 
> I particularly like the idea of an allocation pool in which per-packet 
> information can be stored and reused by the next packet.
> 
> There also are probably some optimizations of frequent operations now that 
> we're in a 64-bit world that could prove useful - the one's complement 
> checksum calculation in net_util.cc is one that comes to mind, especially 
> since it works effectively a byte at a time (and works with even byte counts 
> only).  Seeing as this is done per-packet on all tcp payload, optimizing this 
> seems reasonable.  Here's a discussion of doing the checksum calc in 64-bit 
> arithmetic: https://locklessinc.com/articles/tcp_checksum/ - this website 
> also has an x64 allocator that is claimed to be faster than tcmalloc, see: 
> https://locklessinc.com/benchmarks_allocator.shtml  (note: I haven't tried 
> anything from this source, but find it interesting).
> 
> I'm guessing there are a number of such "small" optimizations that could 
> provide significant performance gains.
> 
> Take care,
> 
> Jim

I've been messing around with 'perf top', and the one's complement function 
often shows up fairly high, along with PriorityQueue::BubbleDown and 
BaseList::remove.
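
For reference, here is a rough sketch of the 64-bit accumulation idea from the article Jim linked. This is not the code in net_util.cc; the function names are made up for illustration, and both functions work on 16-bit words in native byte order and return the folded (not inverted) sum. The first is a plain 16-bit-at-a-time reference, the second reads 8 bytes per iteration and folds the carries at the end. It also handles odd lengths by zero-padding the last byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical reference version: adds one 16-bit word per iteration.
uint16_t ones_complement_sum16(const void* data, size_t len)
{
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint64_t sum = 0;

    while (len >= 2) {
        uint16_t w;
        std::memcpy(&w, p, 2);   // memcpy avoids unaligned-access problems
        sum += w;
        p += 2;
        len -= 2;
    }
    if (len) {                   // odd trailing byte, padded with a zero byte
        uint16_t w = 0;
        std::memcpy(&w, p, 1);
        sum += w;
    }
    while (sum >> 16)            // fold carries back into the low 16 bits
        sum = (sum & 0xffff) + (sum >> 16);
    return static_cast<uint16_t>(sum);
}

// Hypothetical 64-bit version: reads 8 bytes per iteration.
uint16_t ones_complement_sum64(const void* data, size_t len)
{
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint64_t sum = 0;

    while (len >= 8) {
        uint64_t w;
        std::memcpy(&w, p, 8);
        // Add the two 32-bit halves separately so the 64-bit accumulator
        // cannot overflow for any realistic buffer size.
        sum += (w & 0xffffffffu) + (w >> 32);
        p += 8;
        len -= 8;
    }
    if (len >= 4) {
        uint32_t w;
        std::memcpy(&w, p, 4);
        sum += w;
        p += 4;
        len -= 4;
    }
    if (len >= 2) {
        uint16_t w;
        std::memcpy(&w, p, 2);
        sum += w;
        p += 2;
        len -= 2;
    }
    if (len) {                   // odd trailing byte, padded with a zero byte
        uint16_t w = 0;
        std::memcpy(&w, p, 1);
        sum += w;
    }
    while (sum >> 16)            // fold 64 bits down to 16
        sum = (sum & 0xffff) + (sum >> 16);
    return static_cast<uint16_t>(sum);
}
```

The two should return the same value for any buffer, since 2^16 is congruent to 1 modulo 65535. Whether the wider loads actually help here would need to be measured.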

Something (in our configuration?) is calling PQ_TimerMgr::~PQ_TimerMgr a 
lot. I don't think I've come across that class before in bro. I think a 
script may be triggering something that is hurting performance, but I can't 
think of what it would be.

Running perf top on a random worker right now with -F 19999 shows:

Samples: 485K of event 'cycles', Event count (approx.): 26046568975
Overhead  Shared Object                 Symbol
  34.64%  bro                           [.] BaseList::remove
   3.32%  libtcmalloc.so.4.2.6          [.] operator delete
   3.25%  bro                           [.] PriorityQueue::BubbleDown
   2.31%  bro                           [.] BaseList::remove_nth
   2.05%  libtcmalloc.so.4.2.6          [.] operator new
   1.90%  bro                           [.] Attributes::FindAttr
   1.41%  bro                           [.] Dictionary::NextEntry
   1.27%  libc-2.17.so                  [.] __memcpy_ssse3_back
   0.97%  bro                           [.] StmtList::Exec
   0.87%  bro                           [.] Dictionary::Lookup
   0.85%  bro                           [.] NameExpr::Eval
   0.84%  bro                           [.] BroFunc::Call
   0.80%  libtcmalloc.so.4.2.6          [.] tc_free
   0.77%  libtcmalloc.so.4.2.6          [.] operator delete[]
   0.70%  bro                           [.] ones_complement_checksum
   0.60%  libtcmalloc.so.4.2.6          [.] tcmalloc::ThreadCache::ReleaseToCentralCache
   0.60%  bro                           [.] RecordVal::RecordVal
   0.53%  bro                           [.] UnaryExpr::Eval
   0.51%  bro                           [.] ExprStmt::Exec
   0.51%  bro                           [.] iosource::Manager::FindSoonest
   0.50%  libtcmalloc.so.4.2.6          [.] operator new[]


The entries above sum to 59.2%.

BaseList::remove/BaseList::remove_nth seem particularly easy to optimize. 
Can't that loop be replaced by a memmove?
I think something may be broken if it's being called that much, though.
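
To make the memmove idea concrete, here is a minimal sketch on a stand-in pointer list. The struct, its field names, and both function names are assumptions for illustration, not Bro's actual BaseList. Closing the gap left by the removed element becomes a single memmove instead of an element-by-element copy loop.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a list of raw pointers (not Bro's BaseList).
struct PtrList {
    void** entry;      // backing array
    int num_entries;   // number of slots currently in use
};

// Remove and return the n-th entry, shifting the tail down with one memmove.
void* list_remove_nth(PtrList& l, int n)
{
    if (n < 0 || n >= l.num_entries)
        return nullptr;

    void* old_ent = l.entry[n];
    --l.num_entries;

    // Regions overlap, so memmove (not memcpy) is required.
    std::memmove(l.entry + n, l.entry + n + 1,
                 static_cast<size_t>(l.num_entries - n) * sizeof(void*));

    l.entry[l.num_entries] = nullptr;   // clear the vacated last slot
    return old_ent;
}

// Remove the first entry equal to a. The search is still linear.
void* list_remove(PtrList& l, void* a)
{
    for (int i = 0; i < l.num_entries; ++i)
        if (l.entry[i] == a)
            return list_remove_nth(l, i);
    return nullptr;
}
```

Note that this only speeds up the shift. The linear search in the remove-by-value path is untouched, so if most of the time goes to searching long lists, memmove alone would not change much.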



— 
Justin Azoff


_______________________________________________
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
