#21: BibRank: micro-optimize citation dict memory footprint
-------------------------+--------------------------------------------------
 Reporter:  simko        |       Owner:     
     Type:  enhancement  |      Status:  new
 Priority:  major        |   Milestone:     
Component:  BibRank      |     Version:     
 Keywords:               |  
-------------------------+--------------------------------------------------
 The citation dictionary is cached inside each WSGI Invenio daemon
 process for speed.  For the demo site, it looks like this:

 {{{
 {18: [96],
  74: [92],
  77: [85, 86],
  78: [79, 91],
  79: [91],
  81: [82, 83, 87, 89],
  84: [85, 88, 91],
  91: [92],
  94: [80],
  95: [77, 86]}
 }}}

 For bigger sites containing 1M records and having fuller citation
 maps, this dictionary can get quite big, e.g. WSGI daemon processes of
 the INSPIRE instance eat about 1 GB of RAM.

 It would be good to decrease the memory footprint of this citation
 dictionary, especially since we are running on a 64-bit OS, where we
 may easily consume more bytes to store list elements (of `unsigned
 mediumint' type) than necessary.

 We should investigate potential local replacements for the list
 structure, for example using {{{numpy.array}}}.  We can measure the
 memory footprint of various data structures via {{{sys.getsizeof()}}}
 or via {{{ps auxw}}} process sizes, aiming to find a more memory
 optimized, yet still fast enough, data structure to represent the
 citation dict.
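
 As a starting point for such measurements, here is a minimal sketch
 (Python 3) that compares a dict of lists against a dict of packed
 arrays via {{{sys.getsizeof()}}}.  It uses the standard-library
 {{{array}}} module as a stand-in for {{{numpy.array}}} so that it
 runs without extra dependencies; the helper names and the synthetic
 sample data are assumptions made for illustration, not Invenio code.

```python
import sys
from array import array


def dict_of_lists_bytes(cites):
    """Approximate size of {recid: [recid, ...]} in bytes.

    Counts the dict, each key, each list, and each int object referenced
    by a list.  Int objects shared between lists are counted once per
    reference, so this is an upper bound.
    """
    total = sys.getsizeof(cites)
    for recid, citers in cites.items():
        total += sys.getsizeof(recid) + sys.getsizeof(citers)
        total += sum(sys.getsizeof(c) for c in citers)
    return total


def dict_of_arrays_bytes(cites):
    """Approximate size of {recid: array('I', ...)} in bytes.

    The array stores raw C integers in its own buffer, so there are no
    per-element Python objects to add up.
    """
    total = sys.getsizeof(cites)
    for recid, citers in cites.items():
        total += sys.getsizeof(recid) + sys.getsizeof(citers)
    return total


def to_arrays(cites):
    """Convert list values to packed arrays of C unsigned ints.

    Typecode 'I' is 4 bytes per element on common platforms.
    """
    return {recid: array("I", citers) for recid, citers in cites.items()}


# Synthetic data (an assumption): 1000 records, 50 citers each.
sample = {
    recid: list(range(recid + 1000000, recid + 1000050))
    for recid in range(1, 1001)
}
packed = to_arrays(sample)

list_bytes = dict_of_lists_bytes(sample)
array_bytes = dict_of_arrays_bytes(packed)
print("dict of lists:  %d bytes" % list_bytes)
print("dict of arrays: %d bytes" % array_bytes)
```

 The figures from {{{sys.getsizeof()}}} ignore allocator overhead, so
 they should be cross-checked against real process sizes from
 {{{ps auxw}}} before drawing conclusions.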

 If needed, we can even create a dedicated intbitset-like C extension
 that would be capable of storing recID vectors in a memory-efficient
 way.  This is arguably the best micro-optimization technique that we
 could go for, although it would be a bit more work than reusing
 {{{numpy.array}}} or some other such pre-existing module.
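
 To illustrate the kind of storage layout such an extension might use,
 here is a pure-Python sketch (Python 3) that packs each recID into 3
 bytes, matching the `unsigned mediumint' range.  The class name
 {{{PackedRecIDs}}} and its interface are hypothetical; a real C
 extension would differ and would be considerably faster.

```python
class PackedRecIDs:
    """Hypothetical read-only vector of recIDs stored in 3 bytes each."""

    WIDTH = 3                  # bytes per recID
    MAX_RECID = 2 ** 24 - 1    # largest value that fits in 3 bytes

    def __init__(self, recids):
        buf = bytearray()
        for recid in recids:
            if not 0 <= recid <= self.MAX_RECID:
                raise ValueError("recID %d does not fit in 3 bytes" % recid)
            buf += recid.to_bytes(self.WIDTH, "little")
        self._buf = bytes(buf)

    def __len__(self):
        return len(self._buf) // self.WIDTH

    def __getitem__(self, index):
        # Integer indices only; slicing is not supported in this sketch.
        size = len(self)
        if not -size <= index < size:
            raise IndexError(index)
        if index < 0:
            index += size
        start = index * self.WIDTH
        return int.from_bytes(self._buf[start:start + self.WIDTH], "little")

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


# Usage with one entry of the demo-site dictionary: 81: [82, 83, 87, 89]
citers = PackedRecIDs([82, 83, 87, 89])
print(list(citers), len(citers))
```

 The payload here is exactly 3 bytes per recID, compared with an
 8-byte pointer plus a separate int object per element in a plain
 list on a 64-bit build.  Per-object overhead still applies, so the
 gain matters mostly for long citation lists.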

 Note that this task is a micro-optimization only: it keeps the
 overall citation indexer and searcher machinery unchanged and changes
 only its internal data structures.  Tests will show how much such a
 micro-optimization is worth.  The overall rethinking
 of the citation dictionary handling and the inherent memory sharing
 procedures would be another task, see some older musings at
 [https://twiki.cern.ch/twiki/bin/view/CDS/InvenioScalability].

-- 
Ticket URL: <http://cdswaredev.cern.ch/invenio/ticket/21>
CDS Invenio <http://cdswaredev.cern.ch/invenio>
