On Wed, 04 Jan 2012, Benoit Thiell wrote:
> One problem is that this might not only affect dbquery, but also every
> part of the code that could deal with very big lists or dictionaries.
> What is left now is to determine how much this problem affects
> Invenio and where else the fix needs to be applied.

Reading this thread only now... Interesting findings.  I confirm I can
see the problem with Python-2.6 but not with Python-2.7.  Hence the
first option:

- What about compiling /opt/python-2.7 on your CentOS servers?  This can
  take care of all `hidden' similar GC-related issues in a natural way.

As for the workaround:

- We can wrap turning GC on/off in run_sql() and friends.  Can you
  please run a few tests under your site conditions to see whether this
  would cover some (or most) of the repetitive slowness you've seen?

  E.g. you can try a simple search like `980:astronomy' repetitively and
  re-measure response times and stuff.  This search yields lots of
  results and it does not use intbitsets internally, so run_sql() would
  be busy generating lots of tuples, and the system would be challenged
  via search_unit_in_bibxxx() that way.

  E.g. you can also try emptying out your citation dictionaries (by
  truncating `rnkCITATIONDATA' and/or storing empty dicts there) and
  re-measure things.  This could help in seeing how far the problems
  are induced by the co-existence of big citation dictionaries in memory.
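  For the GC workaround itself, here is a minimal sketch of the kind
  of wrapper I have in mind; `build_big_result' is a hypothetical
  stand-in for run_sql() fetching many rows, not actual Invenio code:

      import gc
      from functools import wraps

      def gc_paused(func):
          """Disable the cyclic GC while func builds its (possibly
          huge) result, re-enabling it afterwards.  Only re-enable
          if it was enabled before, so nested calls stay safe."""
          @wraps(func)
          def wrapper(*args, **kwargs):
              was_enabled = gc.isenabled()
              if was_enabled:
                  gc.disable()
              try:
                  return func(*args, **kwargs)
              finally:
                  if was_enabled:
                      gc.enable()
          return wrapper

      @gc_paused
      def build_big_result(n):
          # stands in for run_sql() returning a big list of tuples
          return [(i, str(i)) for i in range(n)]

  The point is that appending millions of tuples to a list triggers
  repeated full GC passes on Python-2.6; pausing collection during
  the build and resuming it afterwards avoids that without leaking.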

Another thing to consider in parallel:

- We could use a different data structure for representing citation
  dictionaries in Invenio, instead of Python's native dict-of-lists.
  Marko did some tests in the past to see how much could be gained
  RAM-consumption-wise, see:
  
      <http://invenio-software.org/ticket/21>

  Some of the data structures tried did not give a dramatic improvement
  RAM-size-wise, so it seemed better to look in the direction of a
  standalone WSGI process for Pyro-like handling of citations instead.
  But even if changing the data structure would not help 10x or so in
  memory consumption, it could perhaps help with big-list GC matters.
  
  E.g. we can try to Cythonise the citation dictionary, similarly to how
  we cythonised Numeric vectors into intbitsets, if numarray/numpy is
  not enough.
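  Even before Cythonising, a pure-Python step in that direction would
  be to store each posting list as an array('i') of recids rather
  than a list of boxed ints, so the GC sees one container per key.
  A minimal sketch (class name and API are hypothetical, just for
  illustration):

      from array import array

      class CitationMap(object):
          """Compact citation map: recid -> list of citing recids.
          Each posting list is an array('i'), i.e. a flat C buffer
          of ints instead of thousands of PyObject pointers."""
          def __init__(self):
              self._map = {}

          def add(self, recid, citer):
              postings = self._map.get(recid)
              if postings is None:
                  postings = self._map[recid] = array('i')
              postings.append(citer)

          def get(self, recid):
              # return a plain list for callers expecting one
              return list(self._map.get(recid, array('i')))

  arrays hold unboxed ints and are not tracked by the cyclic GC, so
  both the memory footprint and the number of GC-tracked objects
  shrink, which is exactly what the big-list slowness is about.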

- We can revive the option of using standalone WSGI process for all
  citation handling, since this would enable Invenio to use numerous
  smaller WSGI processes on the front-end for answering all the
  non-citation requests.

We can naturally pursue all these tracks in parallel.  Even if you opt
for using a locally-maintained Python-2.7, it may be profitable to
micro-optimise our data structures, and in parallel to go for a
standalone WSGI citation process for better scalability of
non-citation requests.

P.S. Some of the issues observed in your installation are loosely
     related to the problem at hand, e.g. the double-like time stamp
     verification in bibrank_citation_searcher is still on my agenda.
     But I guess you are testing with the time stamp verification off,
     as we did with Giovanni when he was here.

Best regards
--
Tibor Simko
