On Wed, 04 Jan 2012, Benoit Thiell wrote:
> One problem is that this might not only affect dbquery, but also every
> part of the code that could deal with very big lists or dictionaries.
> What is left now is to determine how much this problem affects
> Invenio and where else the fix needs to be applied.
Reading this thread only now... Interesting findings. I confirm I can
see the problem with Python-2.6 but not with Python-2.7. Hence the
first option:
- What about compiling /opt/python-2.7 on your CentOS servers? This
would take care of all similar `hidden' GC-related issues in a natural
way.
As for the workaround:
- We can wrap run_sql() and friends in code that switches the GC off
and back on. Could you please run a few tests under your site
conditions to see whether this would cover some (or most) of the
repetitive slowness you've seen? (A rough sketch follows after the
examples below.)
E.g. you can run a simple search like `980:astronomy' repeatedly and
re-measure the response times. This search yields lots of results and
does not use intbitsets internally, so run_sql() would be busy
generating lots of tuples and the system would be exercised via
search_unit_in_bibxxx() that way.
E.g. you can also try emptying out your citation dictionaries (by
truncating `rnkCITATIONDATA' and/or storing empty dicts there) and
re-measure. This could help in seeing how far the problems are induced
by the co-existence of big citation dictionaries in memory.
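As for the GC wrapping itself, here is a rough sketch of what I have
in mind (illustrative only: the wrapper name is made up, and in the
real fix the gc.disable()/gc.enable() pair would rather live directly
inside run_sql() in dbquery):

    import gc
    from invenio.dbquery import run_sql

    def run_sql_gc_friendly(sql, param=None, n=0):
        """Call run_sql() with the cyclic garbage collector switched
        off, so that building a huge tuple-of-tuples result does not
        trigger the repeated, expensive GC passes we see on Python-2.6,
        then restore the previous GC state."""
        was_enabled = gc.isenabled()
        if was_enabled:
            gc.disable()
        try:
            return run_sql(sql, param, n)
        finally:
            if was_enabled:
                gc.enable()

Note that it restores the previous GC state, so callers that already
run with the GC disabled are not affected.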
Another thing to consider in parallel:
- We could use a different data structure for representing the
citation dictionaries in Invenio, instead of the native Python
dict-of-lists. Marko did some tests in the past to see how much RAM
could be gained, see:
<http://invenio-software.org/ticket/21>
Some of the data structures tried did not give a very dramatic
RAM-size improvement, so it seemed better to look in the direction of
a standalone WSGI process for Pyro-like handling of citations instead.
But even if changing the data structure would not bring a 10x or so
improvement in memory consumption, it could perhaps still help with
the GC behaviour on big lists.
E.g. we could try to Cythonise the citation dictionary, similarly to
how we Cythonised Numeric vectors into intbitsets, if numarray/numpy
is not enough. (A rough sketch of the dict-of-intbitsets idea follows
below.)
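To make this more concrete, here is a minimal sketch of converting the
native dict-of-lists into a dict of intbitsets (assuming the usual
invenio.intbitset module; the helper name and the exact shape of the
input dictionary are made up for illustration):

    from invenio.intbitset import intbitset

    def compactify_citation_dict(citation_dict):
        """Turn {recid: [citing recids]} into {recid: intbitset(...)},
        so that the citing recids live in C-level bitsets instead of
        many small Python ints and lists, which both shrinks the RAM
        footprint and keeps them out of the cyclic GC's reach."""
        return dict((recid, intbitset(citing_recids))
                    for recid, citing_recids in citation_dict.iteritems())

Whether the savings are big enough would of course have to be
measured, as in the ticket above.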
- We can revive the option of using a standalone WSGI process for all
citation handling, since this would enable Invenio to use numerous
smaller WSGI processes on the front end for answering all the
non-citation requests.
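For the record, a purely illustrative skeleton of such a standalone
citation service (the URL scheme, the JSON output and the module-level
CITATION_DICT are all made up; the real thing would load the
dictionaries from rnkCITATIONDATA at start-up and expose whatever API
bibrank_citation_searcher needs):

    import json

    CITATION_DICT = {}   # to be loaded once at start-up

    def application(environ, start_response):
        """Answer GET /<recid> with a JSON list of records citing <recid>."""
        recid = environ.get('PATH_INFO', '/').strip('/')
        if recid.isdigit():
            cited_by = list(CITATION_DICT.get(int(recid), []))
        else:
            cited_by = []
        body = json.dumps(cited_by)
        start_response('200 OK',
                       [('Content-Type', 'application/json'),
                        ('Content-Length', str(len(body)))])
        return [body]

The front-end WSGI workers would then query this one process instead
of each keeping its own copy of the big dictionaries in memory.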
We can naturally pursue all these tracks in parallel. Even if you opt
for using a locally-maintained Python-2.7, it may be profitable to
micro-optimise our data structures and, in parallel, to go for a
standalone WSGI citation process for better scalability of the
non-citation requests.
P.S. Some of the issues observed in your installation are loosely
related to the problem at hand, e.g. the double-like time stamp
verification in bibrank_citation_searcher is still on my agenda.
But I guess you are testing with the time stamp verification off,
as we did with Giovanni when he was here.
Best regards
--
Tibor Simko