Hi Tibor.

On Thu, Jan 5, 2012 at 7:54 AM, Tibor Simko <[email protected]> wrote:
> On Wed, 04 Jan 2012, Benoit Thiell wrote:
>> One problem is that this might not only affect dbquery, but also every
>> part of the code that could deal with very big lists or dictionaries.
>> What is left now is to determine how much this problem affects
>> Invenio and where else the fix needs to be applied.
>
> Reading this thread only now... Interesting findings. I confirm I can
> see the problem with Python-2.6 but not with Python-2.7. Hence the
> first option:
>
> - What about compiling /opt/python-2.7 on your CentOS servers? This can
>   take care of all `hidden' similar GC-related issues in a natural way.
This would be awesome, but we can't really use our own compiled Python
version in production, so we have to make the best of what we have at
hand.

> As for the workaround:
>
> - We can wrap turning GC on/off in run_sql() and friends. Can you
>   please make a few tests under your site conditions whether this would
>   cover some (or most) of the repetitive slowness you've seen?

The patch that I attached to this email shows a possible implementation
of this (a rough sketch of the idea also appears further down in this
message).

> E.g. you can try a simple search like `980:astronomy' repetitively and
> re-measure response times and stuff. This search yields lots of
> results and it does not use intbitsets internally, so run_sql() would
> be busy generating lots of tuples, and the system would be challenged
> via search_unit_in_bibxxx() that way.

Here is a table that summarizes the results I got while testing
search_unit_in_bibxxx() with different queries, with or without garbage
collection and with or without the citation dictionaries loaded. The
values are in μs/record.

+-----------------+-----------+-----------+-----------+-----------+
| Query           | EPRINT    | GENERAL   | ASTRONOMY | PHYSICS   |
+-----------------+-----------+-----------+-----------+-----------+
| # of results    |   705,000 |   910,000 | 1,850,000 | 6,090,000 |
+---------+-------+-----------+-----------+-----------+-----------+
|         | gc    |      2.86 |      2.03 |      2.35 |      3.47 |
| no dict +-------+-----------+-----------+-----------+-----------+
|         | no gc |      2.51 |      1.71 |      1.61 |      1.48 |
+---------+-------+-----------+-----------+-----------+-----------+
|         | gc    |     30.02 |     29.81 |     29.57 |     30.57 |
|  dict   +-------+-----------+-----------+-----------+-----------+
|         | no gc |      2.58 |      2.13 |      1.78 |      1.49 |
+---------+-------+-----------+-----------+-----------+-----------+

(The script I used to test this is attached to this email along with a
patch that makes the modifications to dbquery and search_engine.)

As you can see, turning off the garbage collection always results in
faster queries, and the improvement is more noticeable for large query
results. So as far as we're concerned, this workaround looks like the
way to go. It might be interesting to run similar tests for an
Inspire-like Invenio instance.

> - We can revive the option of using standalone WSGI process for all
>   citation handling, since this would enable Invenio to use numerous
>   smaller WSGI processes on the front-end for answering all the
>   non-citation requests.

This looks like a promising option to me. Is there any literature on
this that would give me an idea of how it is supposed to work?

> We can naturally pursue all these tracks in parallel. Even if you'd opt
> out for using locally-maintained Python-2.7, it may be profitable to
> micro-optimise our data structures, and in parallel to go for standalone
> WSGI citation process for better scalability of non-citation requests.

I don't think that we can use Python 2.7 other than for experimenting.
Our current plan is to use CentOS 6 with Python 2.6 in production. I
agree that the data structures need to be optimized: the memory usage
has already proven problematic several times.

> P.S. Some of the issues observed in your installation are loosely
> related to the problem at hand, e.g. the double-like time stamp
> verification in bibrank_citation_searcher is still on my agenda.
> But I guess you are testing with the time stamp verification off,
> as we did with Giovanni when he was here.

Indeed, I am always testing with the timestamp verification off.
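For clarity, here is a minimal sketch of the GC wrapping idea mentioned
above. This is not the attached patch: the gc_paused() helper and the
disable_gc keyword shown in the comment are only illustrative.

import gc
from contextlib import contextmanager

@contextmanager
def gc_paused(active=True):
    """Temporarily disable the cyclic GC, restoring its previous state."""
    was_enabled = gc.isenabled()
    if active and was_enabled:
        gc.disable()
    try:
        yield
    finally:
        if active and was_enabled:
            gc.enable()

# Hypothetical use inside dbquery:
#
#     def run_sql(sql, param=None, n=0, with_desc=0, disable_gc=True):
#         with gc_paused(disable_gc):
#             ... build and return the list of result tuples ...

The point is simply that the cyclic collector is paused only while the
large list of result tuples is being built, and its previous state is
always restored, so callers outside run_sql() are not affected.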
One other thing I tried was to store the citation dictionaries in a file
rather than in MySQL, as I fail to understand what the advantage of the
database storage is. The timestamp verification would then be much
faster, since it would only require reading the modification date of the
citation dictionary file(s). (A rough sketch of the idea is in the
P.S. below.)

Cheers.

-- 
Benoit Thiell
The SAO/NASA Astrophysics Data System
http://adswww.harvard.edu/
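P.S. For illustration only, this is roughly what the file-based storage
with an mtime-based freshness check could look like. The path,
CFG_CITATION_DICT_PATH and get_citation_dictionaries() are made-up names
for the sake of the example, not existing Invenio code.

import os
import cPickle

CFG_CITATION_DICT_PATH = '/opt/invenio/var/data/citation_dict.pickle'

_cache = {'mtime': None, 'dicts': None}

def get_citation_dictionaries():
    """Return the citation dictionaries, reloading them from disk only
    when the pickle file has changed since the last load."""
    mtime = os.path.getmtime(CFG_CITATION_DICT_PATH)
    if _cache['dicts'] is None or mtime != _cache['mtime']:
        f = open(CFG_CITATION_DICT_PATH, 'rb')
        try:
            _cache['dicts'] = cPickle.load(f)
        finally:
            f.close()
        _cache['mtime'] = mtime
    return _cache['dicts']

The freshness check is then a single stat() call per request instead of
a database round trip.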
0001-Add-the-option-to-turn-the-gc-in-run_sql.patch
Description: Binary data
# -*- coding: utf-8 -*-
"""Measure search_unit_in_bibxxx() timings with and without the cyclic
garbage collector, before and after loading the citation dictionaries."""

import time

import invenio.search_engine as s
import invenio.bibrank_citation_searcher as bcs


def avg(iterable):
    return sum(iterable) / len(iterable)


def execute_single_test(query, disable_gc, n):
    """Run the 980__a search n times and print the average timings."""
    exec_times = []
    for _ in range(n):
        cur_time = time.time()
        res = s.search_unit_in_bibxxx(query, '980__a', None,
                                      disable_gc=disable_gc)
        exec_times.append(time.time() - cur_time)
    exec_time = avg(exec_times)
    print ' * With%s gc: %.2f s (%.2f μs/record)' % (
        disable_gc and 'out' or '',
        exec_time,
        exec_time / len(res) * 10**6)


def main(query='ASTRONOMY'):
    print '\nQuery: 980__a:%s\n' % query
    print 'Without citation dictionaries:'
    execute_single_test(query, False, 3)
    execute_single_test(query, True, 3)
    # Load the citation dictionaries to show how much they slow down the
    # searches when the garbage collector is left on.
    bcs.load_citation_dictionaries()
    print '\nWith citation dictionaries:'
    execute_single_test(query, False, 3)
    execute_single_test(query, True, 3)


if __name__ == '__main__':
    import sys
    # Take the collection name from the command line, keeping the default
    # when the script is called without arguments.
    query = sys.argv[1] if len(sys.argv) > 1 else 'ASTRONOMY'
    main(query)
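(To reproduce the numbers in the table: run the script in the Invenio
Python environment and pass one of the collection names, e.g. GENERAL,
as the command-line argument; each configuration is timed over three
runs and the average time per search and per record is printed.)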

