Hi Benoit,

Since my notebook has limited RAM, I used range(10000000) and 'a' * 20000000 instead of range(50000000) and 'a' * 2000000000. I could reproduce both scenarios on my virtual machine (the Debian vagrant box provided by Tibor), with a less pronounced time increase due to the smaller constants.
I then created three files:

* test1.py containing statements 1 and 2
* test2.py containing statements 1, 2, 3 and 4
* test3.py containing statements 1, 2, 3, 4, 5 and 6

Subsequently I ran the profiler on all three files and compared the output: python -m cProfile test1.py. The most significant amount of time is spent in cursors.py, with an increasing average time per function call going from test1.py to test3.py. Run the profiler on your installation; maybe there is additional time-consuming code there. There could also be newer or alternative implementations of cursors.py that eradicate this issue.

Happy holidays!

Cheers,
Patrick

________________________________________
From: [email protected] [[email protected]] on behalf of Benoit Thiell [[email protected]]
Sent: Wednesday, December 21, 2011 9:58 PM
To: project-invenio-devel (Invenio developers mailing-list)
Subject: Slow MySQL queries with large data structures in memory

Hi.

We have 10M records and 50M references, which translate into very big citation dictionaries (several ). When Giovanni was at CERN, he and Tibor discovered that some queries took much longer to execute with the citation dictionaries loaded in memory. I've investigated the problem and this is what I've found so far. It seems to be a general problem that does not specifically involve Invenio, only Python and MySQLdb.

Here is a test showing the problem. We run the same query three times. It's very basic and returns 1M integers. The first run is right after the start of IPython (143M of memory used), the second with a list of 50M integers in memory (1681M of memory), and the last with two such lists (2854M of memory). You can see that the queries take more and more time to execute.
In [1]: from invenio.dbquery import run_sql

In [2]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
CPU times: user 1.44 s, sys: 0.08 s, total: 1.52 s
Wall time: 1.92 s

In [3]: i = range(50000000)

In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
CPU times: user 11.36 s, sys: 0.07 s, total: 11.43 s
Wall time: 11.67 s

In [5]: j = range(50000000)

In [6]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
CPU times: user 21.21 s, sys: 0.06 s, total: 21.27 s
Wall time: 21.54 s

It is interesting to notice that if we retry this experiment using strings to increase the memory footprint, we don't get the same results (2051M of memory used):

In [1]: from invenio.dbquery import run_sql

In [2]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
CPU times: user 1.39 s, sys: 0.08 s, total: 1.47 s
Wall time: 1.77 s

In [3]: i = 'a' * 2000000000

In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
Wall time: 2.30 s

Any idea about why we're seeing this and how we can fix it? It is quite a big problem for us as our citation dictionaries are so big.

Cheers.

--
Benoit Thiell
The SAO/NASA Astrophysics Data System
http://adswww.harvard.edu/
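The thread leaves Benoit's question open. One hypothesis that fits both experiments above, though it is not confirmed anywhere in the thread, is CPython's cyclic garbage collector: each collection traverses every live container it tracks, so a 50M-element list makes allocation-heavy work (such as MySQLdb building one tuple per fetched row in cursors.py) pay for repeated traversals, while one huge string is a single object and costs the collector almost nothing. A minimal, MySQL-free sketch of that effect, with all sizes scaled down and the helper names purely illustrative:

```python
# Hypothesis (not confirmed in the thread): gc traversals of a large
# live list slow down allocation-heavy code such as building row tuples.
import gc
import time

def build_rows(n):
    """Mimic a DB-API fetchall(): allocate one small tuple per row."""
    return [(i,) for i in range(n)]

def timed_build(n):
    start = time.time()
    rows = build_rows(n)
    elapsed = time.time() - start
    del rows
    return elapsed

baseline = timed_build(500000)

ballast = list(range(5000000))   # scaled-down analogue of range(50000000)
with_ballast = timed_build(500000)

gc.disable()                     # take the collector out of the picture
try:
    no_gc = timed_build(500000)
finally:
    gc.enable()

print("baseline      %.3fs" % baseline)
print("with ballast  %.3fs" % with_ballast)
print("gc disabled   %.3fs" % no_gc)
```

If the hypothesis holds, the "with ballast" run should be visibly slower than the other two, and wrapping the query in gc.disable()/gc.enable() would be one mitigation worth testing against run_sql itself.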

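Patrick's three-file comparison can also be driven from a single script. The sketch below profiles the same fetch-like workload while more and more ballast objects are alive and prints the totals side by side; the workload function and its sizes are illustrative stand-ins, since the real runs were "python -m cProfile testN.py" against run_sql().

```python
# Sketch of comparing profiler runs as the in-memory ballast grows,
# in the spirit of test1.py/test2.py/test3.py (names illustrative).
import cProfile
import pstats

def workload(ballast_size):
    """Stand-in for a testN.py: build ballast, then fetch-like tuples."""
    ballast = list(range(ballast_size))    # analogue of statements 3-6
    rows = [(i,) for i in range(200000)]   # analogue of the SQL fetch
    return len(ballast), len(rows)

for n in (0, 1000000, 2000000):
    profiler = cProfile.Profile()
    profiler.runcall(workload, n)
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    print("ballast=%d  total=%.3fs  calls=%d"
          % (n, stats.total_tt, stats.total_calls))
```

For the real installation, stats.print_stats() on each run would show whether the extra time concentrates in cursors.py, as Patrick's manual comparison suggests.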
