Hi Benoit

Since my notebook has limited RAM, I used range(10000000) and 'a' *
20000000 instead of range(50000000) and 'a' * 2000000000. With these I could
reproduce both scenarios on my virtual machine (the Debian Vagrant box provided
by Tibor), with a less pronounced time increase because of the smaller constants.

I then created three files (statement N refers to the IPython input In [N] in
your transcript below; a sketch of test3.py follows the list):
 * test1.py containing statements 1 and 2
 * test2.py containing statements 1, 2, 3 and 4
 * test3.py containing statements 1, 2, 3, 4, 5 and 6
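
For reference, this is roughly what test3.py looks like. It is only a sketch
with my downsized constants; the run_sql import and the bibrec_bib03x table
name are taken from your transcript, so adjust them to match your setup.
test1.py stops after statement 2 and test2.py after statement 4.

    # Sketch of test3.py: statements 1-6 with the downsized list sizes.
    from invenio.dbquery import run_sql                                    # 1
    res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")     # 2
    i = range(10000000)                                                    # 3 (downsized)
    res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")     # 4
    j = range(10000000)                                                    # 5 (downsized)
    res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")     # 6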

I then ran the profiler on each of the three files and compared the output,
for example: python -m cProfile test1.py.
Most of the time is spent in cursors.py, and the average time per call there
grows steadily from test1.py to test3.py.
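
If you dump the profiles to files with the -o option (python -m cProfile -o
test1.prof test1.py, and likewise for the other two), you can compare them
side by side with pstats. A minimal sketch, assuming those file names:

    # Print the cursors.py entries of each saved profile, sorted by cumulative time.
    import pstats

    for name in ("test1.prof", "test2.prof", "test3.prof"):
        print(name)
        stats = pstats.Stats(name)
        stats.sort_stats("cumulative").print_stats("cursors.py")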

Could you run the profiler on your installation? Additional time-consuming
code may show up there. There could also be a newer or alternative
implementation of cursors.py that avoids this issue.

Happy holidays!

Cheers
Patrick

________________________________________
From: [email protected] [[email protected]] on behalf of 
Benoit Thiell [[email protected]]
Sent: Wednesday, December 21, 2011 9:58 PM
To: project-invenio-devel (Invenio developers mailing-list)
Subject: Slow MySQL queries with large data structures in memory

Hi.

We have 10M records and 50M references which translate to very big
citation dictionaries (several ).

When Giovanni was at CERN, he and Tibor discovered that some queries
took much longer to execute with the citation dictionaries
loaded in memory. I've investigated the problem and this is what I've
found so far. It seems to be a general problem that involves only
Python and MySQLdb, not Invenio specifically.

Here is a test showing the problem. We run the same query three times;
it is very basic and returns 1M integers. The first run happens right
after the start of IPython (143M of memory used), the second with a
list of 50M integers in memory (1681M of memory), and the last with
two such lists in memory (2854M of memory).

You can see that the queries take more and more time to execute.

    In [1]: from invenio.dbquery import run_sql

    In [2]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
    CPU times: user 1.44 s, sys: 0.08 s, total: 1.52 s
    Wall time: 1.92 s

    In [3]: i = range(50000000)

    In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
    CPU times: user 11.36 s, sys: 0.07 s, total: 11.43 s
    Wall time: 11.67 s

    In [5]: j = range(50000000)

    In [6]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
    CPU times: user 21.21 s, sys: 0.06 s, total: 21.27 s
    Wall time: 21.54 s

It is interesting to note that if we repeat this experiment using
strings to increase the memory footprint, we do not see the same
slowdown (2051M of memory used):

    In [1]: from invenio.dbquery import run_sql

    In [2]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
    CPU times: user 1.39 s, sys: 0.08 s, total: 1.47 s
    Wall time: 1.77 s

    In [3]: i = 'a' * 2000000000

    In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
    CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
    Wall time: 2.30 s

Any idea why we're seeing this and how we can fix it? It is quite a
big problem for us, as our citation dictionaries are so large.

Cheers.

--
Benoit Thiell
The SAO/NASA Astrophysics Data System
http://adswww.harvard.edu/
