Hi. I have some more information on this problem. I consulted with
people from the Python mailing list, and this is the answer I received:

"In older Pythons the heuristic used to decide when to run the cyclic
garbage collection is not well suited for the creation of many objects
in a row."

The suggestion is to temporarily disable garbage collection while the
list is being created. It seems to help a lot with this issue:

(ads)[bthiell@adsx ~]$ ipython
Python 2.6.5 (r265:79063, Feb 28 2011, 21:55:45)
IPython 0.11 -- An enhanced Interactive Python.

In [1]: import gc

In [2]: def test():
   ...:     gc.disable()
   ...:     try:
   ...:         return [(i,) for i in range(10**7)]
   ...:     finally:
   ...:         gc.enable()
   ...:

In [3]: %time x1 = test()
CPU times: user 1.54 s, sys: 0.34 s, total: 1.88 s
Wall time: 1.88 s

In [4]: %time x2 = test()
CPU times: user 1.58 s, sys: 0.32 s, total: 1.90 s
Wall time: 1.90 s

In [5]: %time x3 = test()
CPU times: user 1.68 s, sys: 0.36 s, total: 2.05 s
Wall time: 2.04 s

In [6]: big_list = range(50000000)

In [7]: %time x4 = test()
CPU times: user 2.08 s, sys: 0.33 s, total: 2.41 s
Wall time: 2.41 s

In [8]: %time x5 = test()
CPU times: user 4.39 s, sys: 0.30 s, total: 4.69 s
Wall time: 4.68 s

In [9]: %time x6 = test()
CPU times: user 1.58 s, sys: 0.34 s, total: 1.92 s
Wall time: 1.92 s

It now remains to figure out how we can fix this problem in MySQLdb.
Cheers.
Benoit.

On Thu, Dec 22, 2011 at 8:42 PM, Roman Chyla <[email protected]> wrote:
> Hi Benoit,
>
> My Python is behaving differently (see below). If I understand
> correctly, you are saying that a variable affects the operation of a
> function in a different scope - but I find that difficult to imagine.
> How is that possible? (other than by consuming all available RAM and
> forcing Python to use swap)
>
> Note that when I tried the first test, I noticed mysql running at full
> speed even after I closed my ipython session. I had to restart the
> mysql server. Could it be that zombie mysql connections consume almost
> all of your RAM and force swapping after "i = range(50000000)"?
>
> In [1]: def run_sql(no):
>    ...:     return [(i,) for i in xrange(no)]
>    ...:
>
> In [2]: def run_sql2(no):
>    ...:     return [(i,) for i in range(no)]
>    ...:
>
> In [4]: def run_sql3():
>    ...:     return [(i,) for i in range(1000000)]
>    ...:
>
> In [6]: %time res = run_sql(1000000)
> CPU times: user 0.11 s, sys: 0.11 s, total: 0.22 s
> Wall time: 0.22 s
>
> In [8]: %time res = run_sql2(1000000)
> CPU times: user 0.21 s, sys: 0.13 s, total: 0.34 s
> Wall time: 0.34 s
>
> In [10]: %time res = run_sql3()
> CPU times: user 0.28 s, sys: 0.01 s, total: 0.29 s
> Wall time: 0.29 s
>
> In [12]: a = range(50000000)
>
> In [13]: %time res = run_sql(1000000)
> CPU times: user 2.25 s, sys: 0.04 s, total: 2.29 s
> Wall time: 2.29 s
>
> In [15]: %time res = run_sql2(1000000)
> CPU times: user 0.28 s, sys: 0.00 s, total: 0.28 s
> Wall time: 0.28 s
>
> In [17]: %time res = run_sql3()
> CPU times: user 0.27 s, sys: 0.00 s, total: 0.27 s
> Wall time: 0.27 s
>
> In [19]: %timeit res = run_sql(1000000)
> 10 loops, best of 3: 135 ms per loop
>
> In [20]: %timeit res = run_sql2(1000000)
> 10 loops, best of 3: 176 ms per loop
>
> In [21]: %timeit res = run_sql3()
> 10 loops, best of 3: 175 ms per loop
>
> On Fri, Dec 23, 2011 at 12:01 AM, Benoit Thiell <[email protected]> wrote:
>> Hi Roman.
>>
>> Thanks for your test. Here is the latest news:
>>
>> I tried with yet another MySQL bridge (https://launchpad.net/myconnpy)
>> and got very similar results. This makes me think that it is not a
>> problem specific to a particular Python MySQL bridge, but maybe
>> something lower level.
>>
>> So, on advice from Alberto, I ran the following test, which emulates
>> an SQL request and returns a list of 1M tuples containing incrementing
>> integers:
>>
>> In [1]: def run_sql():
>>    ...:     return [(i,) for i in range(1000000)]
>>
>> In [2]: %time res = run_sql()
>> CPU times: user 0.94 s, sys: 0.05 s, total: 0.99 s
>> Wall time: 0.99 s
>>
>> In [3]: i = range(50000000)
>>
>> In [4]: %time res = run_sql()
>> CPU times: user 10.33 s, sys: 0.04 s, total: 10.37 s
>> Wall time: 10.37 s
>>
>> In [5]: i = range(50000000)
>>
>> In [6]: %time res = run_sql()
>> CPU times: user 11.41 s, sys: 0.00 s, total: 11.41 s
>> Wall time: 11.41 s
>>
>> You can see that the creation of a data structure similar to the one
>> returned by run_sql() takes significantly longer when a big list is
>> already in memory. I don't quite understand why the third "SQL
>> request" doesn't take much longer than the second.
>>
>> This is a very serious problem for us because it means that any SQL
>> request will take a long time to return just because we have two
>> gigantic citation dictionaries loaded in memory. So either we find a
>> way to prevent this problem, or we need to find a way to not load the
>> dictionaries in memory.
>>
>> Benoit.
>>
>> On Thu, Dec 22, 2011 at 4:00 PM, Roman Chyla <[email protected]> wrote:
>>> Hi,
>>>
>>> I have tried the same - I could not reproduce it with 1M rows, but
>>> with 10M I could. It started eating too much memory and swap. The
>>> objects are not reclaimed (no garbage collection helps, nor does
>>> resetting the objects, which could mean the objects live outside of
>>> Python, in the C extension - however, after some time it got somewhat
>>> better; the 'somewhat' isn't very precise, I know :)). I think it
>>> might be a MySQLdb bug.
>>>
>>> In [11]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 9.19 s, sys: 0.19 s, total: 9.38 s
>>> Wall time: 12.55 s
>>>
>>> In [13]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 8.65 s, sys: 0.11 s, total: 8.76 s
>>> Wall time: 11.88 s
>>>
>>> In [15]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 9.30 s, sys: 0.07 s, total: 9.37 s
>>> Wall time: 12.52 s
>>>
>>> In [17]: len(res)
>>> Out[17]: 10000000
>>>
>>> In [18]: i = range(50000000)
>>>
>>> In [19]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 13.21 s, sys: 3.06 s, total: 16.27 s
>>> Wall time: 71.10 s
>>>
>>> In [21]: import gc
>>>
>>> In [22]: gc.collect()
>>> Out[22]: 3
>>>
>>> In [23]: i = []
>>>
>>> In [24]: gc.collect()
>>> Out[24]: 3
>>>
>>> In [25]: gc.isenabled()
>>> Out[25]: True
>>>
>>> In [26]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 14.48 s, sys: 4.38 s, total: 18.86 s
>>> Wall time: 56.63 s
>>>
>>> In [28]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 10.59 s, sys: 1.20 s, total: 11.79 s
>>> Wall time: 24.57 s
>>>
>>> In [30]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 10.28 s, sys: 0.30 s, total: 10.58 s
>>> Wall time: 15.50 s
>>>
>>> In [32]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 8.70 s, sys: 0.09 s, total: 8.79 s
>>> Wall time: 12.17 s
>>>
>>> In [34]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 9.79 s, sys: 0.26 s, total: 10.05 s
>>> Wall time: 14.50 s
>>>
>>> In [36]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 9.91 s, sys: 0.31 s, total: 10.22 s
>>> Wall time: 15.92 s
>>>
>>> In [38]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 8.63 s, sys: 0.11 s, total: 8.74 s
>>> Wall time: 12.05 s
>>>
>>> In [40]: len(i)
>>> Out[40]: 0
>>>
>>> In [41]: i = range(50000000)
>>>
>>> In [42]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 22.97 s, sys: 4.95 s, total: 27.92 s
>>> Wall time: 78.04 s
>>>
>>> In [44]: i = []
>>>
>>> In [45]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 15.02 s, sys: 3.32 s, total: 18.34 s
>>> Wall time: 52.45 s
>>>
>>> In [47]: gc.collect()
>>> Out[47]: 3
>>>
>>> In [48]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 15.81 s, sys: 4.08 s, total: 19.89 s
>>> Wall time: 57.72 s
>>>
>>> roman
>>>
>>> On Thu, Dec 22, 2011 at 2:26 PM, Benoit Thiell <[email protected]> wrote:
>>>> Hi Jorban.
>>>>
>>>> Thanks for taking the time to advise on this issue. The query I used
>>>> is not the one that caused the problem originally; I just chose it
>>>> because it returns a large number of results. So I can't really
>>>> change anything on this side.
>>>>
>>>> Also, the query runs fine by itself, without the dictionaries loaded
>>>> in memory. So I think it would be nice to try and get a sense of how
>>>> much big data structures loaded in memory impact large SQL queries.
>>>>
>>>> Benoit.
>>>>
>>>> On Thu, Dec 22, 2011 at 5:26 AM, Ferran Jorba <[email protected]> wrote:
>>>>> Hello Benoit,
>>>>>
>>>>> [...]
>>>>>> In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x
>>>>>> LIMIT 1000000")
>>>>>> CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
>>>>>> Wall time: 2.30 s
>>>>>>
>>>>>> Any idea about why we're seeing this and how we can fix it? It is
>>>>>> quite a big problem for us as our citation dictionaries are so big.
>>>>>
>>>>> I have noticed in more than one case that for some minimally complex
>>>>> (?!) operations the bottleneck is MySQL, not Python, so if you can
>>>>> move part of the manipulation from one to the other you may be in
>>>>> for a surprise. I cannot remember the exact case, but the equivalent
>>>>> in your case would be changing:
>>>>>
>>>>> res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>>>
>>>>> to:
>>>>>
>>>>> res = run_sql("SELECT id_bibrec FROM bibrec_bib03x")
>>>>> res = res[:1000000]
>>>>>
>>>>> I remember gains of 10x. YMMV, but you can try it.
>>>>>
>>>>> Ferran
>>>>
>>>> --
>>>> Benoit Thiell
>>>> The SAO/NASA Astrophysics Data System
>>>> http://adswww.harvard.edu/
>>
>> --
>> Benoit Thiell
>> The SAO/NASA Astrophysics Data System
>> http://adswww.harvard.edu/

--
Benoit Thiell
The SAO/NASA Astrophysics Data System
http://adswww.harvard.edu/

