Hi.

I have some more information on this problem. I consulted people on
the Python mailing list, and this is the answer I received:

"In older Pythons the heuristic used to decide when to run the cyclic
garbage collection is not well suited for the creation of many objects
in a row."

The suggestion is to temporarily disable garbage collection while the
list is being created. This also answers Roman's question below about
how a variable in a different scope can matter: building millions of
tuples triggers the collector over and over, and every full collection
traverses all tracked containers, including the 50-million-element
list sitting in memory. Disabling collection during the allocation
burst seems to help a lot:

(ads)[bthiell@adsx ~]$ ipython
Python 2.6.5 (r265:79063, Feb 28 2011, 21:55:45)
IPython 0.11 -- An enhanced Interactive Python.

In [1]: import gc

In [2]: def test():
   ...:     gc.disable()
   ...:     try:
   ...:         return [(i,) for i in range(10**7)]
   ...:     finally:
   ...:         gc.enable()
   ...:

In [3]: %time x1 = test()
CPU times: user 1.54 s, sys: 0.34 s, total: 1.88 s
Wall time: 1.88 s

In [4]: %time x2 = test()
CPU times: user 1.58 s, sys: 0.32 s, total: 1.90 s
Wall time: 1.90 s

In [5]: %time x3 = test()
CPU times: user 1.68 s, sys: 0.36 s, total: 2.05 s
Wall time: 2.04 s

In [6]: big_list = range(50000000)

In [7]: %time x4 = test()
CPU times: user 2.08 s, sys: 0.33 s, total: 2.41 s
Wall time: 2.41 s

In [8]: %time x5 = test()
CPU times: user 4.39 s, sys: 0.30 s, total: 4.69 s
Wall time: 4.68 s

In [9]: %time x6 = test()
CPU times: user 1.58 s, sys: 0.34 s, total: 1.92 s
Wall time: 1.92 s
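
An alternative we could also try, instead of switching the collector
off completely, is to raise the collection thresholds so that full
collections run less often during big allocation bursts. A minimal
sketch (gc.set_threshold() is the standard API, but the value 100000
below is only a guess that I have not benchmarked):

    import gc

    # The defaults are (700, 10, 10): a generation-0 collection runs
    # roughly every 700 container allocations, and every 10th one
    # cascades into an older generation. Raising threshold0 makes the
    # expensive full collections much rarer while the list is built.
    gc.set_threshold(100000, 10, 10)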

It remains to figure out how we can fix this problem in MySQLdb.
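
To make this concrete, here is a minimal sketch of the kind of fix I
have in mind: a small context manager that disables collection only
around the fetch. The fetch_all_rows() helper below is a hypothetical
stand-in for wherever MySQLdb (or our run_sql wrapper) builds the
result list; I have not tested this against MySQLdb itself:

    import gc
    from contextlib import contextmanager

    @contextmanager
    def gc_disabled():
        # Re-enable the collector even if the fetch raises, and only
        # if it was enabled to begin with, so nested uses stay safe.
        was_enabled = gc.isenabled()
        gc.disable()
        try:
            yield
        finally:
            if was_enabled:
                gc.enable()

    # Hypothetical use around the code that builds the big result list:
    def fetch_all_rows(cursor, query):
        cursor.execute(query)
        with gc_disabled():
            return cursor.fetchall()

The try/finally matters: if an exception escapes the fetch, we must
not leave the whole process running with garbage collection disabled.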

Cheers.
Benoit.

On Thu, Dec 22, 2011 at 8:42 PM, Roman Chyla <[email protected]> wrote:
> Hi Benoit,
>
> My Python behaves differently (see below). If I understand correctly,
> you are saying that a variable affects the operation of a function in
> a different scope - but I find that difficult to imagine. How is that
> possible, other than by consuming all available RAM and forcing
> Python to swap?
>
> Note: when I tried the first test, I noticed mysql running at full
> speed even after I closed my ipython session; I had to restart the
> mysql server. Could it be that zombie mysql connections consume
> almost all of your RAM and force swapping after "i=range(5000000)"?
>
>
> In [1]: def run_sql(no):
>   ...:     return [(i,) for i in xrange(no)]
>   ...:
>
> In [2]: def run_sql2(no):
>    ...:     return [(i,) for i in range(no)]
>    ...:
>
> In [4]: def run_sql3():
>    ...:     return [(i,) for i in range(1000000)]
>    ...:
>
> In [6]: %time res = run_sql(1000000)
> CPU times: user 0.11 s, sys: 0.11 s, total: 0.22 s
> Wall time: 0.22 s
>
> In [8]: %time res = run_sql2(1000000)
> CPU times: user 0.21 s, sys: 0.13 s, total: 0.34 s
> Wall time: 0.34 s
>
> In [10]: %time res = run_sql3()
> CPU times: user 0.28 s, sys: 0.01 s, total: 0.29 s
> Wall time: 0.29 s
>
> In [12]: a = range(50000000)
>
> In [13]: %time res = run_sql(1000000)
> CPU times: user 2.25 s, sys: 0.04 s, total: 2.29 s
> Wall time: 2.29 s
>
> In [15]: %time res = run_sql2(1000000)
> CPU times: user 0.28 s, sys: 0.00 s, total: 0.28 s
> Wall time: 0.28 s
>
> In [17]: %time res = run_sql3()
> CPU times: user 0.27 s, sys: 0.00 s, total: 0.27 s
> Wall time: 0.27 s
>
> In [19]: %timeit res=run_sql(1000000)
> 10 loops, best of 3: 135 ms per loop
>
> In [20]: %timeit res=run_sql2(1000000)
> 10 loops, best of 3: 176 ms per loop
>
> In [21]: %timeit res=run_sql3()
> 10 loops, best of 3: 175 ms per loop
>
>
>
> On Fri, Dec 23, 2011 at 12:01 AM, Benoit Thiell <[email protected]> 
> wrote:
>> Hi Roman.
>>
>> Thanks for your test. Here are the latest news:
>>
>> I tried with yet another MySQL bridge (https://launchpad.net/myconnpy)
>> and got very similar results. This makes me think that the problem is
>> not specific to any one Python MySQL bridge but lies somewhere lower
>> level. So, on advice from Alberto, I ran the following test, which
>> emulates a SQL request and returns a list of 1M tuples containing
>> incrementing integers:
>>
>>    In [1]: def run_sql():
>>       ...:     return [(i,) for i in range(1000000)]
>>
>>    In [2]: %time res = run_sql()
>>    CPU times: user 0.94 s, sys: 0.05 s, total: 0.99 s
>>    Wall time: 0.99 s
>>
>>    In [3]: i = range(50000000)
>>
>>    In [4]: %time res = run_sql()
>>    CPU times: user 10.33 s, sys: 0.04 s, total: 10.37 s
>>    Wall time: 10.37 s
>>
>>    In [5]: i = range(50000000)
>>
>>    In [6]: %time res = run_sql()
>>    CPU times: user 11.41 s, sys: 0.00 s, total: 11.41 s
>>    Wall time: 11.41 s
>>
>> You can see that creating a data structure similar to the one
>> returned by run_sql() takes significantly longer when a big list is
>> already in memory. I don't quite understand why the third "SQL
>> request" doesn't take much longer than the second.
>>
>> This is a very serious problem for us because it means that any SQL
>> request will take a long time to return simply because we have two
>> gigantic citation dictionaries loaded in memory. So either we find a
>> way to prevent this slowdown, or we find a way not to load the
>> dictionaries into memory.
>>
>> Benoit.
>>
>> On Thu, Dec 22, 2011 at 4:00 PM, Roman Chyla <[email protected]> wrote:
>>> Hi,
>>> I have tried the same - I could not reproduce it with 1M rows, but
>>> with 10M I could. It started eating too much memory and swap. The
>>> objects are not reclaimed (neither running the garbage collector nor
>>> resetting the objects helps, which could mean the objects live
>>> outside of Python, in the C extension - however, after some time it
>>> got somewhat better; the 'somewhat' isn't very precise, I know :)).
>>> I think it might be a MySQLdb bug.
>>>
>>> In [11]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 9.19 s, sys: 0.19 s, total: 9.38 s
>>> Wall time: 12.55 s
>>>
>>> In [13]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 8.65 s, sys: 0.11 s, total: 8.76 s
>>> Wall time: 11.88 s
>>>
>>> In [15]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 9.30 s, sys: 0.07 s, total: 9.37 s
>>> Wall time: 12.52 s
>>>
>>> In [17]: len(res)
>>> Out[17]: 10000000
>>>
>>> In [18]: i = range(50000000)
>>>
>>> In [19]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 13.21 s, sys: 3.06 s, total: 16.27 s
>>> Wall time: 71.10 s
>>>
>>> In [21]: import gc
>>>
>>> In [22]: gc.collect()
>>> Out[22]: 3
>>>
>>> In [23]: i = []
>>>
>>> In [24]: gc.collect()
>>> Out[24]: 3
>>>
>>> In [25]: gc.isenabled()
>>> Out[25]: True
>>>
>>> In [26]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 14.48 s, sys: 4.38 s, total: 18.86 s
>>> Wall time: 56.63 s
>>>
>>> In [28]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 10.59 s, sys: 1.20 s, total: 11.79 s
>>> Wall time: 24.57 s
>>>
>>> In [30]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 10.28 s, sys: 0.30 s, total: 10.58 s
>>> Wall time: 15.50 s
>>>
>>> In [32]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 8.70 s, sys: 0.09 s, total: 8.79 s
>>> Wall time: 12.17 s
>>>
>>> In [34]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 9.79 s, sys: 0.26 s, total: 10.05 s
>>> Wall time: 14.50 s
>>>
>>> In [36]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 9.91 s, sys: 0.31 s, total: 10.22 s
>>> Wall time: 15.92 s
>>>
>>> In [38]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 8.63 s, sys: 0.11 s, total: 8.74 s
>>> Wall time: 12.05 s
>>>
>>> In [40]: len(i)
>>> Out[40]: 0
>>>
>>> In [41]: i = range(50000000)
>>>
>>> In [42]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 22.97 s, sys: 4.95 s, total: 27.92 s
>>> Wall time: 78.04 s
>>>
>>> In [44]: i = []
>>>
>>> In [45]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 15.02 s, sys: 3.32 s, total: 18.34 s
>>> Wall time: 52.45 s
>>>
>>> In [47]: gc.collect()
>>> Out[47]: 3
>>>
>>> In [48]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>> CPU times: user 15.81 s, sys: 4.08 s, total: 19.89 s
>>> Wall time: 57.72 s
>>>
>>>
>>> roman
>>>
>>> On Thu, Dec 22, 2011 at 2:26 PM, Benoit Thiell <[email protected]> 
>>> wrote:
>>>> Hi Ferran.
>>>>
>>>> Thanks for taking the time to advise on this issue. The query I used
>>>> is not the one that caused the problem originally; I just chose it
>>>> because it returns a large number of results. So I can't really
>>>> change anything on that side.
>>>>
>>>> Also, the query runs fine by itself, without the dictionaries loaded
>>>> in memory. So I think it would be nice to try to get a sense of how
>>>> much large data structures loaded in memory affect large SQL queries.
>>>>
>>>> Benoit.
>>>>
>>>> On Thu, Dec 22, 2011 at 5:26 AM, Ferran Jorba <[email protected]> wrote:
>>>>> Hello Benoit,
>>>>>
>>>>> [...]
>>>>>>     In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>>>>     CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
>>>>>>     Wall time: 2.30 s
>>>>>>
>>>>>> Any idea about why we're seeing this and how we can fix it? It is
>>>>>> quite a big problem for us as our citation dictionaries are so big.
>>>>>
>>>>> I have noticed in more than one case that for some minimally complex
>>>>> (?!) operations the bottleneck is MySQL, not Python, so if you can
>>>>> move part of the manipulation from one to the other you can get
>>>>> surprises.  I cannot remember the exact case, but the equivalent for
>>>>> yours would be changing:
>>>>>
>>>>>  res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>>>
>>>>> to:
>>>>>
>>>>>  res = run_sql("SELECT id_bibrec FROM bibrec_bib03x")
>>>>>  res = res[:1000000]
>>>>>
>>>>> I remember gains of 10x.  YMMV, but you can try it.
>>>>>
>>>>> Ferran
>>>>
>>>>
>>>>
>>>> --
>>>> Benoit Thiell
>>>> The SAO/NASA Astrophysics Data System
>>>> http://adswww.harvard.edu/
>>
>>
>>
>> --
>> Benoit Thiell
>> The SAO/NASA Astrophysics Data System
>> http://adswww.harvard.edu/



-- 
Benoit Thiell
The SAO/NASA Astrophysics Data System
http://adswww.harvard.edu/
