Hi.

Here is a little more information on this problem, and the implications
it might have for a big Invenio instance.

The problem with the garbage collector being slow when creating large
lists of tuples has already been reported as a Python issue:
http://bugs.python.org/issue4074. The problem itself has been fixed in
Python 2.7 (see point 2 of
http://docs.python.org/whatsnew/2.7.html#optimizations). I fear that
we're stuck with it for versions of Python older than 2.7.

The fix is relatively easy, as it only requires turning garbage
collection off while the big list of tuples is being created. This can
be done at several levels (a sketch of the first option follows the list):
* in dbquery.run_sql, around fetchall and fetchmany;
* in MySQLdb, in the method _fetch_row of the class BaseCursor.
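
To make the dbquery-level option concrete, here is a minimal sketch,
assuming run_sql has access to the MySQLdb cursor that executed the
query; gc_disabled() and fetch_all_rows() are illustrative names, not
existing Invenio functions:

    import gc
    from contextlib import contextmanager

    @contextmanager
    def gc_disabled():
        # Switch the cyclic garbage collector off for the duration of
        # the block, and only switch it back on if it was on before,
        # so nested uses do not re-enable it behind a caller's back.
        was_enabled = gc.isenabled()
        gc.disable()
        try:
            yield
        finally:
            if was_enabled:
                gc.enable()

    def fetch_all_rows(cursor):
        # 'cursor' stands for the MySQLdb cursor on which run_sql has
        # already executed the query. The big list of result tuples is
        # built inside fetchall(), so that is the part we shield from
        # the collector.
        with gc_disabled():
            return cursor.fetchall()

The same guard could instead be wrapped around _fetch_row in MySQLdb's
BaseCursor; either way, the point is that the list of result tuples is
built while the collector is off, which is what the test() function
quoted below demonstrates.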

One problem is that this might not only affect dbquery, but also every
part of the code that could deal with very big lists or dictionaries.
What is left now is to determine how much this problem affects
Invenio and where else the fix needs to be applied.

Benoit.

On Wed, Jan 4, 2012 at 9:55 AM, Benoit Thiell <[email protected]> wrote:
> Hi.
>
> I have some more information on this problem. I consulted with people
> from the Python mailing-list and this is the answer I received:
>
> "In older Pythons the heuristic used to decide when to run the cyclic
> garbage collection is not well suited for the creation of many objects
> in a row."
>
> The suggestion is to temporarily disable garbage collection for the
> creation of the list. It seems to help a lot with this issue:
>
> (ads)[bthiell@adsx ~]$ ipython
> Python 2.6.5 (r265:79063, Feb 28 2011, 21:55:45)
> IPython 0.11 -- An enhanced Interactive Python.
>
> In [1]: import gc
>
> In [2]: def test():
>   ...:     gc.disable()
>   ...:     try:
>   ...:         return [(i,) for i in range(10**7)]
>   ...:     finally:
>   ...:         gc.enable()
>   ...:
>
> In [3]: %time x1 = test()
> CPU times: user 1.54 s, sys: 0.34 s, total: 1.88 s
> Wall time: 1.88 s
>
> In [4]: %time x2 = test()
> CPU times: user 1.58 s, sys: 0.32 s, total: 1.90 s
> Wall time: 1.90 s
>
> In [5]: %time x3 = test()
> CPU times: user 1.68 s, sys: 0.36 s, total: 2.05 s
> Wall time: 2.04 s
>
> In [6]: big_list = range(50000000)
>
> In [7]: %time x4 = test()
> CPU times: user 2.08 s, sys: 0.33 s, total: 2.41 s
> Wall time: 2.41 s
>
> In [8]: %time x5 = test()
> CPU times: user 4.39 s, sys: 0.30 s, total: 4.69 s
> Wall time: 4.68 s
>
> In [9]: %time x6 = test()
> CPU times: user 1.58 s, sys: 0.34 s, total: 1.92 s
> Wall time: 1.92 s
>
> What remains now is to figure out how we can fix this problem in MySQLdb.
>
> Cheers.
> Benoit.
>
> On Thu, Dec 22, 2011 at 8:42 PM, Roman Chyla <[email protected]> wrote:
>> Hi Benoit,
>>
>> My Python is behaving differently (see below). If I understand
>> correctly, you are saying that a variable affects the operation of a
>> function in a different scope - but I find that difficult to imagine.
>> How is it possible? (other than by consuming all the available RAM and
>> forcing Python to swap).
>>
>> Note: when I tried the first test, I noticed mysql running at full speed
>> even after I had closed my ipython session. I had to restart the mysql
>> server. Could it be that zombie mysql connections consume almost all
>> of your RAM and force swapping after "i=range(5000000)"?
>>
>>
>> In [1]: def run_sql(no):
>>   ...:     return [(i,) for i in xrange(no)]
>>   ...:
>>
>> In [2]: def run_sql2(no):
>>   ...:     return [(i,) for i in range(no)]
>>   ...:
>>
>> In [4]: def run_sql3():
>>   ...:     return [(i,) for i in range(1000000)]
>>   ...:
>>
>> In [6]: %time res = run_sql(1000000)
>> CPU times: user 0.11 s, sys: 0.11 s, total: 0.22 s
>> Wall time: 0.22 s
>>
>> In [8]: %time res = run_sql2(1000000)
>> CPU times: user 0.21 s, sys: 0.13 s, total: 0.34 s
>> Wall time: 0.34 s
>>
>> In [10]: %time res = run_sql3()
>> CPU times: user 0.28 s, sys: 0.01 s, total: 0.29 s
>> Wall time: 0.29 s
>>
>> In [12]: a = range(50000000)
>>
>> In [13]: %time res = run_sql(1000000)
>> CPU times: user 2.25 s, sys: 0.04 s, total: 2.29 s
>> Wall time: 2.29 s
>>
>> In [15]: %time res = run_sql2(1000000)
>> CPU times: user 0.28 s, sys: 0.00 s, total: 0.28 s
>> Wall time: 0.28 s
>>
>> In [17]: %time res = run_sql3()
>> CPU times: user 0.27 s, sys: 0.00 s, total: 0.27 s
>> Wall time: 0.27 s
>>
>> In [19]: %timeit res=run_sql(1000000)
>> 10 loops, best of 3: 135 ms per loop
>>
>> In [20]: %timeit res=run_sql2(1000000)
>> 10 loops, best of 3: 176 ms per loop
>>
>> In [21]: %timeit res=run_sql3()
>> 10 loops, best of 3: 175 ms per loop
>>
>>
>>
>> On Fri, Dec 23, 2011 at 12:01 AM, Benoit Thiell <[email protected]> 
>> wrote:
>>> Hi Roman.
>>>
>>> Thanks for your test. Here are the latest news:
>>>
>>> I tried with yet another MySQL bridge (https://launchpad.net/myconnpy)
>>> and got very similar results. This makes me think that it is not a
>>> problem that is specific to a Python MySQL bridge but maybe something
>>> lower level. So, on Alberto's advice, I ran the following test
>>> which emulates a SQL request and returns a list of 1M tuples
>>> containing incrementing integers:
>>>
>>>    In [1]: def run_sql():
>>>       ...:     return [(i,) for i in range(1000000)]
>>>
>>>    In [2]: %time res = run_sql()
>>>    CPU times: user 0.94 s, sys: 0.05 s, total: 0.99 s
>>>    Wall time: 0.99 s
>>>
>>>    In [3]: i = range(50000000)
>>>
>>>    In [4]: %time res = run_sql()
>>>    CPU times: user 10.33 s, sys: 0.04 s, total: 10.37 s
>>>    Wall time: 10.37 s
>>>
>>>    In [5]: i = range(50000000)
>>>
>>>    In [6]: %time res = run_sql()
>>>    CPU times: user 11.41 s, sys: 0.00 s, total: 11.41 s
>>>    Wall time: 11.41 s
>>>
>>> You can see that the creation of a data structure similar to the one
>>> returned by run_sql() takes significantly longer when a big list is
>>> already in memory. I don't quite understand why the third "SQL
>>> request" doesn't take much longer than the second.
>>>
>>> This is a very serious problem for us because it means that any SQL
>>> request will take a long time to return just because we have two
>>> gigantic citation dictionaries loaded in memory. So either we find a
>>> way to prevent this problem, or we need to find a way to avoid loading
>>> the dictionaries in memory.
>>>
>>> Benoit.
>>>
>>> On Thu, Dec 22, 2011 at 4:00 PM, Roman Chyla <[email protected]> wrote:
>>>> Hi,
>>>> I have tried the same - I could not reproduce it with 1M rows, but with
>>>> 10M yes. It started eating too much memory and swap. The objects are not
>>>> reclaimed (neither running the garbage collector nor resetting the
>>>> objects helps, which could mean the objects live outside of Python, in
>>>> the C extension - however, after some time it got somewhat better; the
>>>> 'somewhat' isn't very precise, I know :)). I think it might be a MySQLdb
>>>> bug.
>>>>
>>>> In [11]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 9.19 s, sys: 0.19 s, total: 9.38 s
>>>> Wall time: 12.55 s
>>>>
>>>> In [13]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 8.65 s, sys: 0.11 s, total: 8.76 s
>>>> Wall time: 11.88 s
>>>>
>>>> In [15]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 9.30 s, sys: 0.07 s, total: 9.37 s
>>>> Wall time: 12.52 s
>>>>
>>>> In [17]: len(res)
>>>> Out[17]: 10000000
>>>>
>>>> In [18]: i = range(50000000)
>>>>
>>>> In [19]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 13.21 s, sys: 3.06 s, total: 16.27 s
>>>> Wall time: 71.10 s
>>>>
>>>> In [21]: import gc
>>>>
>>>> In [22]: gc.collect()
>>>> Out[22]: 3
>>>>
>>>> In [23]: i = []
>>>>
>>>> In [24]: gc.collect()
>>>> Out[24]: 3
>>>>
>>>> In [25]: gc.isenabled()
>>>> Out[25]: True
>>>>
>>>> In [26]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 14.48 s, sys: 4.38 s, total: 18.86 s
>>>> Wall time: 56.63 s
>>>>
>>>> In [28]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 10.59 s, sys: 1.20 s, total: 11.79 s
>>>> Wall time: 24.57 s
>>>>
>>>> In [30]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 10.28 s, sys: 0.30 s, total: 10.58 s
>>>> Wall time: 15.50 s
>>>>
>>>> In [32]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 8.70 s, sys: 0.09 s, total: 8.79 s
>>>> Wall time: 12.17 s
>>>>
>>>> In [34]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 9.79 s, sys: 0.26 s, total: 10.05 s
>>>> Wall time: 14.50 s
>>>>
>>>> In [36]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 9.91 s, sys: 0.31 s, total: 10.22 s
>>>> Wall time: 15.92 s
>>>>
>>>> In [38]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 8.63 s, sys: 0.11 s, total: 8.74 s
>>>> Wall time: 12.05 s
>>>>
>>>> In [40]: len(i)
>>>> Out[40]: 0
>>>>
>>>> In [41]: i = range(50000000)
>>>>
>>>> In [42]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 22.97 s, sys: 4.95 s, total: 27.92 s
>>>> Wall time: 78.04 s
>>>>
>>>> In [44]: i = []
>>>>
>>>> In [45]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 15.02 s, sys: 3.32 s, total: 18.34 s
>>>> Wall time: 52.45 s
>>>>
>>>> In [47]: gc.collect()
>>>> Out[47]: 3
>>>>
>>>> In [48]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
>>>> CPU times: user 15.81 s, sys: 4.08 s, total: 19.89 s
>>>> Wall time: 57.72 s
>>>>
>>>>
>>>> roman
>>>>
>>>> On Thu, Dec 22, 2011 at 2:26 PM, Benoit Thiell <[email protected]> 
>>>> wrote:
>>>>> Hi Ferran.
>>>>>
>>>>> Thanks for taking the time to advise on this issue. The query I used
>>>>> is not the one that caused the problem originally; I just chose it
>>>>> because it returns a large number of results. So I can't really change
>>>>> anything on this side.
>>>>>
>>>>> Also the query runs fine by itself, without dictionaries loaded in
>>>>> memory. So I think it would be nice to try to get a sense of how much
>>>>> of an impact big data structures loaded in memory have on large SQL queries.
>>>>>
>>>>> Benoit.
>>>>>
>>>>> On Thu, Dec 22, 2011 at 5:26 AM, Ferran Jorba <[email protected]> 
>>>>> wrote:
>>>>>> Hello Benoit,
>>>>>>
>>>>>> [...]
>>>>>>>     In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>>>>>     CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
>>>>>>>     Wall time: 2.30 s
>>>>>>>
>>>>>>> Any idea about why we're seeing this and how we can fix it? It is
>>>>>>> quite a big problem for us as our citation dictionaries are so big.
>>>>>>
>>>>>> I have noticed in more than one case that for some minimally complex
>>>>>> (?!) operations the bottleneck is MySQL, not Python, so if you can move
>>>>>> part of the manipulation from one to the other you may be surprised.  I
>>>>>> cannot remember the exact case, but the equivalent for yours would be
>>>>>> changing:
>>>>>>
>>>>>>  res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>>>>
>>>>>> to:
>>>>>>
>>>>>>  res = run_sql("SELECT id_bibrec FROM bibrec_bib03x")
>>>>>>  res = res[:1000000]
>>>>>>
>>>>>> I remember gains of 10x.  YMMV, but you can try it.
>>>>>>
>>>>>> Ferran
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Benoit Thiell
>>>>> The SAO/NASA Astrophysics Data System
>>>>> http://adswww.harvard.edu/
>>>
>>>
>>>
>>> --
>>> Benoit Thiell
>>> The SAO/NASA Astrophysics Data System
>>> http://adswww.harvard.edu/
>
>
>
> --
> Benoit Thiell
> The SAO/NASA Astrophysics Data System
> http://adswww.harvard.edu/



-- 
Benoit Thiell
The SAO/NASA Astrophysics Data System
http://adswww.harvard.edu/
