Hi Roman.

Thanks for your test. Here is the latest news:

I tried with yet another MySQL bridge (https://launchpad.net/myconnpy)
and got very similar results. This makes me think that the problem is
not specific to any particular Python MySQL bridge, but lies at some
lower level. So, on Alberto's advice, I ran the following test, which
emulates a SQL request by returning a list of 1M tuples containing
incrementing integers:

    In [1]: def run_sql():
       ...:     return [(i,) for i in range(1000000)]

    In [2]: %time res = run_sql()
    CPU times: user 0.94 s, sys: 0.05 s, total: 0.99 s
    Wall time: 0.99 s

    In [3]: i = range(50000000)

    In [4]: %time res = run_sql()
    CPU times: user 10.33 s, sys: 0.04 s, total: 10.37 s
    Wall time: 10.37 s

    In [5]: i = range(50000000)

    In [6]: %time res = run_sql()
    CPU times: user 11.41 s, sys: 0.00 s, total: 11.41 s
    Wall time: 11.41 s

You can see that creating a data structure similar to the one returned
by run_sql() takes roughly ten times longer when a big list is already
sitting in memory. The one thing I don't quite understand is why the
third "SQL request" takes barely longer than the second.

This is a very serious problem for us because it means that any SQL
request will take a long time to return simply because we have two
gigantic citation dictionaries loaded in memory. So either we find a
way to prevent this slowdown, or we need to stop loading the
dictionaries into memory altogether.
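
If the garbage collector does turn out to be the cause, a possible
stopgap, sketched below and untested on our data, would be to suspend
the collector around big fetches instead of giving up the in-memory
dictionaries (gc_paused is a hypothetical helper, not something that
exists in our codebase):

    import gc
    from contextlib import contextmanager

    @contextmanager
    def gc_paused():
        # Temporarily suspend the cyclic collector so that building a
        # large result list does not trigger full collections that
        # traverse every tracked object in the process.
        was_enabled = gc.isenabled()
        gc.disable()
        try:
            yield
        finally:
            if was_enabled:
                gc.enable()

    # Hypothetical usage around a large query:
    # with gc_paused():
    #     res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")

The result list contains no reference cycles, so deferring collection
until after the fetch should be safe.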

Benoit.

On Thu, Dec 22, 2011 at 4:00 PM, Roman Chyla <[email protected]> wrote:
> Hi,
> I have tried the same - I could not reproduce it with 1M rows, but
> with 10M I could. It started eating too much memory and swap. The
> objects are not reclaimed; no garbage collection helps, nor does
> resetting the objects, which could mean the objects live outside of
> Python, in the C extension - however, after some time it got somewhat
> better (the 'somewhat' isn't very precise, I know :)). I think it
> might be a mysqldb bug.
>
> In [11]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 9.19 s, sys: 0.19 s, total: 9.38 s
> Wall time: 12.55 s
>
> In [13]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 8.65 s, sys: 0.11 s, total: 8.76 s
> Wall time: 11.88 s
>
> In [15]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 9.30 s, sys: 0.07 s, total: 9.37 s
> Wall time: 12.52 s
>
> In [17]: len(res)
> Out[17]: 10000000
>
> In [18]: i = range(50000000)
>
> In [19]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 13.21 s, sys: 3.06 s, total: 16.27 s
> Wall time: 71.10 s
>
> In [21]: import gc
>
> In [22]: gc.collect()
> Out[22]: 3
>
> In [23]: i = []
>
> In [24]: gc.collect()
> Out[24]: 3
>
> In [25]: gc.isenabled()
> Out[25]: True
>
> In [26]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 14.48 s, sys: 4.38 s, total: 18.86 s
> Wall time: 56.63 s
>
> In [28]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 10.59 s, sys: 1.20 s, total: 11.79 s
> Wall time: 24.57 s
>
> In [30]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 10.28 s, sys: 0.30 s, total: 10.58 s
> Wall time: 15.50 s
>
> In [32]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 8.70 s, sys: 0.09 s, total: 8.79 s
> Wall time: 12.17 s
>
> In [34]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 9.79 s, sys: 0.26 s, total: 10.05 s
> Wall time: 14.50 s
>
> In [36]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 9.91 s, sys: 0.31 s, total: 10.22 s
> Wall time: 15.92 s
>
> In [38]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 8.63 s, sys: 0.11 s, total: 8.74 s
> Wall time: 12.05 s
>
> In [40]: len(i)
> Out[40]: 0
>
> In [41]: i = range(50000000)
>
> In [42]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 22.97 s, sys: 4.95 s, total: 27.92 s
> Wall time: 78.04 s
>
> In [44]: i = []
>
> In [45]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 15.02 s, sys: 3.32 s, total: 18.34 s
> Wall time: 52.45 s
>
> In [47]: gc.collect()
> Out[47]: 3
>
> In [48]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 15.81 s, sys: 4.08 s, total: 19.89 s
> Wall time: 57.72 s
>
>
> roman
>
> On Thu, Dec 22, 2011 at 2:26 PM, Benoit Thiell <[email protected]> wrote:
>> Hi Ferran.
>>
>> Thanks for taking the time to advise on this issue. The query I used
>> is not the one that caused the problem originally; I just chose it
>> because it returns a large number of results. So I can't really change
>> anything on that side.
>>
>> Also, the query runs fine by itself, without the dictionaries loaded
>> in memory. So I think it would be useful to get a sense of how much
>> big data structures already in memory slow down large SQL queries.
>>
>> Benoit.
>>
>> On Thu, Dec 22, 2011 at 5:26 AM, Ferran Jorba <[email protected]> wrote:
>>> Hello Benoit,
>>>
>>> [...]
>>>>     In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>>     CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
>>>>     Wall time: 2.30 s
>>>>
>>>> Any idea about why we're seeing this and how we can fix it? It is
>>>> quite a big problem for us as our citation dictionaries are so big.
>>>
>>> I have noticed in more than one case that for some minimally complex
>>> (?!) operations the bottleneck is MySQL, not Python, so if you can
>>> move part of the manipulation from one to the other you may be in for
>>> a surprise.  I cannot remember the exact case, but the equivalent in
>>> your case would be changing:
>>>
>>>  res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>
>>> to:
>>>
>>>  res = run_sql("SELECT id_bibrec FROM bibrec_bib03x")
>>>  res = res[:1000000]
>>>
>>> I remember gains of 10x.  YMMV, but you can try it.
>>>
>>> Ferran
>>
>> --
>> Benoit Thiell
>> The SAO/NASA Astrophysics Data System
>> http://adswww.harvard.edu/



-- 
Benoit Thiell
The SAO/NASA Astrophysics Data System
http://adswww.harvard.edu/
