Hi,
I have tried the same - I could not reproduce it with 1M rows, but with
10M I could. It started eating too much memory and swap. The objects are
not reclaimed (neither the garbage collector nor resetting the objects
helps, which could mean the objects live outside of Python, in the C
extension) - though after some time it got somewhat better; the 'somewhat'
isn't very precise, I know :). I think it might be a MySQLdb bug.
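
(For reference, a rough sketch of what I assume run_sql boils down to - a
thin wrapper around MySQLdb's execute/fetchall, so the whole result set is
materialised on the client; the connection parameters and table name below
are placeholders, not the real setup:)

  import time
  import MySQLdb  # the C-extension driver under suspicion

  # Hypothetical connection parameters - adjust to your installation.
  conn = MySQLdb.connect(host="localhost", user="invenio",
                         passwd="secret", db="invenio")

  def run_sql(query):
      # Minimal stand-in for run_sql: execute and fetch everything at once,
      # so all rows end up in client memory.
      cur = conn.cursor()
      cur.execute(query)
      rows = cur.fetchall()
      cur.close()
      return rows

  t0 = time.time()
  res = run_sql("SELECT id FROM testb LIMIT 10000000")
  print "fetched %d rows in %.2f s" % (len(res), time.time() - t0)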

In [11]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 9.19 s, sys: 0.19 s, total: 9.38 s
Wall time: 12.55 s

In [13]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 8.65 s, sys: 0.11 s, total: 8.76 s
Wall time: 11.88 s

In [15]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 9.30 s, sys: 0.07 s, total: 9.37 s
Wall time: 12.52 s

In [17]: len(res)
Out[17]: 10000000

In [18]: i = range(50000000)

In [19]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 13.21 s, sys: 3.06 s, total: 16.27 s
Wall time: 71.10 s

In [21]: import gc

In [22]: gc.collect()
Out[22]: 3

In [23]: i = []

In [24]: gc.collect()
Out[24]: 3

In [25]: gc.isenabled()
Out[25]: True

In [26]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 14.48 s, sys: 4.38 s, total: 18.86 s
Wall time: 56.63 s

In [28]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 10.59 s, sys: 1.20 s, total: 11.79 s
Wall time: 24.57 s

In [30]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 10.28 s, sys: 0.30 s, total: 10.58 s
Wall time: 15.50 s

In [32]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 8.70 s, sys: 0.09 s, total: 8.79 s
Wall time: 12.17 s

In [34]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 9.79 s, sys: 0.26 s, total: 10.05 s
Wall time: 14.50 s

In [36]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 9.91 s, sys: 0.31 s, total: 10.22 s
Wall time: 15.92 s

In [38]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 8.63 s, sys: 0.11 s, total: 8.74 s
Wall time: 12.05 s

In [40]: len(i)
Out[40]: 0

In [41]: i = range(50000000)

In [42]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 22.97 s, sys: 4.95 s, total: 27.92 s
Wall time: 78.04 s

In [44]: i = []

In [45]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 15.02 s, sys: 3.32 s, total: 18.34 s
Wall time: 52.45 s

In [47]: gc.collect()
Out[47]: 3

In [48]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
CPU times: user 15.81 s, sys: 4.08 s, total: 19.89 s
Wall time: 57.72 s
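
(The same experiment as a small script, in case someone wants to replay it -
assuming the run_sql wrapper sketched above and the same testb table; the
timings in the comments are the ones from my session:)

  import gc
  import time

  def timed_query():
      t0 = time.time()
      res = run_sql("SELECT id FROM testb LIMIT 10000000")
      print "%d rows, wall time %.2f s" % (len(res), time.time() - t0)

  timed_query()            # baseline: ~12 s wall time here

  i = range(50000000)      # hold ~50M ints, pushing the process into swap
  timed_query()            # the same query now takes 50-80 s

  i = []                   # drop the list...
  gc.collect()             # ...and force a collection
  timed_query()            # the slowdown only fades gradually over a few runs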


roman

On Thu, Dec 22, 2011 at 2:26 PM, Benoit Thiell <[email protected]> wrote:
> Hi Jorban.
>
> Thanks for taking the time to advise on this issue. The query I used
> is not the one that caused the problem originally; I just chose it
> because it returns a large number of results. So I can't really change
> anything on that side.
>
> Also, the query runs fine by itself, without the dictionaries loaded in
> memory. So I think it would be nice to try to get a sense of how much
> large data structures loaded in memory impact large SQL queries.
>
> Benoit.
>
> On Thu, Dec 22, 2011 at 5:26 AM, Ferran Jorba <[email protected]> wrote:
>> Hello Benoit,
>>
>> [...]
>>>     In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>     CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
>>>     Wall time: 2.30 s
>>>
>>> Any idea about why we're seeing this and how we can fix it? It is
>>> quite a big problem for us as our citation dictionaries are so big.
>>
>> I have noticed in more than one case that for some minimally complex
>> (?!) operations the bottleneck is MySQL, not Python, so if you can move
>> part of the manipulation from one to the other you can get surprising
>> gains.  I cannot remember the exact case, but the equivalent in your
>> case would be changing:
>>
>>  res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>
>> to:
>>
>>  res = run_sql("SELECT id_bibrec FROM bibrec_bib03x")
>>  res = res[:1000000]
>>
>> I remember gains of 10x.  YMMV, but you can try it.
>>
>> Ferran
>
>
>
> --
> Benoit Thiell
> The SAO/NASA Astrophysics Data System
> http://adswww.harvard.edu/
