Hi Roman. Thanks for your test. Here's the latest news:
I tried with yet another MySQL bridge (https://launchpad.net/myconnpy) and
got very similar results. This makes me think that the problem is not
specific to any one Python MySQL bridge but comes from something lower
level. So, on Alberto's advice, I ran the following test, which emulates a
SQL request and returns a list of 1M tuples containing incrementing
integers:

In [1]: def run_sql():
   ...:     return [(i,) for i in range(1000000)]

In [2]: %time res = run_sql()
CPU times: user 0.94 s, sys: 0.05 s, total: 0.99 s
Wall time: 0.99 s

In [3]: i = range(50000000)

In [4]: %time res = run_sql()
CPU times: user 10.33 s, sys: 0.04 s, total: 10.37 s
Wall time: 10.37 s

In [5]: i = range(50000000)

In [6]: %time res = run_sql()
CPU times: user 11.41 s, sys: 0.00 s, total: 11.41 s
Wall time: 11.41 s

You can see that building a data structure similar to the one returned by
run_sql() takes significantly longer when a big list is already in memory.
I don't quite understand why the third "SQL request" does not take much
longer than the second.

This is a very serious problem for us because it means that any SQL request
will take a long time to return simply because we have our two gigantic
citation dictionaries loaded in memory. So either we find a way to prevent
this slowdown, or we need to find a way to avoid loading the dictionaries
into memory.
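One hypothesis worth testing is that this comes from CPython's cyclic
garbage collector: a collection pass traverses every tracked object, so the
more containers we already hold, the more every large allocation costs. A
quick check would be to repeat the test with the collector disabled (just a
sketch using the standard gc module, I have not run it yet):

import gc
import time

def run_sql():
    # Emulates the SQL request: a list of 1M one-integer tuples.
    return [(i,) for i in range(1000000)]

big = range(50000000)  # stands in for the citation dictionaries

t0 = time.time()
res = run_sql()
print("gc enabled:  %.2f s" % (time.time() - t0))

gc.disable()  # keep the collector from walking the 50M live references
t0 = time.time()
res = run_sql()
print("gc disabled: %.2f s" % (time.time() - t0))
gc.enable()

If the collector is the culprit, the second timing should fall back to
about one second.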
Benoit.

On Thu, Dec 22, 2011 at 4:00 PM, Roman Chyla <[email protected]> wrote:
> Hi,
> I have tried the same - I could not reproduce it with 1M rows, but with
> 10M, yes. It started eating too much memory and swap. The objects are
> not reclaimed (neither the garbage collector nor resetting the objects
> helps, which could mean objects living outside of Python, in the
> C extension - however, after some time it got somewhat better; the
> 'somewhat' isn't very precise, I know :)). I think it might be a
> mysqldb bug.
>
> In [11]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 9.19 s, sys: 0.19 s, total: 9.38 s
> Wall time: 12.55 s
>
> In [13]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 8.65 s, sys: 0.11 s, total: 8.76 s
> Wall time: 11.88 s
>
> In [15]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 9.30 s, sys: 0.07 s, total: 9.37 s
> Wall time: 12.52 s
>
> In [17]: len(res)
> Out[17]: 10000000
>
> In [18]: i = range(50000000)
>
> In [19]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 13.21 s, sys: 3.06 s, total: 16.27 s
> Wall time: 71.10 s
>
> In [21]: import gc
>
> In [22]: gc.collect()
> Out[22]: 3
>
> In [23]: i = []
>
> In [24]: gc.collect()
> Out[24]: 3
>
> In [25]: gc.isenabled()
> Out[25]: True
>
> In [26]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 14.48 s, sys: 4.38 s, total: 18.86 s
> Wall time: 56.63 s
>
> In [28]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 10.59 s, sys: 1.20 s, total: 11.79 s
> Wall time: 24.57 s
>
> In [30]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 10.28 s, sys: 0.30 s, total: 10.58 s
> Wall time: 15.50 s
>
> In [32]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 8.70 s, sys: 0.09 s, total: 8.79 s
> Wall time: 12.17 s
>
> In [34]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 9.79 s, sys: 0.26 s, total: 10.05 s
> Wall time: 14.50 s
>
> In [36]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 9.91 s, sys: 0.31 s, total: 10.22 s
> Wall time: 15.92 s
>
> In [38]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 8.63 s, sys: 0.11 s, total: 8.74 s
> Wall time: 12.05 s
>
> In [40]: len(i)
> Out[40]: 0
>
> In [41]: i = range(50000000)
>
> In [42]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 22.97 s, sys: 4.95 s, total: 27.92 s
> Wall time: 78.04 s
>
> In [44]: i = []
>
> In [45]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 15.02 s, sys: 3.32 s, total: 18.34 s
> Wall time: 52.45 s
>
> In [47]: gc.collect()
> Out[47]: 3
>
> In [48]: %time res = run_sql("SELECT id FROM testb LIMIT 10000000")
> CPU times: user 15.81 s, sys: 4.08 s, total: 19.89 s
> Wall time: 57.72 s
>
> roman
>
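Roman: if mysqldb itself is at fault, one workaround we could try is to
stream the rows with a server-side cursor instead of materializing all 10M
tuples at once. Something along these lines (a sketch; the connection
parameters are placeholders and I have not checked how SSCursor behaves at
this scale):

import MySQLdb
import MySQLdb.cursors

# Placeholder connection parameters.
conn = MySQLdb.connect(host="localhost", user="test", passwd="secret",
                       db="testdb",
                       cursorclass=MySQLdb.cursors.SSCursor)
cur = conn.cursor()
cur.execute("SELECT id FROM testb LIMIT 10000000")

# With SSCursor the result set stays on the server and rows are fetched
# as we iterate, so the client never holds all 10M tuples at once.
count = 0
for row in cur:
    count += 1
print("rows seen: %d" % count)

cur.close()
conn.close()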
> On Thu, Dec 22, 2011 at 2:26 PM, Benoit Thiell <[email protected]> wrote:
>> Hi Jorban.
>>
>> Thanks for taking the time to advise on this issue. The query I used
>> is not the one that caused the problem originally; I just chose it
>> because it returns a large number of results. So I can't really change
>> anything on that side.
>>
>> Also, the query runs fine by itself, without the dictionaries loaded
>> in memory. So I think it would be nice to try and get a sense of how
>> much big data structures loaded in memory impact large SQL queries.
>>
>> Benoit.
>>
>> On Thu, Dec 22, 2011 at 5:26 AM, Ferran Jorba <[email protected]> wrote:
>>> Hello Benoit,
>>>
>>> [...]
>>>> In [4]: %time res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>> CPU times: user 1.96 s, sys: 0.06 s, total: 2.02 s
>>>> Wall time: 2.30 s
>>>>
>>>> Any idea about why we're seeing this and how we can fix it? It is
>>>> quite a big problem for us as our citation dictionaries are so big.
>>>
>>> I have noticed in more than one case that for some minimally complex
>>> (?!) operations the bottleneck is MySQL, not Python, so if you can
>>> move part of the manipulation from one to the other you can be in for
>>> surprises. I cannot remember the exact case, but the equivalent in
>>> your case would be changing:
>>>
>>> res = run_sql("SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000")
>>>
>>> to:
>>>
>>> res = run_sql("SELECT id_bibrec FROM bibrec_bib03x")
>>> res = res[:1000000]
>>>
>>> I remember gains of 10x. YMMV, but you can try it.
>>>
>>> Ferran
>>
>> --
>> Benoit Thiell
>> The SAO/NASA Astrophysics Data System
>> http://adswww.harvard.edu/

--
Benoit Thiell
The SAO/NASA Astrophysics Data System
http://adswww.harvard.edu/
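P.S. To check Ferran's suggestion properly, I will time the two variants
side by side, something like this (a sketch; I am assuming our usual
run_sql helper from invenio.dbquery):

import time

from invenio.dbquery import run_sql  # assumption: standard Invenio helper

def timed(label, func):
    # Run func once and report the wall time and result size.
    t0 = time.time()
    res = func()
    print("%s: %.2f s, %d rows" % (label, time.time() - t0, len(res)))

# Variant 1: let MySQL apply the LIMIT.
timed("LIMIT in SQL", lambda: run_sql(
    "SELECT id_bibrec FROM bibrec_bib03x LIMIT 1000000"))

# Variant 2: fetch everything, slice in Python (Ferran's suggestion).
timed("slice in Python", lambda: run_sql(
    "SELECT id_bibrec FROM bibrec_bib03x")[:1000000])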

