Hello Marc,

Thanks for looking at this and for your responses. If I understand you
correctly, there is nothing to be done, which does not surprise me :-)
Best regards,

Kern

On Saturday 20 June 2009 13:16:44 Marc Cousin wrote:
> > Subject: Re: [Bacula-users] Performance with MySQL queries since 3.0.0
> > (Dir inserting attributes hang)
> > Date: Saturday 20 June 2009
> > From: Kern Sibbald <k...@sibbald.com>
> > To: Tom Sommer <m...@tomsommer.dk>
> >
> > On Saturday 20 June 2009 08:51:53 Tom Sommer wrote:
> > > Tom Sommer wrote:
> > > > Mike Holden wrote:
> > > >> Jari Fredriksson wrote:
> > > >>>> INSERT INTO Filename( Name )
> > > >>>> SELECT a.Name
> > > >>>> FROM (
> > > >>>>     SELECT DISTINCT Name
> > > >>>>     FROM batch
> > > >>>> ) AS a
> > > >>>> WHERE NOT EXISTS (
> > > >>>>     SELECT Name
> > > >>>>     FROM Filename AS f
> > > >>>>     WHERE f.Name = a.Name
> > > >>>> )
> > > >>>
> > > >>> That looks silly.
> > > >>>
> > > >>> I would write it shorter as
> > > >>>
> > > >>> INSERT INTO Filename(Name)
> > > >>> SELECT DISTINCT Name
> > > >>> FROM batch AS a
> > > >>> WHERE NOT EXISTS (
> > > >>>     SELECT Name
> > > >>>     FROM Filename AS f
> > > >>>     WHERE f.Name = a.Name
> > > >>> )
> > > >>
> > > >> You may also want to consider using a JOIN rather than a subquery
> > > >> with a NOT EXISTS, something like (untested and unvalidated!):
> > > >>
> > > >> INSERT INTO filename(name)
> > > >> SELECT DISTINCT name
> > > >> FROM batch AS a
> > > >> LEFT JOIN filename AS f USING (name)
> > > >> WHERE f.name IS NULL
> > > >>
> > > >> I know from experience as an Oracle DBA (my day job) that this can
> > > >> often produce far more efficient results.
> > > >>
> > > >> Of course, all options need testing for both speed of execution and
> > > >> resource usage, bearing in mind that data varies from one
> > > >> installation to another, and one size may not fit all!
> > > >
> > > > Good suggestions. It sounds like there might be an overall
> > > > performance problem with the current query for batch inserts with
> > > > lots of data. I'm a bit unsure whether I dare test these queries on
> > > > my current installation.
> > > >
> > > > I'll CC Kern on the thread; perhaps he has some insights.
> > >
> > > So this morning I had to kill the above query because it had been
> > > running for 24+ hours, preventing the new daily jobs from running.
> > > I think I'm going to try disabling batch inserts; the current
> > > situation is simply not good enough, and it has suddenly become a
> > > major headache to run backups. When I hit the end of this month and
> > > ALL servers have to run FULL backups, I'm going to be in a world of
> > > trouble, I think -- I just don't understand what has changed, because
> > > it had all been running great up until now.
> > >
> > > // Tom
> >
> > Hello,
> >
> > We can certainly use help optimizing SQL, since we are not DBAs, and we
> > will look into the SQL optimization suggestions given above, keeping in
> > mind that there are often rather radical differences in the timing of
> > particular SQL queries depending on the database engine used.
> >
> > To the best of my knowledge, nothing has changed in the batch-insert
> > queries since they were implemented, and it is *very* unlikely (though
> > I haven't checked the code) that something changed from 2.4.4 to 3.0.x.
> >
> > More likely, your workload or MySQL has changed in some way -- e.g.
> > more Jobs, more machines backed up, a Director machine with less memory
> > or other jobs that use memory, a new version of MySQL, ...
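A quick, untested way to check that last point is to ask MySQL directly;
this sketch assumes the standard Bacula catalog table names (Filename,
Path) and a my.cnf-driven setup:

    SELECT VERSION();                        -- did the MySQL version change?
    SHOW VARIABLES LIKE 'key_buffer_size';   -- is the large-memory my.cnf in effect?
    SHOW INDEX FROM Filename;                -- is the index on Filename.Name still present?
    SHOW INDEX FROM Path;                    -- is the index on Path.Path still present?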
> >
> > - I would suggest that you ensure that your database has all the
> > recommended indexes (see the make_mysql_tables file) and that you are
> > running with the large-memory /etc/my.cnf file.
> >
> > - Another thing to do is to compact your database. One way to do it is
> > to write it out to an ASCII file and then re-insert it.
> >
> > - If you are running certain programs that create and delete lots of
> > temporary files with different names, your Filename table may need
> > cleaning.
> >
> > - I would strongly recommend not starting *lots* of Full backups at the
> > same time or on the same day. By lots, I mean more than, say, 10 or 20
> > (it depends on the size of your system). It is generally far better to
> > stage 1/4 of the Full backups each week so that the peak workload is
> > spread out over the month.
> >
> > - If the bottleneck is in MySQL, you might consider moving it to
> > another machine that has more memory and faster disks.
> >
> > - If you really have a huge number of backups (say 50 or more) that all
> > run at the same time, it might be advisable to consider using
> > PostgreSQL, but in that case you will probably need an onsite DBA to
> > properly tune and maintain it.
> >
> > - Regardless of what hardware you have, there are certain limitations
> > on how many simultaneous jobs you can run (on many systems this kicks
> > in around 50). Once a certain number is exceeded, the total throughput
> > can decrease rather radically, so careful monitoring is necessary.
> > Bweb can help a lot in these situations.
> >
> > - If you are really "in a world of trouble" and it is a performance
> > issue, there is not much we (the Bacula project) can do for you other
> > than the above tips. However, ($$) Bacula Systems has tools that can
> > help more precisely identify bottlenecks and help balance loads.
> >
> > Regards,
> >
> > Kern
> >
> > -------------------------------------------------------
>
> Sorry Kern, I wasn't monitoring this thread (mysql in the title :) ).
> I wrote these queries, so maybe I should explain -- not why I'm right
> (I'm not sure of that), but why we've done it this way.
>
> A few forewords:
> - Batch insert was originally created for PostgreSQL. I don't have any
> experience with MySQL, except knowing that at the time we programmed
> batch insert it didn't really do anything but nested loops...
> - The main reason for batch inserts with PostgreSQL was to use COPY
> instead of INSERT, which is much faster (we also had good results with
> stored procedures, but batch insert was even faster).
> - Our current File table is nearly 200GB (1 billion rows) in a 250GB
> database, and we insert around 50 million records a day (rough estimate).
>
> * First question I've seen in the thread:
>
> Why
>
> INSERT INTO Filename( Name )
> SELECT a.Name
> FROM (
>     SELECT DISTINCT Name
>     FROM batch
> ) AS a
> WHERE NOT EXISTS (
>     SELECT Name
>     FROM Filename AS f
>     WHERE f.Name = a.Name
> )
>
> instead of the much simpler
>
> INSERT INTO Filename(Name)
> SELECT DISTINCT Name
> FROM batch AS a
> WHERE NOT EXISTS (
>     SELECT Name
>     FROM Filename AS f
>     WHERE f.Name = a.Name
> )
>
> => Because the first one removes the duplicate rows from the batch table
> before checking whether they exist in the Filename table, while the
> second one checks existence first and removes duplicates afterwards.
> Performance was better in benchmarks with the first one (with
> PostgreSQL, with our data).
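A quick way to reproduce this comparison on your own data is to EXPLAIN
ANALYZE the SELECT part of each variant and compare the plans and timings.
An untested sketch; it assumes PostgreSQL, as in Marc's benchmarks, and the
Filename and batch tables discussed above:

    -- Variant 1: de-duplicate batch first, then probe Filename
    EXPLAIN ANALYZE
    SELECT a.Name
    FROM (SELECT DISTINCT Name FROM batch) AS a
    WHERE NOT EXISTS
        (SELECT Name FROM Filename AS f WHERE f.Name = a.Name);

    -- Variant 2: probe Filename row by row, de-duplicate afterwards
    EXPLAIN ANALYZE
    SELECT DISTINCT Name
    FROM batch AS a
    WHERE NOT EXISTS
        (SELECT Name FROM Filename AS f WHERE f.Name = a.Name);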
> * Second question:
>
> Why not
>
> INSERT INTO filename(name)
> SELECT DISTINCT name
> FROM batch AS a
> LEFT JOIN filename AS f USING (name)
> WHERE f.name IS NULL
>
> In fact, if the engine chooses the right execution path, the execution
> plan will be roughly the same (a nested loop) as for
>
> INSERT INTO Filename(Name)
> SELECT DISTINCT Name
> FROM batch AS a
> WHERE NOT EXISTS (
>     SELECT Name
>     FROM Filename AS f
>     WHERE f.Name = a.Name
> )
>
> except when batch is close in size to Filename (for instance, on my
> Bacula database, that would be a backup of 50 million files). Anyway, if
> I recall correctly about MySQL (and it hasn't changed recently), MySQL
> will do a nested loop for both, as it doesn't know how to do a merge or
> hash join...
>
> But I agree that this second syntax theoretically gives more latitude to
> the engine. Except that recent Oracle and PostgreSQL versions will in
> any case be able to convert between the EXISTS/NOT EXISTS form and the
> join (or IN) form in this case, as they are semantically equivalent.
>
> Now, I don't know how MySQL would react to all these queries. An explain
> plan would be great for understanding... but I would have thought that
> the NOT EXISTS query would be great for its optimizer, as it tells MySQL
> exactly what to do :)
>
> A last note for PostgreSQL users on this mailing list:
> You can try to get better performance with batch inserts by lowering
> seq_page_cost and random_page_cost, so that the engine favours a nested
> loop instead of a hash join for the INSERT INTO File query: most of the
> time, most of Filename and Path will be in the database or filesystem
> cache, and that plan will be much faster (and will save you big sorts).
>
> Please tell me if you need more information.
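For PostgreSQL users who want to try that last tip without editing
postgresql.conf: the planner costs can also be lowered for a single
transaction with SET LOCAL. An untested sketch; the values below are
purely illustrative, not recommendations:

    BEGIN;
    SET LOCAL seq_page_cost = 0.5;      -- illustrative value only
    SET LOCAL random_page_cost = 1.5;   -- illustrative value only
    -- run the usual batch INSERT INTO File ... SELECT ... FROM batch here
    COMMIT;                             -- SET LOCAL settings revert at commit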