Based on following the thread, I recompiled Bacula with the following options:

./configure --enable-smartalloc --with-mysql --with-tcp-wrappers=/usr/lib/libwrap.so --enable-static-fd --enable-static-dir --enable-static-sd --disable-batch-insert

My 18+ hour backup ran in 9 hours 22 mins 27 secs last night. It seems to be back to the pre-3.0 time.

As for my tables, I have an Admin job that optimizes my Bacula tables within MySQL:

Job {
  Name = optimize_table_JobMedia
  JobDefs = "DefaultJob"
  Type = Admin
  Client = bacula-local-fd
  RunBeforeJob = "/etc/bacula/optimize_table_JobMedia.sh"
  Priority = 40
  Cancel Queued Duplicates = yes
  Maximum Concurrent Jobs = 1
  Run Before Job = "/etc/bacula/check-duplicate-queued-job.sh '%n' '%i'"
}
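The optimize script itself boils down to a single statement against the catalog. As a minimal sketch (the real script isn't shown here, so the database name and the choice of table are assumptions based on the job name):

    -- run from /etc/bacula/optimize_table_JobMedia.sh via the mysql client,
    -- e.g.: mysql bacula -e "OPTIMIZE TABLE JobMedia;"
    OPTIMIZE TABLE JobMedia;  -- rebuilds the table and defragments its index

On MyISAM tables, OPTIMIZE TABLE holds a table lock while it runs, so driving it from a low-priority Admin job as above keeps it out of the way of the backups themselves.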
Please let me know what else I can do to provide useful data for the improvement of Bacula (the world's best backup software!).

Thanks,
-Jason

On Sun, 2009-06-21 at 08:11 +0200, Marc Cousin wrote:
> First thing to know is which of the queries takes time (there are the
> multiple inserts into the batch table, then the insert of missing
> filenames/paths, and the final insert). Or is it slow with all the
> queries? Which storage engine?
>
> There are two possibilities:
>
> - either you have a bad plan for one of the queries (but I don't see how,
>   MySQL having a very simple optimizer),
> - or you have a tuning problem with MySQL (there I can't really help you:
>   not enough memory? fragmented indexes? I don't really know how MySQL
>   works).
>
> Kern, I have not been following the 3.0 development closely: do the
> xattrs go into the lstat field?
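A practical way to gather what Marc is asking for: MySQL 5.0 can only EXPLAIN a SELECT, so to see the plan of the insert queries quoted below, strip the INSERT INTO wrapper and explain the SELECT part. A minimal sketch (table and column names as used in those queries; note that the batch table is created by the Director at insert time, so it may not be visible from another session):

    EXPLAIN SELECT a.Name
    FROM (SELECT DISTINCT Name FROM batch) AS a
    WHERE NOT EXISTS (SELECT Name FROM Filename AS f WHERE f.Name = a.Name);

SHOW TABLE STATUS in the bacula database answers the storage-engine question; the Engine column is reported per table.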
> On Sunday 21 June 2009 01:52:34 Jason A. Kates wrote:
> > I have a backup that runs for 3 hours, then spends another 15 hours in
> > status "Dir inserting Attributes".
> >
> > The backup has 9,442,578 files and 239,718,461,832 bytes. It was much
> > faster with Bacula before, and I thought that the time change was due
> > to additional data with xattrsupport that didn't exist before 3.0. The
> > job is a spool job with spooled attributes too.
> >
> > We are using mysql-server-5.0.45-7.el5. I would be glad to collect and
> > share data. Please let me know what you would like me to set or do to
> > get the appropriate data.
> >
> > Thanks, -Jason
> >
> > On Sat, 2009-06-20 at 17:48 +0200, Marc Cousin wrote:
> > > There may be something. But first we have to know why MySQL behaves
> > > that slowly for Tom Sommer. The problem being that I don't have a big
> > > MySQL database to test on... So as long as we don't have the query
> > > plan or an explanation...
> > >
> > > On Saturday 20 June 2009 14:36:44 Kern Sibbald wrote:
> > > > Hello Marc,
> > > >
> > > > Thanks for looking at this and for your responses. If I understand
> > > > you correctly, there is nothing to be done, which does not surprise
> > > > me :-)
> > > >
> > > > Best regards,
> > > > Kern
> > > >
> > > > On Saturday 20 June 2009 13:16:44 Marc Cousin wrote:
> > > > > > Subject: Re: [Bacula-users] Performance with MySQL queries since 3.0.0 (Dir inserting attributes hang)
> > > > > > Date: Saturday 20 June 2009
> > > > > > From: Kern Sibbald <k...@sibbald.com>
> > > > > > To: Tom Sommer <m...@tomsommer.dk>
> > > > > >
> > > > > > On Saturday 20 June 2009 08:51:53 Tom Sommer wrote:
> > > > > > > Tom Sommer wrote:
> > > > > > > > Mike Holden wrote:
> > > > > > > >> Jari Fredriksson wrote:
> > > > > > > >>>> INSERT INTO Filename( Name )
> > > > > > > >>>> SELECT a.Name
> > > > > > > >>>> FROM (
> > > > > > > >>>>     SELECT DISTINCT Name
> > > > > > > >>>>     FROM batch
> > > > > > > >>>> ) AS a
> > > > > > > >>>> WHERE NOT EXISTS (
> > > > > > > >>>>     SELECT Name
> > > > > > > >>>>     FROM Filename AS f
> > > > > > > >>>>     WHERE f.Name = a.Name
> > > > > > > >>>> )
> > > > > > > >>>
> > > > > > > >>> That looks silly.
> > > > > > > >>>
> > > > > > > >>> I would write it shorter as:
> > > > > > > >>>
> > > > > > > >>> INSERT INTO Filename(Name)
> > > > > > > >>> SELECT DISTINCT Name
> > > > > > > >>> FROM batch AS a
> > > > > > > >>> WHERE NOT EXISTS (
> > > > > > > >>>     SELECT Name
> > > > > > > >>>     FROM Filename AS f
> > > > > > > >>>     WHERE f.Name = a.Name
> > > > > > > >>> )
> > > > > > > >>
> > > > > > > >> You may also want to consider using a JOIN rather than a
> > > > > > > >> subquery with a NOT EXISTS, something like (untested and
> > > > > > > >> unvalidated!):
> > > > > > > >>
> > > > > > > >> INSERT INTO filename(name)
> > > > > > > >> SELECT DISTINCT name
> > > > > > > >> FROM batch AS a
> > > > > > > >> LEFT JOIN filename AS f USING (name)
> > > > > > > >> WHERE f.name IS NULL
> > > > > > > >>
> > > > > > > >> I know from experience as an Oracle DBA (my day job) that
> > > > > > > >> this can often produce far more efficient results.
> > > > > > > >>
> > > > > > > >> Of course, all options need testing for both speed of
> > > > > > > >> execution and resource usage, bearing in mind that data
> > > > > > > >> varies from one installation to another, and one size may
> > > > > > > >> not fit all!
> > > > > > > >
> > > > > > > > Good suggestions. It sounds like there might be an overall
> > > > > > > > performance problem with the current query for batch inserts
> > > > > > > > with lots of data. I'm a bit unsure whether I dare test these
> > > > > > > > queries on my current installation.
> > > > > > > >
> > > > > > > > I'll CC Kern on the thread; perhaps he has some insights.
> > > > > > >
> > > > > > > So this morning I had to kill the above query because it had
> > > > > > > been running for 24+ hours, preventing the new daily jobs from
> > > > > > > running. I think I'm going to try and disable batch inserts;
> > > > > > > the current situation is simply not good enough, and it's
> > > > > > > become a major headache to run backups suddenly. When I hit the
> > > > > > > end of this month and ALL servers have to run FULL backups,
> > > > > > > I'm gonna be in a world of trouble, I think. I just don't
> > > > > > > understand what has changed, because it had all been running
> > > > > > > great up until now.
> > > > > > >
> > > > > > > // Tom
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > We certainly can use help optimizing SQL since we are not DBAs,
> > > > > > and we will look into the SQL optimization suggestions given
> > > > > > above, keeping in mind that there are often rather radical
> > > > > > differences in the timing of particular SQL queries depending on
> > > > > > the database engine used.
> > > > > >
> > > > > > To the best of my knowledge, nothing has changed in the batch
> > > > > > insert queries since they were implemented, and it is *very*
> > > > > > unlikely (though I haven't checked the code) that something
> > > > > > changed from 2.4.4 to 3.0.x.
> > > > > >
> > > > > > More likely, your workload or MySQL has changed in some way --
> > > > > > e.g. more jobs, more machines backed up, a Director machine with
> > > > > > less memory or other jobs that use memory, a new version of
> > > > > > MySQL, ...
> > > > > >
> > > > > > - I would suggest that you ensure that your database has all the
> > > > > >   recommended indexes (see the make_mysql_tables file), and that
> > > > > >   you are running with the large-memory /etc/my.cnf file.
> > > > > >
> > > > > > - Another thing to do is to compact your database. One way to do
> > > > > >   it is to write it to an ASCII file and then re-insert it.
> > > > > >
> > > > > > - If you are running certain programs that create and delete
> > > > > >   lots of temporary files with different names, your Filename
> > > > > >   table may need cleaning.
> > > > > >
> > > > > > - I would strongly recommend not starting *lots* of Full backups
> > > > > >   at the same time or on the same day. By lots, I mean more than
> > > > > >   say 10 or 20 (depending on the size of your system). It is
> > > > > >   generally far better to stage 1/4 of the backups each week for
> > > > > >   a full backup, so that the peak workload is spread out over
> > > > > >   the month.
> > > > > >
> > > > > > - If the bottleneck is in MySQL, you might consider moving it to
> > > > > >   another machine that has more memory and faster disks.
> > > > > >
> > > > > > - If you really have a huge number of backups (say 50 or more)
> > > > > >   that all run at the same time, it might be advisable to
> > > > > >   consider using PostgreSQL, but in that case you will probably
> > > > > >   need an onsite DBA to properly tune and maintain it.
> > > > > >
> > > > > > - Regardless of what hardware you have, there are certain
> > > > > >   limitations on how many simultaneous jobs you can run (this
> > > > > >   kicks in on many systems around 50). Once a certain number is
> > > > > >   exceeded, the total throughput can decrease rather radically,
> > > > > >   so careful monitoring is necessary. Bweb can help a lot in
> > > > > >   these situations.
> > > > > >
> > > > > > - If you are really "in a world of trouble" and it is a
> > > > > >   performance issue, there is not much we (the Bacula project)
> > > > > >   can do for you other than the above tips. However, ($$) Bacula
> > > > > >   Systems has tools that can help more precisely identify
> > > > > >   bottlenecks and help balance loads.
> > > > > >
> > > > > > Regards,
> > > > > > Kern
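To make Kern's first two bullets concrete, both the index check and a simple form of compaction can be run from the mysql client. A rough sketch (stock table names from make_mysql_tables; the dump-and-reload route Kern describes is the more thorough option):

    SHOW INDEX FROM File;                 -- compare against make_mysql_tables
    SHOW INDEX FROM Filename;
    OPTIMIZE TABLE File, Filename, Path;  -- rebuilds tables, defragments indexes

OPTIMIZE TABLE locks each table while it runs, so schedule it for a window when no jobs are inserting attributes.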
> > > > >
> > > > > Sorry Kern, I wasn't monitoring this thread (mysql in the title :) ).
> > > > >
> > > > > I wrote these queries, so maybe I should explain (not why I'm
> > > > > right, I'm not sure of that, but why we've done it this way).
> > > > >
> > > > > A few forewords:
> > > > >
> > > > > - batch insert was created in the beginning for PostgreSQL. I
> > > > >   don't have any experience with MySQL, except knowing that at the
> > > > >   moment we programmed batch insert it didn't really do anything
> > > > >   but nested loops...
> > > > > - the main reason for batch inserts with PostgreSQL was to use
> > > > >   COPY instead of INSERT, which is much faster (we also had good
> > > > >   results with stored procedures, but batch insert was even
> > > > >   faster)
> > > > > - our current File table is nearly 200 GB (1 billion rows), with a
> > > > >   250 GB database, and we insert around 50 million records a day
> > > > >   (rough estimate)
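A side note on the COPY point: in PostgreSQL, COPY streams all rows into a table in a single command rather than issuing one INSERT per row, which is where the batch-insert speedup comes from. An illustrative sketch only (the column list is hypothetical, not Bacula's actual batch schema):

    COPY batch (FileIndex, JobId, Path, Name, LStat, MD5) FROM STDIN;

The client then feeds the attribute rows down that single open statement instead of round-tripping each one.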
> > > > > * First question I've seen in the thread:
> > > > >
> > > > > Why
> > > > >
> > > > > INSERT INTO Filename( Name )
> > > > > SELECT a.Name
> > > > > FROM (
> > > > >     SELECT DISTINCT Name
> > > > >     FROM batch
> > > > > ) AS a
> > > > > WHERE NOT EXISTS (
> > > > >     SELECT Name
> > > > >     FROM Filename AS f
> > > > >     WHERE f.Name = a.Name
> > > > > )
> > > > >
> > > > > instead of the much simpler
> > > > >
> > > > > INSERT INTO Filename(Name)
> > > > > SELECT DISTINCT Name
> > > > > FROM batch AS a
> > > > > WHERE NOT EXISTS (
> > > > >     SELECT Name
> > > > >     FROM Filename AS f
> > > > >     WHERE f.Name = a.Name
> > > > > )
> > > > >
> > > > > => Because the first one removes the duplicate rows from the batch
> > > > > table before checking whether they exist in the Filename table,
> > > > > while the second one checks after. Performance was better in
> > > > > benchmarks with the first one (with PostgreSQL, with our data).
> > > > >
> > > > > * Second question:
> > > > >
> > > > > Why not:
> > > > >
> > > > > INSERT INTO filename(name)
> > > > > SELECT DISTINCT name
> > > > > FROM batch AS a
> > > > > LEFT JOIN filename AS f USING (name)
> > > > > WHERE f.name IS NULL
> > > > >
> > > > > In fact, if the engine chooses the right execution path, the
> > > > > execution plan will be roughly the same (nested loop) as
> > > > >
> > > > > INSERT INTO Filename(Name)
> > > > > SELECT DISTINCT Name
> > > > > FROM batch AS a
> > > > > WHERE NOT EXISTS (
> > > > >     SELECT Name
> > > > >     FROM Filename AS f
> > > > >     WHERE f.Name = a.Name
> > > > > )
> > > > >
> > > > > except if batch is close in size to Filename (for instance, on my
> > > > > Bacula database, that would be a 50-million-file backup). Anyway,
> > > > > if I recall correctly about MySQL (and it hasn't changed
> > > > > recently), MySQL will do a nested loop for both, as it doesn't
> > > > > know how to merge or hash join...
> > > > >
> > > > > But I agree that this second syntax theoretically gives more
> > > > > latitude to the engine. Except that recent Oracle and PostgreSQL
> > > > > versions will anyway be able to convert between the EXISTS/NOT
> > > > > EXISTS and the join (or IN) in this case, as they are semantically
> > > > > equivalent.
> > > > >
> > > > > Now, I don't know how MySQL would react to all these queries. An
> > > > > explain plan would be great for understanding... but I would have
> > > > > thought that the NOT EXISTS query would have been great for its
> > > > > optimizer, as it tells MySQL exactly what to do :)
> > > > >
> > > > > A last note, for PostgreSQL users on this mailing list:
> > > > > You can try to get better performance with batch inserts by
> > > > > lowering seq_page_cost and random_page_cost, so that the engine
> > > > > favours a nested loop instead of a hash join for the INSERT INTO
> > > > > File query: most of the time, most of Filename and Path will be in
> > > > > the database or filesystem cache, and the plan will be much faster
> > > > > (and save you big sorts).
> > > > >
> > > > > Please tell me if you need more information.
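For anyone wanting to try that last PostgreSQL tip, the two planner settings can be changed per session (or in postgresql.conf) before the batch insert runs. A minimal sketch with illustrative values only; benchmark against your own catalog:

    SET seq_page_cost = 0.5;     -- default is 1.0
    SET random_page_cost = 2.0;  -- default is 4.0

Lowering the page costs tells the planner that most Filename and Path pages are already cached, which biases it toward the nested-loop plan Marc describes instead of a hash join with big sorts.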