On Tuesday 09 September 2008 23:15:45, Bill Moran wrote:
> In response to "Yuri Timofeev" <[EMAIL PROTECTED]>:
>
> > 2008/9/9 Bill Moran <[EMAIL PROTECTED]>:
> > > In response to Bob Hetzel <[EMAIL PROTECTED]>:
> > > Additionally, doing that rewrite so that it results in improvements
> > > to all SQL platforms is probably going to require that the code be
> > > broken into SQL-dependent and SQL-independent sections.
> >
> > It may be that we have to do it in the critical sections.
>
> Agreed.  This was demonstrated last year by some massive speed
> improvements in the way that job records are saved.
>
> > The problem is that different DBMSs have different optimizers.
> > For example, in PostgreSQL (during my tests) EXPLAIN does not show
> > that indices will be used:
> >
> > bacula=# EXPLAIN SELECT File.FileId, Job.JobId FROM File LEFT OUTER JOIN
> > Job ON (File.JobId = Job.JobId) WHERE Job.JobId IS NULL LIMIT 300000;
> >
> >                                QUERY PLAN
> > ---------------------------------------------------------------------------
> >  Limit  (cost=78.86..24169.76 rows=300000 width=8)
> >    ->  Hash Left Join  (cost=78.86..181966.90 rows=2265021 width=8)
> >          Hash Cond: (file.jobid = job.jobid)
> >          Filter: (job.jobid IS NULL)
> >          ->  Seq Scan on file  (cost=0.00..113937.42 rows=4530042 width=8)
> >          ->  Hash  (cost=58.38..58.38 rows=1638 width=4)
> >                ->  Seq Scan on job  (cost=0.00..58.38 rows=1638 width=4)
> > (7 rows)
> >
> > However, PostgreSQL worked quickly.
>
> That's because using indexes on that query is a horrifically bad idea:
>
> bacula=# EXPLAIN SELECT File.FileId, Job.JobId FROM File LEFT OUTER JOIN Job ON
> bacula-# (File.JobId = Job.JobId) WHERE Job.JobId IS NULL LIMIT 300000;
>                                QUERY PLAN
> -----------------------------------------------------------------------------
>  Limit  (cost=53.15..14650.94 rows=300000 width=8)
>    ->  Hash Left Join  (cost=53.15..2115902.45 rows=43482932 width=8)
>          Hash Cond: ("outer".jobid = "inner".jobid)
>          Filter: ("inner".jobid IS NULL)
>          ->  Seq Scan on file  (cost=0.00..1463605.32 rows=43482932 width=8)
>          ->  Hash  (cost=50.52..50.52 rows=1052 width=4)
>                ->  Seq Scan on job  (cost=0.00..50.52 rows=1052 width=4)
> (7 rows)
>
> bacula=# set enable_seqscan=off;
> SET
> Time: 0.210 ms
> bacula=# EXPLAIN SELECT File.FileId, Job.JobId FROM File LEFT OUTER JOIN Job ON
> bacula-# (File.JobId = Job.JobId) WHERE Job.JobId IS NULL LIMIT 300000;
>                                QUERY PLAN
> ------------------------------------------------------------------------------
>  Limit  (cost=0.00..1042544.60 rows=300000 width=8)
>    ->  Merge Right Join  (cost=0.00..151109653.30 rows=43482932 width=8)
>          Merge Cond: ("outer".jobid = "inner".jobid)
>          Filter: ("outer".jobid IS NULL)
>          ->  Index Scan using job_pkey on job  (cost=0.00..175.80 rows=1052 width=4)
>          ->  Index Scan using file_jobid_idx on file  (cost=0.00..150565938.23 rows=43482932 width=8)
> (6 rows)
>
> Notice that PostgreSQL expects the index scan on file_jobid_idx to take
> 150,565,938 operations alone, whereas the entire sequential scan only
> takes 1,463,605.  The seq scans are a clear winner in this case.
>
> > MySQL (theoretically) should use an index on multiple columns, which
> > already exists in make_mysql_tables.in:
> >
> >   CREATE TABLE File (...
> >     INDEX (JobId, PathId, FilenameId)
> >
> > but MySQL worked slowly.
>
> I don't know much about the innards of MySQL's query planner, but it
> doesn't seem to be as robust as Postgres'.  Index scanning is usually
> the best way to go, but (as shown above) not always.
> I have trouble believing that the MySQL developers always use available
> indexes, as that would create a laundry list of corner cases where
> queries perform horribly, but it seems that their planning logic fails
> in this particular case.
>
> It's not unusual.  If you mess with Postgres long enough, you'll find
> odd queries that perform poorly even though there's a much faster way
> to do it.  And that's the rub.  If I were writing a system purely for
> PostgreSQL, I could reorder joins or make other tweaks to the query
> until it ran quickly, but then I might have a query that ran like
> molasses on MySQL.  The inverse is true as well.
>
> My point is that such work is definitely warranted in the director; I'm
> not sure it's warranted in something like dbcheck.  If you've already
> got it done, there's no reason to waste it, but suggesting that someone
> review dbcheck and optimize the SQL is like asking me to clean toilets
> with a toothbrush -- it's a useful exercise to teach privates not to
> mouth off to sergeants, but it's not a particularly efficient way to get
> the toilet clean.
>
> I expect that, on Postgres, the fastest way to delete orphaned records
> would be:
>
>   delete from filename
>   where filename.filenameid not in (select filenameid from file);
>
> Assuming that is fast, would such a query even work on MySQL?
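It should run (MySQL has had subqueries since 4.1), but in my experience
the MySQL versions in use today execute NOT IN (SELECT ...) as a
dependent subquery, re-probing File once per Filename row.  I haven't
benchmarked it, but the usual MySQL-friendly rewrite is the equivalent
anti-join.  A sketch only, using the standard Bacula table names; note
this is MySQL's multi-table DELETE syntax, which PostgreSQL does not
accept, so it would have to live in an SQL-dependent section:

  -- Delete Filename rows that no File row references.
  -- The LEFT JOIN / IS NULL form lets MySQL drive a nested loop
  -- off an index instead of re-running a subquery per row.
  DELETE Filename
    FROM Filename
    LEFT JOIN File ON File.FilenameId = Filename.FilenameId
   WHERE File.FilenameId IS NULL;

(The NOT IN form also silently matches nothing if the subquery can
return a NULL, though File.FilenameId should never be NULL in the
Bacula schema.)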
We (essentially Marc Cousin) have already done some optimizations for
PostgreSQL; you will find a dbcheck version in pure SQL in
trunk/bacula/examples/database/dbcheck.sql.  It's *pretty* fast with
PostgreSQL because the engine is able to do a merge or hash join.  MySQL
uses a nested loop (a loop within a loop) instead, and so needs indexes.

Bye
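P.S.  Since MySQL falls back to nested loops, each probe into File has
to be an index lookup to be bearable.  The stock MySQL schema indexes
File on (JobId, PathId, FilenameId), and a lookup by FilenameId alone
cannot use that compound index, because FilenameId is not its leading
column.  An untested sketch (the index name here is made up): build a
temporary index for the duration of the orphan checks, then drop it.

  -- Throwaway index so each FilenameId probe is an index lookup
  -- instead of a full scan of File; drop it afterwards so it does
  -- not slow down backup insertions.
  CREATE INDEX tmp_file_fnid_idx ON File (FilenameId);
  -- ... run dbcheck / the orphan deletes here ...
  DROP INDEX tmp_file_fnid_idx ON File;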