> On Jan 11, 2016, at 3:03 PM, Dan Langille <[email protected]> wrote:
>
>> On Jan 11, 2016, at 8:35 AM, Eric Bollengier <[email protected]> wrote:
>>
>> Hello Dan,
>>
>>> I have a test job with 15,764,45 rows in the Batch table.
>>>
>>> With a change in SQL, I went from 45 minutes to 15 minutes.
>>>
>>> Old SQL:
>>>
>>> SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
>>>        Filename.FilenameId, batch_testing.LStat,
>>>        batch_testing.MD5, batch_testing.DeltaSeq
>>> FROM batch_testing JOIN Path ON (batch_testing.Path = Path.Path)
>>>                    JOIN Filename ON (batch_testing.Name = Filename.Name);
>>>
>>> New SQL:
>>>
>>> explain analyze
>>> INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq)
>>> SELECT BT.FileIndex,
>>>        BT.JobId,
>>>        (SELECT Path.PathId FROM Path WHERE Path.path = BT.Path),
>>>        (SELECT Filename.FilenameId FROM Filename WHERE Filename.name = BT.Name),
>>>        BT.LStat,
>>>        BT.MD5,
>>>        BT.DeltaSeq
>>> FROM batch_testing BT;
>>>
>>> NOTE: I have used the table Batch_Testing... because I was testing. :)
>>
>> Thanks for your feedback.
>>
>> The difference makes me think that it might be a cache issue. Can you
>> give us the detailed execution plan for both queries (EXPLAIN (ANALYZE,
>> BUFFERS)) and your server settings? That could probably be done with
>> auto_explain, I think.
>
> The explain analyze output is at:
> https://gist.github.com/dlangille/58a414a260f4c6706d7e
>
> I also just added postgresql.conf
I suggest we ignore what I was trying above.
I am still looking at this, mostly because of this issue:
07-Feb 18:01 crey-sd JobId 230447: Sending spooled attrs to the Director.
Despooling 422,031,119 bytes ...
08-Feb 01:31 bacula-dir JobId 230447: Bacula bacula-dir 7.4.0 (16Jan16):
FD Files Written: 1,365,074
FD Bytes Written: 332,924,576,236 (332.9 GB)
The above is the original job. The following is the copy to tape:
09-Feb 00:24 crey-sd JobId 230519: Sending spooled attrs to the Director.
Despooling 422,031,119 bytes ...
09-Feb 06:57 bacula-dir JobId 230516: Bacula bacula-dir 7.4.0 (16Jan16):
SD Files Written: 1,365,074
SD Bytes Written: 333,216,925,325 (333.2 GB)
The first job spends about 5.5 hours getting attributes into the database. The
second job spends about 6.5 hours.
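Back-of-the-envelope (my arithmetic, assuming the 5.5-hour window is mostly
database work):

    1,365,074 rows / (5.5 h x 3600 s/h) = about 69 rows/second

which is very slow for what should be a batch-style load.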
I did some testing, first by creating and populating a batch_testing table:
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
INSERT INTO batch_testing (fileindex, jobid, path, name, lstat, md5, deltaseq)
SELECT F.fileindex,
       666,
       (SELECT P.path FROM path P WHERE P.pathid = F.pathid),
       (SELECT FN.name FROM filename FN WHERE FN.filenameid = F.filenameid),
       F.lstat,
       F.md5,
       F.deltaseq
FROM file F
WHERE F.jobid = 230519;
That took about 45 seconds.
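For anyone reproducing this, a definition along these lines should work (my
sketch, inferred from the column list above; the types are guesses matched to
the File, Path, and Filename tables, not Bacula's actual batch table
definition):

CREATE TABLE batch_testing (
    fileindex INTEGER,
    jobid     INTEGER,
    path      TEXT,      -- full path string, later resolved against Path.Path
    name      TEXT,      -- file name string, later resolved against Filename.Name
    lstat     TEXT,
    md5       TEXT,
    deltaseq  SMALLINT
);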
Then I did the insert into the File table:
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq)
SELECT BT.FileIndex, BT.JobId, P.PathId, FN.FilenameId, BT.LStat, BT.MD5,
       BT.DeltaSeq
FROM batch_testing BT JOIN Path P ON (BT.Path = P.Path)
                      JOIN Filename FN ON (BT.Name = FN.Name);
That took about 122 minutes. Let's say 2 hours.
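Before going further there, it may be worth confirming that the join columns
are indexed; this standard catalog query (nothing Bacula-specific) shows what
the planner has to work with:

SELECT tablename, indexname, indexdef
FROM   pg_indexes
WHERE  tablename IN ('path', 'filename');

If Path(Path) or Filename(Name) lacks an index, both the JOIN form and the
subquery form fall back to sequential scans over those tables.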
How do we account for the other 3.5+ hours taken by the jobs? Is that the
time spent populating the Batch table?
Do we need a more efficient method for putting the data into the db?
Should I be looking at sql_batch_insert in cats/postgresql.c?
I know it's not batch_fill_path_query or batch_fill_filename_query (see
cats/sql_cmds.c). I ran tests on that. See the two photos at
https://twitter.com/DLangille/status/697920332741660675
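If it does come down to sql_batch_insert, my understanding (worth verifying
against cats/postgresql.c) is that the PostgreSQL backend fills the batch
table with COPY ... FROM STDIN rather than per-row INSERTs. Here is the same
technique sketched at the psql level, purely as an illustration (the data row
is made up):

COPY batch_testing (fileindex, jobid, path, name, lstat, md5, deltaseq)
FROM STDIN (FORMAT csv);
1,666,/etc/,passwd,lstat-placeholder,md5-placeholder,0
\.

If the server really is loading via COPY, populating the Batch table should
not be the bottleneck, which points back at the join/insert step above.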
FYI the above plans were created with this visual planner:
http://tatiyants.com/pev/#/plans
Prefix your queries with the EXPLAIN I show above, then paste the JSON output
into the page.
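To make capturing the JSON easier, something like this works in psql
(plan.json is just a scratch file name I picked for the example):

\o plan.json
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
SELECT count(*) FROM batch_testing;
\o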
re: https://twitter.com/DLangille/status/697860500386947073
I have reconfirmed that my install has batch insert enabled
(https://services.unixathome.org/poudriere/data/102amd64-default-master-list/latest-per-pkg/bacula-server-7.4.0.log)
--
Dan Langille - BSDCan / PGCon
[email protected]
