Re: [Bacula-users] Catalogue snapshot utility : any interest?

2010-10-04 Thread Rory Campbell-Lange
On 04/10/10, James Harper (james.har...@bendigoit.com.au) wrote:
  A full pg_dump of the catalogue is 2.8G. The output of the catalogue
  snapshot for job 60 is 1.6G. Naturally, the full pg_dump of the
  whole database will continue to grow over time.

  I'm a little surprised that the proportion of job 60 to the whole is
  so high. Job 60 is similar to job 1, but I don't expect they share
  much information. I'll have to look into that.
 
 If jobid 60 and job 1 were the same backup job then a lot of the
 information may be shared in the filename table. Even if they are
 backups of similar servers then they will share a lot of filename data
 and that filename data has to come with the extracted catalogue so you
 might not be saving that much.

My backups are all full backups. Also, the key file table in postgres (which
joins files and paths) is job specific, so I'm not sure where any
duplication is emanating from.

Regards
Rory

 Table public.file
   Column   |  Type   | Modifiers
------------+---------+-------------------------------------------------------
 fileid     | bigint  | not null default nextval('file_fileid_seq'::regclass)
 fileindex  | integer | not null default 0
 jobid      | integer | not null
 pathid     | integer | not null
 filenameid | integer | not null
 markid     | integer | not null default 0
 lstat      | text    | not null
 md5        | text    | not null
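
One way to check where any duplication comes from is to count the filenameid
values that two jobs have in common. The sketch below is only an illustration:
it runs against a toy in-memory SQLite table that borrows the jobid, pathid and
filenameid columns from the listing above, and the helper name
shared_filenames is made up. Against the real catalogue the same INTERSECT
query should work in PostgreSQL, with the placeholder style of whichever
driver is used.

```python
import sqlite3

# Count the distinct filenameid values referenced by both jobs.
OVERLAP_SQL = """
SELECT count(*) FROM (
    SELECT filenameid FROM file WHERE jobid = ?
    INTERSECT
    SELECT filenameid FROM file WHERE jobid = ?
) AS shared
"""

def shared_filenames(con, job_a, job_b):
    """Return how many filename ids appear in both job_a and job_b."""
    return con.execute(OVERLAP_SQL, (job_a, job_b)).fetchone()[0]

# Toy data (not from the real catalogue): the jobs share filenameid 10 only.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE file (fileid INTEGER PRIMARY KEY, jobid INTEGER, "
    "pathid INTEGER, filenameid INTEGER)"
)
con.executemany(
    "INSERT INTO file (jobid, pathid, filenameid) VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 1, 11), (60, 2, 10), (60, 2, 12)],
)
print(shared_filenames(con, 1, 60))
```

A high count relative to the number of distinct filenames in job 60 would
support the shared-filename explanation; a low one would point elsewhere.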


-- 
Rory Campbell-Lange
r...@campbell-lange.net

Campbell-Lange Workshop
www.campbell-lange.net
0207 6311 555
3 Tottenham Street London W1T 2AF
Registered in England No. 04551928

--
Virtualization is moving to the mainstream and overtaking non-virtualized
environment for deploying applications. Does it make network security 
easier or more difficult to achieve? Read this whitepaper to separate the 
two and get a better understanding.
http://p.sf.net/sfu/hp-phase2-d2d
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] Catalogue snapshot utility : any interest?

2010-10-04 Thread James Harper
 On 04/10/10, James Harper (james.har...@bendigoit.com.au) wrote:
  
   I have developed a catalogue snapshot facility in python to snapshot
   one job's catalogue and dump it to disk.
 ...
  How much smaller is the catalogue subset vs the full catalogue?
 
 Good question.
 
 I'm not able to answer that question fully at present as I don't have
 enough jobs in my current database to know.
 
 My current database has the following jobs in it:
 
  jobid | jobfiles | jobgigs
 -------+----------+---------
      1 |  7706717 | 6833.90
      8 |  3965507 | 4480.83
      9 |  1273459 |  129.87
     50 |   646336 |  512.07
     60 |  7845561 | 6990.67
 
 A full pg_dump of the catalogue is 2.8G. The output of the catalogue
 snapshot for job 60 is 1.6G. Naturally, the full pg_dump of the whole
 database will continue to grow over time.
 
 (The job 60 catalogue file compresses to about 300MB with bzip2 -9).
 
 I'm a little surprised that the proportion of job 60 to the whole is so
 high. Job 60 is similar to job 1, but I don't expect they share much
 information. I'll have to look into that.
 

If jobid 60 and job 1 were the same backup job then a lot of the
information may be shared in the filename table. Even if they are
backups of similar servers then they will share a lot of filename data
and that filename data has to come with the extracted catalogue so you
might not be saving that much.

James




Re: [Bacula-users] Catalogue snapshot utility : any interest?

2010-10-04 Thread Phil Stracchino
On 10/04/10 07:22, Rory Campbell-Lange wrote:
 I have developed a catalogue snapshot facility in python to snapshot one
 job's catalogue and dump it to disk. The snapshot provides a bacula
 database schema file, a database dump of the job's data, and a file
 listing of files showing info such as the tape number, path, file, md5
 and lstat.
 
 We intend to include the catalogue in compressed format on CDs
 accompanying tape sets to help our clients retrieve data in future if
 required.
 
 At present the system works only for Postgresql, and for our setup which
 has the director, storage and file daemons on the same Linux server.
 
 How it works:
 
 * A temporary schema is made in postgres, named job_%d % (jobid)
 * Relevant data is selected from the public schema to the temporary
   schema
 * The file listing is output
 * The public schema is dumped
 * The temporary schema is dumped
 * The temporary schema is removed
 
 I'm considering making an sqlite database from the temporary schema to
 obviate the need for the public schema file and file listing.
 
 This is fairly simple stuff, but if this functionality is useful to you,
 do let me know and I can share the programme with you.


This sounds like a useful tool for any Bacula site that's managing
Bacula backups for a large number of clients.


-- 
  Phil Stracchino, CDK#2 DoD#299792458 ICBM: 43.5607, -71.355
  ala...@caerllewys.net   ala...@metrocast.net   p...@co.ordinate.org
 Renaissance Man, Unix ronin, Perl hacker, Free Stater
 It's not the years, it's the mileage.



Re: [Bacula-users] Catalogue snapshot utility : any interest?

2010-10-04 Thread Rory Campbell-Lange
On 04/10/10, James Harper (james.har...@bendigoit.com.au) wrote:
  
  I have developed a catalogue snapshot facility in python to snapshot
  one job's catalogue and dump it to disk. 
...
 How much smaller is the catalogue subset vs the full catalogue?

Good question.

I'm not able to answer that question fully at present as I don't have enough
jobs in my current database to know.

My current database has the following jobs in it:

 jobid | jobfiles | jobgigs
-------+----------+---------
     1 |  7706717 | 6833.90
     8 |  3965507 | 4480.83
     9 |  1273459 |  129.87
    50 |   646336 |  512.07
    60 |  7845561 | 6990.67

A full pg_dump of the catalogue is 2.8G. The output of the catalogue snapshot
for job 60 is 1.6G. Naturally, the full pg_dump of the whole database will
continue to grow over time.

(The job 60 catalogue file compresses to about 300MB with bzip2 -9).

I'm a little surprised that the proportion of job 60 to the whole is so high.
Job 60 is similar to job 1, but I don't expect they share much information.
I'll have to look into that.
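
To put rough numbers on that, the short calculation below compares job 60's
share of the file records with the snapshot's share of the dump size, using
only the figures quoted above. It is a back-of-envelope check, assuming that
dump size scales roughly with the number of file rows, which ignores the
path, filename and other tables.

```python
# Figures taken from the job table and dump sizes quoted above.
jobfiles = {1: 7706717, 8: 3965507, 9: 1273459, 50: 646336, 60: 7845561}

total_files = sum(jobfiles.values())
file_share = jobfiles[60] / total_files   # job 60's share of all file rows
size_share = 1.6 / 2.8                    # snapshot size over full pg_dump size

print("job 60 file records: %.1f%% of the total" % (100 * file_share))
print("job 60 snapshot:     %.1f%% of the full dump" % (100 * size_share))
```

Job 60 accounts for about 36.6% of the file records but about 57.1% of the
dump size, so something other than its own file rows is adding to the
snapshot.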

Regards
Rory

-- 
Rory Campbell-Lange
r...@campbell-lange.net

Campbell-Lange Workshop
www.campbell-lange.net
0207 6311 555
3 Tottenham Street London W1T 2AF
Registered in England No. 04551928



[Bacula-users] Catalogue snapshot utility : any interest?

2010-10-04 Thread Rory Campbell-Lange
I have developed a catalogue snapshot facility in python to snapshot one
job's catalogue and dump it to disk. The snapshot provides a bacula
database schema file, a database dump of the job's data, and a file
listing of files showing info such as the tape number, path, file, md5
and lstat.

We intend to include the catalogue in compressed format on CDs
accompanying tape sets to help our clients retrieve data in future if
required.

At present the system works only for Postgresql, and for our setup which
has the director, storage and file daemons on the same Linux server.

How it works:

* A temporary schema is made in postgres, named job_%d % (jobid)
* Relevant data is selected from the public schema to the temporary
  schema
* The file listing is output
* The public schema is dumped
* The temporary schema is dumped
* The temporary schema is removed
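
The steps above could be sketched roughly as follows. This is not the actual
script, only an illustration of the sequence: the function name
snapshot_plan, the database name "bacula", the output file names and the
choice of tables (job, file, path, filename, with path.path and filename.name
columns) are all assumptions about a typical Bacula PostgreSQL catalogue. The
tape number is left out of the listing query because it would need a further
join. The function only builds the statements and pg_dump command lines; it
does not run anything.

```python
# Hypothetical sketch: builds SQL and pg_dump commands for one job without
# executing them. Table, column and file names are assumptions.

def snapshot_plan(jobid, dbname="bacula"):
    """Return (sql, listing_query, dump_commands, cleanup) for one job."""
    jobid = int(jobid)          # reject anything that is not an integer
    schema = "job_%d" % jobid   # naming rule from the steps above

    def q(template):
        return template.format(s=schema, j=jobid)

    sql = [
        q("CREATE SCHEMA {s}"),
        # Tables that carry a jobid are filtered on it directly.
        q("CREATE TABLE {s}.job AS SELECT * FROM public.job WHERE jobid = {j}"),
        q("CREATE TABLE {s}.file AS SELECT * FROM public.file WHERE jobid = {j}"),
        # Lookup tables are cut down to the rows this job's files reference.
        q("CREATE TABLE {s}.path AS SELECT * FROM public.path "
          "WHERE pathid IN (SELECT pathid FROM {s}.file)"),
        q("CREATE TABLE {s}.filename AS SELECT * FROM public.filename "
          "WHERE filenameid IN (SELECT filenameid FROM {s}.file)"),
    ]
    # Query behind the plain-text file listing.
    listing_query = q(
        "SELECT p.path, n.name, f.md5, f.lstat FROM {s}.file f "
        "JOIN {s}.path p ON p.pathid = f.pathid "
        "JOIN {s}.filename n ON n.filenameid = f.filenameid "
        "ORDER BY f.fileindex"
    )
    dump_commands = [
        # Structure of the public schema only, no data.
        ["pg_dump", "--schema-only", "--schema=public",
         "--file=public_schema.sql", dbname],
        # Structure and data of the temporary per-job schema.
        ["pg_dump", "--schema=" + schema, "--file=" + schema + ".sql", dbname],
    ]
    cleanup = q("DROP SCHEMA {s} CASCADE")
    return sql, listing_query, dump_commands, cleanup


sql, listing_query, dump_commands, cleanup = snapshot_plan(60)
for statement in sql:
    print(statement)
```

Keeping the plan separate from its execution makes it easy to review the
statements before pointing them at a database, which matters given that the
temporary schema is created inside the live catalogue.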

I'm considering making an sqlite database from the temporary schema to
obviate the need for the public schema file and file listing.
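
As a rough idea of what the sqlite variant might look like, the sketch below
writes one job's rows into a standalone SQLite file using Python's sqlite3
module. The function name write_sqlite_snapshot and the three-table layout
are assumptions; the file columns follow the PostgreSQL file table, and the
path and filename tables are assumed to hold (pathid, path) and
(filenameid, name).

```python
import sqlite3

def write_sqlite_snapshot(db_path, files, paths, filenames):
    """Store one job's catalogue rows in a standalone SQLite database.

    files:     (fileid, fileindex, jobid, pathid, filenameid, markid, lstat, md5)
    paths:     (pathid, path)
    filenames: (filenameid, name)
    """
    con = sqlite3.connect(db_path)
    with con:  # commits on success, rolls back on error
        con.execute("CREATE TABLE path (pathid INTEGER PRIMARY KEY, "
                    "path TEXT NOT NULL)")
        con.execute("CREATE TABLE filename (filenameid INTEGER PRIMARY KEY, "
                    "name TEXT NOT NULL)")
        con.execute("CREATE TABLE file (fileid INTEGER PRIMARY KEY, "
                    "fileindex INTEGER NOT NULL, jobid INTEGER NOT NULL, "
                    "pathid INTEGER NOT NULL, filenameid INTEGER NOT NULL, "
                    "markid INTEGER NOT NULL, lstat TEXT NOT NULL, "
                    "md5 TEXT NOT NULL)")
        con.executemany("INSERT INTO path VALUES (?, ?)", paths)
        con.executemany("INSERT INTO filename VALUES (?, ?)", filenames)
        con.executemany("INSERT INTO file VALUES (?, ?, ?, ?, ?, ?, ?, ?)", files)
    return con

# Toy rows for illustration only.
con = write_sqlite_snapshot(
    ":memory:",
    files=[(1, 1, 60, 1, 1, 0, "lstat-data", "md5-data")],
    paths=[(1, "/etc/")],
    filenames=[(1, "hosts")],
)
rows = con.execute(
    "SELECT p.path || n.name, f.md5 FROM file f "
    "JOIN path p ON p.pathid = f.pathid "
    "JOIN filename n ON n.filenameid = f.filenameid"
).fetchall()
print(rows)
```

A single file like this could be queried on any machine with sqlite
installed, without restoring a PostgreSQL dump first.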

This is fairly simple stuff, but if this functionality is useful to you,
do let me know and I can share the programme with you.

Regards
Rory

-- 
Rory Campbell-Lange
r...@campbell-lange.net

Campbell-Lange Workshop
www.campbell-lange.net
0207 6311 555
3 Tottenham Street London W1T 2AF
Registered in England No. 04551928



Re: [Bacula-users] Catalogue snapshot utility : any interest?

2010-10-04 Thread Rory Campbell-Lange
On 04/10/10, Phil Stracchino (ala...@metrocast.net) wrote:
 On 10/04/10 07:22, Rory Campbell-Lange wrote:
  I have developed a catalogue snapshot facility in python to snapshot one
  job's catalogue and dump it to disk. The snapshot provides a bacula
  database schema file, a database dump of the job's data, and a file
  listing of files showing info such as the tape number, path, file, md5
  and lstat.
...
  This is fairly simple stuff, but if this functionality is useful to you,
  do let me know and I can share the programme with you.
 
 This sounds like a useful tool for any Bacula site that's managing
 Bacula backups for a large number of clients.

Hi Phil

I'd be delighted if you could take a look at the python script and let me
have your comments.

It is part of the small .tgz archive here:
http://campbell-lange.net/media/files/bacula_tools_01.tgz

Please **do not** run it on a production Postgresql database.

Note that big backups (one with more than 7 million files, say) may take
up to 45 minutes to process.

If you are able to get the system to operate and you think it is useful
I'll stick the script on Bitbucket.

Regards
Rory

-- 
Rory Campbell-Lange
r...@campbell-lange.net

Campbell-Lange Workshop
www.campbell-lange.net
0207 6311 555
3 Tottenham Street London W1T 2AF
Registered in England No. 04551928



Re: [Bacula-users] Catalogue snapshot utility : any interest?

2010-10-04 Thread Phil Stracchino
On 10/04/10 08:01, Rory Campbell-Lange wrote:
 Hi Phil
 
 I'd be delighted if you could take a look at the python script and let me
 have your comments.

I really can't help with testing it, sorry.  I don't run PostgreSQL and
don't speak Python.  ;)



-- 
  Phil Stracchino, CDK#2 DoD#299792458 ICBM: 43.5607, -71.355
  ala...@caerllewys.net   ala...@metrocast.net   p...@co.ordinate.org
 Renaissance Man, Unix ronin, Perl hacker, Free Stater
 It's not the years, it's the mileage.



Re: [Bacula-users] Catalogue snapshot utility : any interest?

2010-10-04 Thread James Harper
 
 I have developed a catalogue snapshot facility in python to snapshot one
 job's catalogue and dump it to disk. The snapshot provides a bacula
 database schema file, a database dump of the job's data, and a file
 listing of files showing info such as the tape number, path, file, md5
 and lstat.
 
 We intend to include the catalogue in compressed format on CDs
 accompanying tape sets to help our clients retrieve data in future if
 required.
 
 At present the system works only for Postgresql, and for our setup which
 has the director, storage and file daemons on the same Linux server.
 
 How it works:
 
 * A temporary schema is made in postgres, named job_%d % (jobid)
 * Relevant data is selected from the public schema to the temporary
   schema
 * The file listing is output
 * The public schema is dumped
 * The temporary schema is dumped
 * The temporary schema is removed
 
 I'm considering making an sqlite database from the temporary schema to
 obviate the need for the public schema file and file listing.
 
 This is fairly simple stuff, but if this functionality is useful to you,
 do let me know and I can share the programme with you.
 

How much smaller is the catalogue subset vs the full catalogue?

James
