On 05/03/15 05:42, Bruce Momjian wrote:
> On Thu, Mar  5, 2015 at 01:25:13PM +0900, Fujii Masao wrote:
>>>> Yeah, it might make the situation better than today. But I'm afraid that
>>>> many users might get disappointed about that behavior of an incremental
>>>> backup after the release...
>>> I don't get what do you mean here. Can you elaborate this point?
>> The proposed version of LSN-based incremental backup has some limitations
>> (e.g., every database files need to be read even when there is no 
>> modification
>> in database since last backup, and which may make the backup time longer than
>> users expect) which may disappoint users. So I'm afraid that users who can
>> benefit from the feature might be very limited. IOW, I'm just sticking to
>> the idea of timestamp-based one :) But I should drop it if the majority in
>> the list prefers the LSN-based one even if it has such limitations.
> We need numbers on how effective each level of tracking will be.  Until
> then, the patch can't move forward.

I've written a little test script to estimate how much space can be saved by
file-level incremental backup, and I've run it on some real backups I have access to.

The script takes two basebackup directories and simulates how much data could be
saved in the second backup by an incremental backup (using file size/time and LSN).

It assumes that every file in base, global and pg_tblspc which matches both 
size and modification time will also match from the LSN point of view.
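That assumption could also be checked directly: every PostgreSQL page stores in the
first 8 bytes of its header (pd_lsn) the LSN of the last WAL record that touched it,
so a stricter comparison could scan each main-fork file and take the highest page
LSN. A minimal sketch of that idea (function names are mine; it assumes the default
8kB block size and a little-endian server, as on x86):

```python
import struct

PAGE_SIZE = 8192  # default PostgreSQL block size (BLCKSZ)

def page_lsn(page):
    """Extract pd_lsn from a page header.

    pd_lsn is stored as two 32-bit halves (xlogid, xrecoff);
    here we assume the server's native byte order is little-endian.
    """
    xlogid, xrecoff = struct.unpack_from('<II', page, 0)
    return (xlogid << 32) | xrecoff

def max_lsn(segment_path):
    """Return the highest page LSN found in a relation segment file."""
    highest = 0
    with open(segment_path, 'rb') as f:
        while True:
            page = f.read(PAGE_SIZE)
            if len(page) < PAGE_SIZE:
                break
            highest = max(highest, page_lsn(page))
    return highest
```

A file whose max_lsn is older than the start LSN of the previous backup has not
been modified since, regardless of its filesystem timestamps.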

The result is that many databases can take advantage of incremental backup, even if
the saving is not huge, and taking LSNs into account yields a result almost identical
to the approach based on filesystem metadata alone.

== Very big geographic database (similar to openstreetmap main DB), it contains 
versioned data, interval two months 

First backup size: 13286623850656 (12.1TiB)
Second backup size: 13323511925626 (12.1TiB)
Matching files count: 17094
Matching LSN count: 14580
Matching files size: 9129755116499 (8.3TiB, 68.5%)
Matching LSN size: 9128568799332 (8.3TiB, 68.5%)

== Big on-line store database, old data regularly moved to historic partitions, 
interval one day

First backup size: 1355285058842 (1.2TiB)
Second backup size: 1358389467239 (1.2TiB)
Matching files count: 3937
Matching LSN count: 2821
Matching files size: 762292960220 (709.9GiB, 56.1%)
Matching LSN size: 762122543668 (709.8GiB, 56.1%)

== Ticketing system database, interval one day

First backup size: 144988275 (138.3MiB)
Second backup size: 146135155 (139.4MiB)
Matching files count: 3124
Matching LSN count: 2641
Matching files size: 76908986 (73.3MiB, 52.6%)
Matching LSN size: 67747928 (64.6MiB, 46.4%)

== Online store, interval one day

First backup size: 20418561133 (19.0GiB)
Second backup size: 20475290733 (19.1GiB)
Matching files count: 5744
Matching LSN count: 4302
Matching files size: 4432709876 (4.1GiB, 21.6%)
Matching LSN size: 4388993884 (4.1GiB, 21.4%)

== Heavily updated database, interval one week

First backup size: 3203198962 (3.0GiB)
Second backup size: 3222409202 (3.0GiB)
Matching files count: 1801
Matching LSN count: 1273
Matching files size: 91206317 (87.0MiB, 2.8%)
Matching LSN size: 69083532 (65.9MiB, 2.1%)


Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support |
#!/usr/bin/env python
from __future__ import print_function

import collections
from optparse import OptionParser
import os
import sys

__author__ = 'Marco Nenciarini <>'

usage = "usage: %prog [options] backup_1 backup_2"
parser = OptionParser(usage=usage)
(options, args) = parser.parse_args()

# need exactly 2 directories
if len(args) != 2:
    parser.error("you must specify two backup directories")
FileItem = collections.namedtuple('FileItem', 'size time path')

def get_files(target_dir):
    """Return a set of FileItem"""
    files = set()
    for dir_path, _, file_names in os.walk(target_dir, followlinks=True):
        for filename in file_names:
            path = os.path.join(dir_path, filename)
            rel_path = path[len(target_dir) + 1:]
            stats = os.stat(path)
            files.add(FileItem(stats.st_size, stats.st_mtime, rel_path))
    return files

def size_fmt(num, suffix='B'):
    """Format a size"""
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

def percent_fmt(a, b):
    """Format a percent"""
    return "%.1f%%" % (100.0*a/b)

def get_size(file_set):
    """Total size in bytes of a set of FileItem"""
    return sum(item.size for item in file_set)

def report(a, b):
    # find files that are identical (same size and same modification time)
    common = a & b

    # for the LSN-based count keep only main forks inside the data
    # directories, as only those are compared by page LSN
    common_lsn = set()
    for item in common:
        # skip non-main forks
        if any(suffix in item.path
               for suffix in ('_fsm', '_vm', '_init')):
            continue
        # skip things outside the data directories
        if not any(item.path.startswith(prefix)
                   for prefix in ('base', 'global', 'pg_tblspc')):
            continue
        common_lsn.add(item)
    a_size = get_size(a)
    b_size = get_size(b)
    fs_based_size = get_size(common)
    lsn_based_size = get_size(common_lsn)

    print("First backup size: %d (%s)" % (a_size, size_fmt(a_size)))
    print("Second backup size: %d (%s)" % (b_size, size_fmt(b_size)))
    print("Matching files count: %d" % (len(common)))
    print("Matching LSN count: %d" % (len(common_lsn)))
    print("Matching files size: %d (%s, %s)" % (
        fs_based_size, size_fmt(fs_based_size),
        percent_fmt(fs_based_size, b_size)))
    print("Matching LSN size: %d (%s, %s)" % (
        lsn_based_size, size_fmt(lsn_based_size),
        percent_fmt(lsn_based_size, b_size)))

a = get_files(args[0])
b = get_files(args[1])
report(a, b)
