On 05/03/15 05:42, Bruce Momjian wrote:
> On Thu, Mar  5, 2015 at 01:25:13PM +0900, Fujii Masao wrote:
>>>> Yeah, it might make the situation better than today. But I'm afraid that
>>>> many users might get disappointed about that behavior of an incremental
>>>> backup after the release...
>>> I don't get what do you mean here. Can you elaborate this point?
>> The proposed version of LSN-based incremental backup has some limitations
>> (e.g., every database files need to be read even when there is no 
>> modification
>> in database since last backup, and which may make the backup time longer than
>> users expect) which may disappoint users. So I'm afraid that users who can
>> benefit from the feature might be very limited. IOW, I'm just sticking to
>> the idea of timestamp-based one :) But I should drop it if the majority in
>> the list prefers the LSN-based one even if it has such limitations.
> We need numbers on how effective each level of tracking will be.  Until
> then, the patch can't move forward.

I've written a little test script to estimate how much space can be saved by
file-level incremental backup, and I've run it on some real backups I have access to.

The script takes two basebackup directories and simulates how much data could be
saved in the second backup by an incremental backup (using file size/time and LSN).

It assumes that every file in base, global and pg_tblspc which matches both 
size and modification time will also match from the LSN point of view.
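That assumption could also be checked directly: every PostgreSQL page stores in the
first 8 bytes of its header (pd_lsn) the LSN of the last WAL record that touched it,
so a stricter comparison could scan each main-fork file and take the highest page
LSN. A minimal sketch of that idea (function names are mine; it assumes the default
8kB block size and a little-endian server, as on x86):

```python
import struct

PAGE_SIZE = 8192  # default PostgreSQL block size (BLCKSZ)

def page_lsn(page):
    """Extract pd_lsn from a page header.

    pd_lsn is stored as two 32-bit halves (xlogid, xrecoff);
    here we assume the server's native byte order is little-endian.
    """
    xlogid, xrecoff = struct.unpack_from('<II', page, 0)
    return (xlogid << 32) | xrecoff

def max_lsn(segment_path):
    """Return the highest page LSN found in a relation segment file."""
    highest = 0
    with open(segment_path, 'rb') as f:
        while True:
            page = f.read(PAGE_SIZE)
            if len(page) < PAGE_SIZE:
                break
            highest = max(highest, page_lsn(page))
    return highest
```

A file whose max_lsn is older than the start LSN of the previous backup has not
been modified since, regardless of its filesystem timestamps.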

The result is that many databases can take advantage of incremental backup, even if
the saving is not huge, and taking LSNs into account yields a result almost identical
to the approach based on filesystem metadata alone.

== Very big geographic database (similar to openstreetmap main DB), it contains 
versioned data, interval two months 

First backup size: 13286623850656 (12.1TiB)
Second backup size: 13323511925626 (12.1TiB)
Matching files count: 17094
Matching LSN count: 14580
Matching files size: 9129755116499 (8.3TiB, 68.5%)
Matching LSN size: 9128568799332 (8.3TiB, 68.5%)

== Big on-line store database, old data regularly moved to historic partitions, 
interval one day

First backup size: 1355285058842 (1.2TiB)
Second backup size: 1358389467239 (1.2TiB)
Matching files count: 3937
Matching LSN count: 2821
Matching files size: 762292960220 (709.9GiB, 56.1%)
Matching LSN size: 762122543668 (709.8GiB, 56.1%)

== Ticketing system database, interval one day

First backup size: 144988275 (138.3MiB)
Second backup size: 146135155 (139.4MiB)
Matching files count: 3124
Matching LSN count: 2641
Matching files size: 76908986 (73.3MiB, 52.6%)
Matching LSN size: 67747928 (64.6MiB, 46.4%)

== Online store, interval one day

First backup size: 20418561133 (19.0GiB)
Second backup size: 20475290733 (19.1GiB)
Matching files count: 5744
Matching LSN count: 4302
Matching files size: 4432709876 (4.1GiB, 21.6%)
Matching LSN size: 4388993884 (4.1GiB, 21.4%)

== Heavily updated database, interval one week

First backup size: 3203198962 (3.0GiB)
Second backup size: 3222409202 (3.0GiB)
Matching files count: 1801
Matching LSN count: 1273
Matching files size: 91206317 (87.0MiB, 2.8%)
Matching LSN size: 69083532 (65.9MiB, 2.1%)


Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support |
#!/usr/bin/env python
from __future__ import print_function

import collections
from optparse import OptionParser
import os
import sys

__author__ = 'Marco Nenciarini <>'

usage = "usage: %prog [options] backup_1 backup_2"
parser = OptionParser(usage=usage)
(options, args) = parser.parse_args()

# need exactly 2 directories
if len(args) != 2:
    parser.error("you must specify two backup directories")
FileItem = collections.namedtuple('FileItem', 'size time path')

def get_files(target_dir):
    """Return a set of FileItem"""
    files = set()
    for dir_path, _, file_names in os.walk(target_dir, followlinks=True):
        for filename in file_names:
            path = os.path.join(dir_path, filename)
            rel_path = path[len(target_dir) + 1:]
            stats = os.stat(path)
            files.add(FileItem(stats.st_size, stats.st_mtime, rel_path))
    return files

def size_fmt(num, suffix='B'):
    """Format a size"""
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

def percent_fmt(a, b):
    """Format a percent"""
    return "%.1f%%" % (100.0*a/b)

def get_size(file_set):
    """Total size in bytes of a set of FileItem"""
    return sum(item.size for item in file_set)

def report(a, b):
    # find files that are identical (same size and same modification time)
    common = a & b

    # for the LSN-based count keep only main forks inside the data
    # directories, as only those are compared by page LSN
    common_lsn = set()
    for item in common:
        # skip non-main forks
        if any(suffix in item.path
               for suffix in ('_fsm', '_vm', '_init')):
            continue
        # skip things outside the data directories
        if not any(item.path.startswith(prefix)
                   for prefix in ('base', 'global', 'pg_tblspc')):
            continue
        common_lsn.add(item)
    a_size = get_size(a)
    b_size = get_size(b)
    fs_based_size = get_size(common)
    lsn_based_size = get_size(common_lsn)

    print("First backup size: %d (%s)" % (a_size, size_fmt(a_size)))
    print("Second backup size: %d (%s)" % (b_size, size_fmt(b_size)))
    print("Matching files count: %d" % (len(common)))
    print("Matching LSN count: %d" % (len(common_lsn)))
    print("Matching files size: %d (%s, %s)" % (
        fs_based_size, size_fmt(fs_based_size),
        percent_fmt(fs_based_size, b_size)))
    print("Matching LSN size: %d (%s, %s)" % (
        lsn_based_size, size_fmt(lsn_based_size),
        percent_fmt(lsn_based_size, b_size)))

a = get_files(args[0])
b = get_files(args[1])
report(a, b)
