Hi,

On Thu, May 07, 2026 at 01:32:35PM +0200, Hans wrote:
> I would like to tell, why I asked.

It is usually a good idea to do this from the start, because it may be
that your use case is solvable in other ways.

This time though, the easier solutions have major deficiencies while the
better solutions are really complex and/or expensive.

TL;DR: If you can put in time but not much money, I recommend focusing
on backups, config management, and use of both to get a good mean time
to recovery.

> So my solution would be, having drive images, restore them onto a new server 
> and of we go.

Making a "golden image" of a server after it is installed, by booting
into a live environment or using some other way to read the disk without
having it mounted, is a tried and tested, decades-old way of being able
to quickly return a server to service.

The hard part is making up to date images of servers while they are
operational in order to capture application data. It's understandable
why you desire to make such an image while the server is running. People
in this thread have correctly explained to you why that's not possible.

> But as I do not know, if this is possible at all (due to possible changes 
> while the drives are mounted), and could not find a way searching the web, I 
> allowed me, to ask here some other experts. Maybe they might know, if this 
> issue can be solved or not. 
> 
> But it appears, it is not, and so my question was fullly answered.

There isn't an easy answer, but there are answers. This is a problem
that everyone managing computers has. Some ignore it; the rest of us
come up with solutions that can never be perfect but involve trade-offs
we tolerate.

Firstly I would say, step back and consider your backup strategy. Even
if we were to suppose that it was possible to take an instant image of a
running server, THAT WOULD STILL NOT BE A BACKUP. So, independently of
this issue, you need to have backups of application data. You need that
because "server going on fire" is not all or even most of the ways you
lose data. Human error is more common. People delete stuff and mess it up.
You as the operator get called upon to restore application data from a
week or a month ago.

So one way to look at this is: there have to be backups in place, so
work out how to restore your base server image and then replay the
application data from backups onto it.

The correct way to take consistent backups of application data will
depend upon the application. For example, a database server like
Postgres or MariaDB has commands you can issue to take consistent
database-level locks and dump out the entire content of the chosen
tables.
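
As a rough sketch of what that can look like, the Python below only
builds the dump command lines and does not run them. The database name,
backup directory and file naming are invented for illustration. pg_dump
reads from a single consistent snapshot, and --single-transaction does
the equivalent for InnoDB tables in MariaDB.

```python
from datetime import date
from pathlib import PurePosixPath


def dump_command(engine, database, backup_dir, day):
    """Return (argv, output_path) for a consistent dump of one database.

    Nothing is executed here; hand argv to subprocess.run() to run it.
    The directory layout and file names are illustrative assumptions.
    """
    stem = f"{database}-{day.isoformat()}"
    if engine == "postgres":
        out = PurePosixPath(backup_dir) / f"{stem}.pgdump"
        # pg_dump works from one snapshot, so the dump is consistent
        # even while clients keep writing.
        argv = ["pg_dump", "--format=custom", f"--file={out}", database]
    elif engine == "mariadb":
        out = PurePosixPath(backup_dir) / f"{stem}.sql"
        # --single-transaction gives a consistent view of InnoDB tables.
        # Older installs call the tool mysqldump instead.
        argv = ["mariadb-dump", "--single-transaction",
                f"--result-file={out}", "--databases", database]
    else:
        raise ValueError(f"unknown engine: {engine}")
    return argv, str(out)
```

Keeping one dated file per day is what lets you answer the "restore it
from a week ago" request later.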

Maybe as the operator of the service you consider application-level
backups to be the users' concerns. So then your problem space is just
the base server image plus any custom configuration you did since.
There are many ways to solve that. A popular one is configuration
management: your configuration is stored like code and the config
management software can quickly and easily apply it to a base server
image, bringing the service live in a short period of time. It is a big
investment of time to set up and requires ongoing discipline to make
future changes in config management, not on the live servers.
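
To illustrate the core idea only: the desired state lives in a file you
can keep in version control, and applying it is safe to repeat. This is
a toy sketch, not how any particular tool works, and the paths and file
contents are made up.

```python
from pathlib import Path

# Hypothetical desired state: path relative to the root -> file content.
DESIRED = {
    "etc/motd": "Managed by config management, do not edit by hand.\n",
    "etc/myapp/myapp.conf": "listen = 127.0.0.1:8080\n",
}


def apply(desired, root):
    """Make files under root match desired; return the paths that changed."""
    changed = []
    for rel, content in sorted(desired.items()):
        target = Path(root) / rel
        if target.is_file() and target.read_text() == content:
            continue  # already correct, leave it alone
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        changed.append(rel)
    return changed
```

Run against a fresh base image it writes everything; run again it
changes nothing; run after someone edited a file by hand it puts that
one file back. Real tools add packages, services, templates and so on
but follow the same pattern.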

The issue of keeping the service running in the face of hardware failure
is resilience. If you have enough resources then you can design
something that doesn't have data consistency issues. For example, you
can have a Ceph storage cluster redundant at every level and maybe even
multi-site. If something breaks you fail over to different servers and
all the data is still there. Very few of us can justify that sort of
spend, so no more on that.

Your options without application-specific facilities are basically
different forms of filesystem snapshot. People already went through
options like btrfs, zfs, or LVM underneath other filesystems.

btrfs and zfs will take a (filesystem-level) consistent snapshot while
an LVM snapshot will appear like a power loss event to filesystems on
top of it. Modern filesystems are pretty robust against this and
applications that care about data safety take steps to be consistent
with power failure also: you lose some data that didn't get committed in
time but nothing should be corrupted or half-committed. Of course,
applications can be buggy, database clients can neglect to use
transactions properly, etc. That's why you need backups.
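
For reference, here is a sketch of the snapshot commands for the three
options, again only built as argument lists and not executed. The 5G
snapshot size, the ".snapshots" directory (which would have to exist
already) and all volume names are assumptions for the example.

```python
def snapshot_command(kind, source, name):
    """Return the argv that creates a snapshot called name.

    kind is "lvm", "btrfs" or "zfs". Nothing is executed here.
    """
    if kind == "lvm":
        # source is "vg/lv". The 5G is an assumed allowance for blocks
        # that change on the origin while the snapshot exists.
        return ["lvcreate", "--snapshot", "--size", "5G",
                "--name", name, source]
    if kind == "btrfs":
        # source is a mounted subvolume; -r makes the snapshot read-only.
        dest = f"{source.rstrip('/')}/.snapshots/{name}"
        return ["btrfs", "subvolume", "snapshot", "-r", source, dest]
    if kind == "zfs":
        # source is a dataset name such as "tank/srv".
        return ["zfs", "snapshot", f"{source}@{name}"]
    raise ValueError(f"unknown snapshot kind: {kind}")
```

A snapshot on the same disks is still not a backup. You would copy its
contents somewhere else and then remove it.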

Instead of filesystem snapshots you could look at disk mirrors. For
example, if you installed a server with two identical drives and made
sure that everything was in mdadm RAID-1 then every so often you could
pull out one drive and insert a blank one. The array(s) should sync onto
the blank drive and the drive you removed becomes your disaster recovery
image: You could insert it in a new server and boot it and it would
seem like a power loss event at the point you took the drive out.
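
If you would rather detach the drive cleanly before pulling it, the
mdadm steps look roughly like this. It is a sketch that only returns
the command lines; the device names are examples, and with one array
per partition you would repeat it for each array.

```python
def rotate_mirror(array, outgoing, incoming):
    """Return the mdadm command lines for swapping one mirror member."""
    return [
        # Mark the member faulty and detach it from the array. It keeps
        # whatever was on it at that moment.
        ["mdadm", array, "--fail", outgoing, "--remove", outgoing],
        # After physically swapping the drive and partitioning the new
        # one, add it back; the array then resyncs onto it.
        ["mdadm", array, "--add", incoming],
    ]
```

For example rotate_mirror("/dev/md0", "/dev/sdb1", "/dev/sdc1") gives
the two commands in order. The resync progress shows up in /proc/mdstat.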

There are lots of variations on this. You might not like that the server
has no redundancy while the mirror is broken, so maybe you run it as a
3-way mirror. Maybe you don't like that you have to physically go to
each server and swap a drive. In that case you could use DRBD to have
something like a RAID-1 but it's over the network to another server. As
we pursue these ideas further it gets more and more expensive.

Running applications inside virtual machines and containers helps to
confine their data into something that is easier to manage from the
outside.

For small sites I think there is a lot to be recommended about
configuration management, possibly with "golden images", to quickly get
a base system up, and a way to restore application data from backups -
since either you need that anyway or else it is purely the users'
problem!

No easy answers, sorry!

> Hope, I made not too much noise!

The question was welcome, it's just that it merely opens an area of
discussion that can easily be the focus of a person's entire working
life. 😀

Thanks,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting
