[email protected] wrote:

I sort of already answered this in my message a moment ago, but I'll
address it here for clarity.

> What is the purpose of these archives, and how do they really differ
> from backups?

Backups are made, and managed, from the point of view of the system (or
the sysadmin). A backup is "all the data and metadata on the file system"
(or database, but that is often a slightly more specialized case) as of a
particular point in time, so that we can recover that data and the state
of the file system if there is a failure between then and the next backup.

Often there are multiple versions available from the backup system,
because the loss is not always discovered immediately (and because it is
easy and helps users). How big that window is depends on a lot of
factors, but it is relatively small (days, weeks, maybe months).

> is it that you need to be able to selectively 'forget' the archive data
> (in some legal environments you must have data back to a specific date,
> but must not have any data older than that for example)

This is one key part of what started me down this path. The University,
like many (most?) organizations, has record retention and destruction
policies. Those policies (correctly) deal with the contents and context
of the "record" (data/documents), and don't care what form it is in (the
fact that it is email is not what matters; what matters is that the
email is about a purchase, or is a personnel issue, etc). Financial
records have a set schedule for how long they have to be kept and when
they should be destroyed. Personnel records have a different
schedule. Data from a research project has yet another schedule, which
might be different from another project depending on the details of the
research contract (as a side note, NSF now wants data management plans
for every research proposal; this is sort of related to that).

The problem with keeping backups "forever" (or a long time) is that
backups don't distinguish the type of data. If I keep backups for 2
years, I extend for 2 years how long we have *all* the data -- including
records that the user "destroyed" (by deleting) according to the
document retention/destruction schedule.
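
To put rough numbers on that, here is a small sketch (Python, purely my
own illustration, not anything we run). It assumes the simplest possible
model: the newest backup that still contains a record is taken on the day
the record is deleted, and every backup is kept for a fixed number of
days. The function name and the dates are made up.

```python
from datetime import date, timedelta

def last_restorable_day(deleted_on: date, backup_retention_days: int) -> date:
    # Assumption: the newest backup still holding the record was taken on
    # the deletion day, and it expires backup_retention_days later.
    return deleted_on + timedelta(days=backup_retention_days)

# Hypothetical record deleted on schedule, with backups kept for two years.
deleted_on = date(2020, 1, 1)
still_restorable_until = last_restorable_day(deleted_on, 2 * 365)
print(f"deleted {deleted_on}, still restorable until {still_restorable_until}")
```

Under that model the record outlives its scheduled destruction by the
full backup retention period, whatever the record's own schedule says.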

So backups (which have "all" the data) should only be kept for long
enough to detect a failure (disaster), and then of course for however
long it takes to recover (restore the data)*.

When I told one of our faculty that we planned to significantly change
the backup retention, he raised a bunch of issues which made me realize
that the faculty are (incorrectly) assuming that our backups meet their
archival needs, which they don't. 

[To make things stranger, I can't remember the last time we had to
restore data from an "old" backup -- so they haven't actually used this
feature, but they assume that being able to get to old data meets their
needs]

Archives are specific sets of data, selected by whoever controls the
data (in our case, the research project, but it could also be the
department manager), which are specifically saved -- they become a
record in and of themselves, and they should be saved and later
destroyed according to the appropriate records management schedule
(which is not my problem, that is the responsibility of the project).

If (for example) an archive has to be kept for 10 years, then we need
technical tools to not only organize the archive so that it can be found
when needed, and destroyed when it should be, but also to make sure it is
usable (avoid bit rot).
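
As a rough idea of what the minimum tooling might look like, here is a
sketch in Python. It is only an illustration under my own assumptions,
not an existing tool or standard: the manifest layout, the field names
("owner", "destroy_after", "files") and the function names are all
invented. It records a SHA-256 checksum per file when the archive is
made, so later runs can detect missing or silently changed files, and it
stores the destruction date next to the checksums so the archive can be
found when it is due.

```python
import hashlib
from datetime import date
from pathlib import Path

def sha256_of(path: Path) -> str:
    # Hash the file in 1 MiB chunks so large data sets fit in memory.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(archive_dir: Path, owner: str, destroy_after: date) -> dict:
    # Hypothetical manifest: who owns the archive, when it should be
    # destroyed, and a checksum for every file under archive_dir.
    files = {
        p.relative_to(archive_dir).as_posix(): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file()
    }
    return {
        "owner": owner,
        "destroy_after": destroy_after.isoformat(),
        "files": files,
    }

def verify(archive_dir: Path, manifest: dict) -> list:
    # Return the relative paths that are missing or whose content changed.
    damaged = []
    for rel, expected in manifest["files"].items():
        p = archive_dir / rel
        if not p.is_file() or sha256_of(p) != expected:
            damaged.append(rel)
    return damaged

def due_for_destruction(manifest: dict, today: date) -> bool:
    # True once the retention period recorded in the manifest has passed.
    return today > date.fromisoformat(manifest["destroy_after"])
```

The manifest would have to be stored somewhere other than the archive
directory itself (and on different media), otherwise the same failure
could damage both. Detecting damage is only half of it; repairing it
still needs a second copy of the archive.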

The simple answer is for each project to make an ad-hoc copy of their
data (an "archive") when they want to "make an archive" and stuff it
somewhere in the filesystem, and manage it themselves.

What I was wondering is if there is a set of best practices and
standards for "data archives" and if we (or hopefully someone else)
should offer a specific "data archive" service to meet this need (so
that everyone does it the same way, instead of ad-hoc).

Does that help explain it?

     --david
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
