Hi Andrew: Here's a slightly different perspective that might help illuminate the checker and its rationale. While I concur with Mark that there are engineering issues with the implementation, I think it's a mistake to view it as a *file* integrity system (for which - as Mark rightly observes - there are better tools available). The checker was meant to be a *content* integrity system used in support of preservation. I'll explain the difference:
When content is ingested into DSpace, an association is forged between metadata, licenses, etc. and various content files. The only reliable identifiers for the latter are their checksums, since their names and many other attributes are file-system relative, not unique, etc. A legitimate question to put to a system that manages such resources over archival lengths of time is this: how do I know that the ingested files (as described by some metadata) are the same as what you are providing now (Tom De Mulder's question)? The checker is a tool that is supposed to raise our confidence in the answer the system provides, by periodically comparing the checksum recorded at ingest with current asset-store values (see the sketch appended below the quoted thread).

Of course, if the file system becomes corrupt, there is a much greater likelihood of an I/O failure, or some other symptom, than of an incorrect checksum being reported: that isn't the case the checker was primarily designed for. Rather, the checker supposes that over longer periods of time, content files will be moved from one storage device to another, from local disk to SAN to the cloud, from spinning disk to holographic cell, etc. At any of these junctures (or even during routine maintenance), due to mistake or malice, a different file can be substituted for the intended one. The checker should eventually detect this condition.

Of course the checker is not an especially high-security content integrity service (HP Labs has done work in this area, and has a system that operates with DSpace) - a wily, resourceful opponent could hack into the database and alter the original checksum along with the content. And of course if your repository has no concerns about long-term integrity, you should give the checker a pass.

Richard R

On Fri, 2009-04-17 at 01:10 -0700, Mark Diggory wrote:
> Andrew,
>
> As a committer, I have to be careful that my opinion may be construed
> as the viewpoint of the DSpace developers. So I will clarify that this
> is only my opinion, not the group's.
>
> I've never been impressed with the reasoning behind this addition to
> DSpace; it treats bitstream security and file corruption as something
> that should be tracked by the DSpace application. We encountered
> problems with the checksum checker getting bogged down due to some
> issue in the code/database. I was never able to get it restarted and
> continued to waste time on it until our IT Systems Admin showed me the
> light...
>
> A real file integrity system should be implemented outside of the
> application by an experienced system administrator vested in
> maintaining the security and integrity of the system, not in the
> application by a web application developer. I do value and respect the
> team that developed this add-on to DSpace, but disagree with the
> approach and the complexity of the code. Instead I would recommend
> running something more professional, like the following, on one's
> assetstore:
>
> http://www.sentry-go.com
> http://www.cs.tut.fi/~rammer/aide.html
> http://www.tripwire.com
> http://www.solidcore.com/
>
> Cheers,
> Mark
>
>
> On Fri, Apr 17, 2009 at 12:10 AM, Andrew Marlow
> <[email protected]> wrote:
> > Hello DSpacers,
> >
> > I came across the checksum checker recently and I don't understand
> > why it is useful. Here is what I found on the wiki:
> > ---
> > DSpace now comes with a Checksum Checker script ([dspace]/bin/checker)
> > which can be scheduled to verify the checksum of every item within
> > DSpace. Since DSpace calculates and records the checksum of every
> > file submitted to it, this script is able to determine whether or not
> > a file has been changed (either manually or by some sort of
> > corruption or virus).
> > ---
> > So why would an item be corrupt? Or altered manually? I know that any
> > filesystem object can get problems when the filesystem goes wobbly;
> > that's why we have backups. But surely the normal operating system
> > monitoring utilities will tell us when a filesystem needs repair?
> > Can someone explain please?
> > --
> > Regards,
> >
> > Andrew M.
> > http://www.andrewpetermarlow.co.uk
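P.S. In case it helps make "content integrity" concrete, here is a minimal sketch of the comparison the checker performs on each pass. The class and method names below are mine for illustration, not DSpace's actual API, but the idea is the same: DSpace records a checksum (MD5 by default) for each bitstream at ingest, and verification means recomputing the digest from the file currently in the asset store and comparing it with the recorded value.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative sketch only - not DSpace's real checker code. */
public class ChecksumSketch {

    /**
     * Recompute the digest of a stored file and compare it with the
     * hex-encoded checksum recorded at ingest time.
     */
    public static boolean matchesIngestChecksum(String path, String recordedHex)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(path);
        try {
            // Stream the file through the digest in small chunks,
            // so arbitrarily large bitstreams don't need to fit in memory.
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        // Hex-encode the freshly computed digest for comparison.
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString().equalsIgnoreCase(recordedHex);
    }
}

A scheduled job loops over the bitstreams, runs a comparison like this, and flags any mismatch for a curator to investigate. Note that this answers the application-level question ("is this still the file that was ingested?") rather than the filesystem-level one Mark's tools answer - with AIDE, for instance, one baselines the filesystem with 'aide --init' and later runs 'aide --check' against that baseline.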

