Hi Andrew: Here's a slightly different perspective that might help illuminate the checker and its rationale. While I concur with Mark that there are engineering issues with the implementation, I think it's a mistake to view it as a *file* integrity system (for which - as Mark rightly observes - there are better tools available). The checker was meant to be a *content* integrity system used in support of preservation. I'll explain the difference:
When content is ingested into DSpace, an association is forged between metadata, licenses, etc. and various content files. The only reliable identifiers for the latter are their checksums, since their names and many other attributes are file-system relative, not unique, etc. A legitimate question to put to a system that manages such resources over archival lengths of time is this: how do I know that the ingested files (as described by some metadata) are the same as what you are providing now (Tom De Mulder's question)? The checker is a tool that is supposed to raise our confidence in the answer the system provides, by periodically comparing the checksum recorded at ingest with current asset-store values (see the sketch appended below the quoted thread).

Of course, if the file system becomes corrupt, there is a much greater likelihood of an I/O failure, or some other symptom, than of an incorrect checksum being reported: that isn't the case the checker was primarily designed for. Rather, the checker supposes that over longer periods of time, content files will be moved from one storage device to another, from local disk to SAN to the cloud, from spinning disk to holographic cell, etc. At any of these junctures (or even during routine maintenance), due to mistake or malice, a different file can be substituted for the intended one. The checker should eventually detect this condition.

Of course the checker is not an especially high-security content integrity service (HP Labs has done work in this area, and has a system that operates with DSpace) - a wily, resourceful opponent could hack into the database and alter the original checksum along with the content. And of course if your repository has no concerns about long-term integrity, you should give the checker a pass.

Richard R

On Fri, 2009-04-17 at 01:10 -0700, Mark Diggory wrote:
> Andrew,
>
> As a committer, I have to be careful that my opinion may be construed
> as the viewpoint of the DSpace developers. So I will clarify that this
> is only my opinion, not the group's.
>
> I've never been impressed with the reasoning behind this addition to
> DSpace; it treats bitstream security and file corruption as something
> that should be tracked by the DSpace application. We encountered
> problems with the checksum checker getting bogged down due to some
> issue in the code/database. I was never able to get it restarted and
> continued to waste time on it until our IT Systems Admin showed me the
> light...
>
> A real file integrity system should be implemented outside of the
> application by an experienced system administrator vested in
> maintaining the security and integrity of the system, not in the
> application by a web application developer. I do value and respect the
> team that developed this add-on to DSpace, but disagree with the
> approach and the complexity of the code. Instead I would recommend
> running something more professional, like the following, on one's
> assetstore:
>
> http://www.sentry-go.com
> http://www.cs.tut.fi/~rammer/aide.html
> http://www.tripwire.com
> http://www.solidcore.com/
>
> Cheers,
> Mark
>
>
> On Fri, Apr 17, 2009 at 12:10 AM, Andrew Marlow
> <[email protected]> wrote:
> > Hello DSpacers,
> >
> > I came across the checksum checker recently and I don't understand
> > why it is useful. Here is what I found on the wiki:
> > ---
> > DSpace now comes with a Checksum Checker script ([dspace]/bin/checker)
> > which can be scheduled to verify the checksum of every item within
> > DSpace. Since DSpace calculates and records the checksum of every
> > file submitted to it, this script is able to determine whether or not
> > a file has been changed (either manually or by some sort of
> > corruption or virus).
> > ---
> > So why would an item be corrupt? Or altered manually? I know that any
> > filesystem object can get problems when the filesystem goes wobbly;
> > that's why we have backups. But surely the normal operating system
> > monitoring utilities will tell us when a filesystem needs repair?
> > Can someone explain please?
> > --
> > Regards,
> >
> > Andrew M.
> > http://www.andrewpetermarlow.co.uk
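P.S. In case it helps make "content integrity" concrete, here is a minimal sketch of the comparison the checker performs on each pass. The class and method names below are mine for illustration, not DSpace's actual API, but the idea is the same: DSpace records a checksum (MD5 by default) for each bitstream at ingest, and verification means recomputing the digest from the file currently in the asset store and comparing it with the recorded value.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative sketch only - not DSpace's real checker code. */
public class ChecksumSketch {

    /**
     * Recompute the digest of a stored file and compare it with the
     * hex-encoded checksum recorded at ingest time.
     */
    public static boolean matchesIngestChecksum(String path, String recordedHex)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(path);
        try {
            // Stream the file through the digest in small chunks,
            // so arbitrarily large bitstreams don't need to fit in memory.
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        // Hex-encode the freshly computed digest for comparison.
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString().equalsIgnoreCase(recordedHex);
    }
}

A scheduled job loops over the bitstreams, runs a comparison like this, and flags any mismatch for a curator to investigate. Note that this answers the application-level question ("is this still the file that was ingested?") rather than the filesystem-level one Mark's tools answer - with AIDE, for instance, one baselines the filesystem with 'aide --init' and later runs 'aide --check' against that baseline.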

