cmp requires both files, right? I suspect the problem here is "is the file different today than what it was yesterday?"
Interesting point on record lengths. Even RDW hashes don't actually solve the entire (theoretical -- probably not real) problem. What about two FB files with the same bits but different record lengths (e.g., 500 80-byte records versus 1000 40-byte records)? They are unarguably different, but a hash might yield the same sum for both.

Is the requirement exclusively QSAM files? What about VSAM? What about PDS? PDSE? PDS(E) as a whole, or member by member? I strongly suspect that, unlike for a QSAM file, if I overwrote a PDSE member with identical data, the hash for the PDSE as a whole would change.

Charles

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf Of Paul Gilmartin
Sent: Thursday, August 20, 2015 1:51 PM
To: [email protected]
Subject: Re: Ideas for hash of a sequential data set

On Thu, 20 Aug 2015 11:55:22 -0500, Kirk Wolf wrote:
>
>The problem, of course, is that DSCBs don't have "last update timestamps".
>
And in systems that have them, timestamps often can be forged by the user. I once recovered a z/OS HSM-backed-up data set and was dismayed to see that the timestamp was set to the time of recovery rather than the time of last access.

>My initial whack at this would be to use a two-part hash:
>
>part 1: a shortened SHA-1 hash of the format-1/8 DSCB
>part 2: a full SHA-1 hash of all of the data
>
Your hashing should be sensitive to record boundaries, else an operation as simple as splitting a record in two will not be detected as a change. Perhaps hash the RDWs also. (I argued this on CMS-PIPELINES a while ago. The Bad Guys won.)

Would performance be better by replacing hash with diff(1) or cmp(1)?

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
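Gil's point about record-boundary sensitivity, and Charles's FB example, can be sketched in a few lines of Python. This is a hypothetical illustration, not code from the thread: mixing each record's length into the digest (a stand-in for hashing the RDWs) makes two re-blockings of identical bytes hash differently, where a plain hash of the concatenated bytes cannot tell them apart.

```python
import hashlib

def record_aware_hash(records):
    """SHA-1 over a data set, folding each record's length into the
    digest so that re-blocking the same bytes changes the hash.
    (Hypothetical sketch; a 4-byte length prefix stands in for an RDW.)"""
    h = hashlib.sha1()
    for rec in records:
        h.update(len(rec).to_bytes(4, "big"))  # length prefix first...
        h.update(rec)                          # ...then the record data
    return h.hexdigest()

# The same 40,000 bytes blocked two ways: 500 x 80 vs. 1000 x 40.
data = b"\xc1" * 40000
recs_80 = [data[i:i+80] for i in range(0, len(data), 80)]
recs_40 = [data[i:i+40] for i in range(0, len(data), 40)]

# A naive hash of the concatenated bytes sees no difference...
assert hashlib.sha1(b"".join(recs_80)).hexdigest() == \
       hashlib.sha1(b"".join(recs_40)).hexdigest()
# ...but the record-aware hash distinguishes the two blockings.
assert record_aware_hash(recs_80) != record_aware_hash(recs_40)
```

It also catches Gil's split-a-record case: splitting one record in two leaves the concatenated bytes identical but changes the sequence of length prefixes, so the digest changes.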
