On 01/08/2013 10:11 AM, Farley, Peter x23353 wrote:
I have usually found that the SuperC utility (ASMFSUPC actually, the HLASM 
Toolkit enhanced version) does a very creditable job of finding differences in 
text or report files, and is quite useful in regression testing batch 
application changes.

However, I just discovered that the compare process can "lose its place" 
comparing text files with a significant number of changes.  I have an application change 
that inserts an additional 4 lines of text after every 6th line of the original text, with the 
differences and inserts scattered over all the sections of a report.  There are 115,400 
new records and 115,400 changed lines in total, arranged such that there are four changed 
lines immediately preceding each of the sets of four inserted lines.  The original file 
has 268,716 lines and the new file has 389,323 lines.  There are heading lines for each 
page as well, so I used the DPLINE option to tell SuperC to ignore the header lines and 
the entirely blank lines.
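(For readers without SuperC handy, the effect of excluding header and blank lines before comparing, as DPLINE is used for above, can be roughly illustrated in Python. This is purely an analogy, not SuperC itself; the "PAGE" prefix is an assumed marker for heading lines.)

```python
# Rough analogue of comparing report files while ignoring page headers
# and entirely blank lines, then diffing only the significant records.
import difflib

def significant(lines, header_prefix="PAGE"):
    # Drop blank lines and (assumed) page-heading lines before comparing.
    return [ln for ln in lines if ln.strip() and not ln.startswith(header_prefix)]

old = ["PAGE 1", "", "detail 1", "detail 2"]
new = ["PAGE 1", "", "detail 1", "detail 2a", "inserted"]

for line in difflib.unified_diff(significant(old), significant(new), lineterm=""):
    print(line)
```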

However, after 13,467 lines of the original file and 20,154 lines of the new 
file, SuperC seems to stop recognizing changed/inserted lines in groups and 
starts reporting huge globs of deletes and inserts (6,439 deletes followed by 
10,778 inserts the first time it lost its place).

Has anyone else seen this behavior?  Is there anything I can do to help SuperC "keep 
its place" and report the actual changes instead of globs of inserts and deletes?

TIA for your help with this problem.

Peter
--
...

This is an inherent problem with any algorithm that attempts to detect changes between files while minimizing resource consumption. The typical algorithm compares the two files and, when a difference is detected, scans forward through subsequent records in both files for a "resync" point where the files match up again; trivial false resync points are typically eliminated by requiring "n" consecutive matching records at the resync point.

The problem is that for any "n", one can construct cases where false matches are found, or, if the changes are too massive, where no resync point is found at all. We humans would like the algorithm to find the "best" resync point, probably defined as the one that minimizes the total changes reported for the two files, but it is unclear how one could compute this without a recursive algorithm that exhaustively tries all possible resync points rather than accepting the first one found.

Humans looking at two files that have frequently recurring patterns interspersed with unique records can intuitively tune out the repetitive patterns and focus on the unique records when looking for the best resync point, but building those smarts into a formal algorithm may not be feasible.
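The resync strategy described above can be sketched in a few lines of Python. This is an illustrative toy, an assumption about the general technique rather than SuperC's actual implementation: on a mismatch it takes the *first* offset pair where n consecutive records agree, which is exactly where a false resync can creep in on repetitive data.

```python
def resync_diff(old, new, n=2):
    """Compare two record lists. On a mismatch, search forward for the
    nearest point where n consecutive records match again (a "resync
    point") and resume there. Returns (op, record) tuples with op in
    '=', '-', '+'."""
    out = []
    i = j = 0
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            out.append(('=', old[i]))
            i += 1
            j += 1
            continue
        # Mismatch: scan for the first offsets (di, dj), in order of
        # increasing combined distance, where n records agree.
        best = None
        for total in range(1, len(old) + len(new)):
            for di in range(total + 1):
                dj = total - di
                if (i + di + n <= len(old) and j + dj + n <= len(new)
                        and old[i + di:i + di + n] == new[j + dj:j + dj + n]):
                    best = (di, dj)
                    break
            if best:
                break
        if best is None:
            break  # no resync point: everything left is deletes + inserts
        di, dj = best
        out.extend(('-', r) for r in old[i:i + di])   # skipped old records
        out.extend(('+', r) for r in new[j:j + dj])   # skipped new records
        i += di
        j += dj
    out.extend(('-', r) for r in old[i:])
    out.extend(('+', r) for r in new[j:])
    return out
```

With a small n, repetitive records (page headers, recurring detail patterns) let a false resync point win; with a larger n, or with changes too dense for any n-record run to line up, the `best is None` branch fires and everything remaining is reported as one glob of deletes followed by inserts, much like the behavior described in the original post.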

--
Joel C. Ewing,    Bentonville, AR       [email protected] 

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
