Hi gang,

it seems to me that the version of wget I have

archec@darni:~/pp$ wget -V | head -1
GNU Wget 1.20.3 built on linux-gnu.

has a bug at the point where it generates block digests for WARC
revisit records. To establish this, look at some sample lines of bash

################################################################
TARGET=http://openlib.org/home/krichel/debug.css
WARC_1=/tmp/1
WARC_2=/tmp/2
DEDUP="--warc-dedup ${WARC_1}.cdx "
FLAGS='-O /dev/null --no-warc-compression --no-warc-keep-log --warc-cdx'
# start from clean sheet
rm -f $WARC_1.warc $WARC_2.warc ${WARC_1}.cdx
# run twice, dedup in CDX
wget $FLAGS --warc-file $WARC_1 $TARGET
wget $FLAGS --warc-file $WARC_2 $DEDUP $TARGET
# append the second to the first
cat $WARC_2.warc >> $WARC_1.warc
# now verify
python3 -m warcat --verbose verify $WARC_1.warc
################################################################

When I run this

Opening WARC file ‘/tmp/1.warc’.

--2020-07-09 14:57:34-- http://openlib.org/home/krichel/debug.css
Resolving openlib.org (openlib.org)... 2a01:4f9:2a:23a8::2, 95.216.35.87
Connecting to openlib.org (openlib.org)|2a01:4f9:2a:23a8::2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53 [text/css]
Saving to: ‘/dev/null’

/dev/null 100%[===============================>] 53 --.-KB/s in 0s

2020-07-09 14:57:34 (11.8 MB/s) - ‘/dev/null’ saved [53/53]

Loaded 1 record from CDX.

Opening WARC file ‘/tmp/2.warc’.

--2020-07-09 14:57:34-- http://openlib.org/home/krichel/debug.css
Resolving openlib.org (openlib.org)... 2a01:4f9:2a:23a8::2, 95.216.35.87
Connecting to openlib.org (openlib.org)|2a01:4f9:2a:23a8::2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53 [text/css]
Saving to: ‘/dev/null’

/dev/null 100%[===============================>] 53 --.-KB/s in 0s

Found exact match in CDX file. Saving revisit record to WARC.

2020-07-09 14:57:34 (15.5 MB/s) - ‘/dev/null’ saved [53/53]

INFO:warcat.model.warc:Opened file /tmp/1.warc
ERROR:warcat.tool:Record <urn:uuid:ccc449bb-ddcc-4436-840b-fb143f33f47d> failed validation
Traceback (most recent call last):
  File "/home/archec/local/lib/python/warcat/tool.py", line 283, in action
    action(record)
  File "/home/archec/local/lib/python/warcat/tool.py", line 292, in verify_block_digest
    raise VerifyProblem('Bad block digest.', '5.8')
warcat.tool.VerifyProblem: ('Bad block digest.', '5.8', True)
INFO:warcat.model.warc:Finished reading Warc
Validation failed. Problems: 1.

The dedup works but the block digest on revisit record is not
correct. All other records validate just fine.

-- 
Cheers,

Thomas Krichel
http://openlib.org/home/krichel
skype:thomaskrichel
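
P.S. For anyone who wants to cross-check the digest by hand rather than rely on warcat, here is a minimal sketch in Python. It assumes the WARC-Block-Digest header carries a base32-encoded SHA-1 of the record block in the form "sha1:<base32>", and that the block is exactly the Content-Length bytes following the blank line after the WARC headers. The function names warc_block_digest and digest_matches are my own, not part of wget or warcat.

```python
import base64
import hashlib


def warc_block_digest(block: bytes) -> str:
    """Return the 'sha1:<base32>' label for a record block.

    Assumption: the writer labels blocks with base32-encoded SHA-1.
    """
    raw = hashlib.sha1(block).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def digest_matches(header_value: str, block: bytes) -> bool:
    """Compare a WARC-Block-Digest header value with a recomputed digest."""
    return header_value.strip() == warc_block_digest(block)
```

To use it, cut the block of the revisit record out of the WARC file as bytes, pass it together with the value of the record's WARC-Block-Digest header to digest_matches, and see whether it returns True.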
