On 9/27/05, Carl Lowenstein <[EMAIL PROTECTED]> wrote:
> On 9/27/05, m ike <[EMAIL PROTECTED]> wrote:
> > On 9/27/05, Carl Lowenstein <[EMAIL PROTECTED]> wrote:
> > > On 9/27/05, m ike <[EMAIL PROTECTED]> wrote:
> > > > for extracting a portion of a file, the dd command can be hastened
> > > > dramatically (by a factor of 10,000) by changing the bs= to bs=1024
> > > > (for example) and increasing count to be inclusive, and then piping
> > > > the result to head -c to trim it down to exact byte-size.
> > > >
> > > > 10,000 may be an exaggeration. okay it is an exaggeration. but it does
> > > > not seem to be far off.
> > >
> > > There are two reasons for using large block sizes in dd. One is to
> > > eliminate the overhead of e.g. issuing a million system calls each to
> > > read one byte, vs. one system call to read a million bytes. The other
> > > is to reduce the effect of missing the "next block" in a disk read.
> > > If you have to wait for a whole disk revolution to read a block, your
> > > data transfer slows down in proportion to the number of blocks per
> > > cylinder. Nowadays this can range from 600 at the inner radius to
> > > 1200 at the outer. (these are real physical blocks, not the fictional
> > > blocks that LBA software uses).
> >
> > fwiw, afaik, when one is grabbing a specific hunk within
> > a file, the largest bs= that can be specified is the greatest
> > common divisor of skip= and count=.
>
> Not if you first grab a large chunk and then skip and count within it
> for a smaller selection.

Exactly! Sorry for the confusion. That was my trick in the initial post :)
of this thread ---- except that the first process _starts_ accurately (but
runs too long), and the second process _ends_ accurately.
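For concreteness, here is a minimal sketch of that two-stage idiom (the file
name and numbers are made up, and it assumes the region to extract starts on
a block boundary, as the FAT clusters below do):

# hypothetical example: pull $length bytes starting at byte $offset
# of somefile, where $offset happens to be a multiple of $bs
offset=40960
length=123456
bs=4096
skip=$(( offset / bs ))
count=$(( length / bs + 1 ))   # round up; head -c trims the excess
dd if=somefile bs=$bs skip=$skip count=$count 2>/dev/null | head -c $length > region.bin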
fwiw, here is the updated worksheet. changes are in the final section.

################################################################
# 2005-09-27
#
# for reasons of speed, I have changed the dissection idiom to:
#     dd | head -c
#
#
# 2005-09-16
# This is a worksheet that I developed to dissect intact all
# 116 jpgs from about 282MB of an accidentally reformatted
# and partially overwritten 512MB FAT16 CF card (Olympus c5050)
#
# The 282MB turned out to be un-fragmented in the sense that
# each JPG resided in a continuous stretch of disk space.
#
# Working in a bash shell, the basic approach is to use sed to
# make the jpg begin/close markers grep-able, then use grep to
# identify their byte-offsets, then use dd to dissect the jpg.
#
########
#
# sed --version
#   GNU sed version 4.1.2
# grep --version
#   grep (GNU grep) 2.5.1
# bash --version
#   GNU bash, version 3.00.0(1)-release (i586-suse-linux)
#
# FF D8 is the beginning marker of a jpg.
# FF D9 is the closing marker of a jpg.
# Since each jpg contains an embedded jpg thumbnail, there
# will be nested pairs of markers.
#
# grep's -b option will report the byte-offset of the line
# containing the match, not the offset of the match itself.
#
# This page got me started (thanks TsuruZoh Tachibanaya):
#
#   http://www.media.mit.edu/pia/Research/deepview/exif.html
#
################################################################


########## MAKE A WORKING COPY OF THE FLASH CARD
########## THIS STEP IS OPTIONAL

cat /dev/sda1 > CF


########## MAKE A SMALLER FILE TO WORK WITH
########## THIS STEP IS OPTIONAL

# find the byte offset of the first jpg residing
# in the not-overwritten half of the card

# grab lines containing either the begin marker or exif
# date/time info.

hexdump -C CF | grep -e "\ ff\ d8\ \|[0-9]:[0-9]" > CF_grepped_hexdump

# The hexdump takes a few minutes for 512MB
#
# To locate the first deleted jpg, hand-search the output, paying
# attention to the exif date/time strings in the ascii column.
#
# Each line generated by hexdump begins with a hexadecimal
# number that indicates the byte-offset, such as 0d6bf600
#
# the following line converts that number to base-ten

printf "%d" 0x0d6bf600

# Calculate the number of bytes in the file that
# follow the offset:  filesize - byteoffset

tail -c 287271936 CF > CF_short

# Verify the short file starts with the jpg marker FF D8

hexdump -C CF_short | head


########## MAKE FILES THAT ARE GREP-ABLE for FF D8 and FF D9

# verify that ^BEGIN or ^CLOSE will be unique (grep should
# not find any matches)

grep ^BEGIN CF_short
grep ^CLOSE CF_short

# \x0A is the newline character. Make sure to insert it so
# that in the next step, grep will report the offset of the
# markers.

cat CF_short | sed -e 's/\xFF\xD8/\x0ABEGIN/' > SED_begin_1
cat CF_short | sed -e 's/\xFF\xD9/\x0ACLOSE/' > SED_close_1

# Note that sed will replace 2 bytes with 6 bytes. Note also
# that the byte offsets for the close markers will indicate
# the beginning of the 2-byte markers, not their ends. These
# 2 issues will need to be accounted for later.
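As a quick illustration of why the leading \x0A matters (a synthetic example;
demo.bin and its contents are made up): GNU grep's -b prints the 0-based
offset of the start of the matching line, which lands one byte past where the
original FF D8 began.

# FF D8 sits at offsets 4-5 of this 10-byte file
printf 'aaaa\xFF\xD8bbbb' > demo.bin
sed -e 's/\xFF\xD8/\x0ABEGIN/' demo.bin | grep -a -b ^BEGIN
# prints 5:BEGINbbbb -- the inserted \x0A now occupies offset 4, so
# the reported offset is one byte past the original FF; that is the
# "subtract 1" adjustment applied further down
# (in a UTF-8 locale, LC_ALL=C may be needed for the \xFF match)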
########## REMOVE UNIMPORTANT LINES:

# "strings" strips non-printable characters.
# The 2nd grep filters out unwanted lines.
# The 2nd sed leaves only the byte offset.

grep -a -b ^BEGIN SED_begin_1 | strings | grep BEGIN | sed -e 's#:BEGIN.*$##' > SED_begin_2
grep -a -b ^CLOSE SED_close_1 | strings | grep CLOSE | sed -e 's#:CLOSE.*$##' > SED_close_2

wc SED_begin_2
#     232     232    2217 SED_begin_2
wc SED_close_2
#    2391    2391   23802 SED_close_2

# note the excess number of CLOSEs, presumably left over
# from previous uses of the flash card. these excess
# CLOSEs occur (?) at the end of the CF card, and in the
# not-overwritten space that exists between the jpgs


############ INSPECT THE BYTE-DISTANCE BETWEEN SUCCESSIVE BEGINS
############ THIS STEP IS OPTIONAL

old_offset=0;
n_begin=0;
for i in `cat SED_begin_2`; do
    (( n_begin += 1 ));
    new_offset=${i/\:BEGIN*/};
    distance=$(( new_offset - old_offset ));
    # if [ $distance -lt 4096 ];
    #then
        printf "%4d " $n_begin ;
        printf "%10d %10d " $new_offset $old_offset;
        printf "%10d\n" $distance;
    # fi;
    old_offset=$new_offset;
done

# everything looks good so far (due to exif
# header data, every other distance is 4096+4)


########## ADJUST THE BYTE OFFSETS

# Subtract 4(n-1) bytes from the nth offset
# Add 2 bytes to the close offsets to include FF D9
# Subtract 1 because the inserted \x0A occupies the first byte of
# each marker, so grep's (0-based) line offset points one byte past
# where the marker originally began

rm -f SED_begin_3
nn=-1;
for i in `cat SED_begin_2`; do
    (( nn += 1 ));
    echo $(( i - 4 * nn - 1 )) >> SED_begin_3;
done

rm -f SED_close_3
nn=-1;
for i in `cat SED_close_2`; do
    (( nn += 1 ));
    echo $(( 2 + i - 4 * nn - 1 )) >> SED_close_3;
done


########## GRAB EVERY OTHER LINE, BEGINNING WITH 1ST

# this is necessary due to the embedded jpg thumbnail

cat SED_begin_3 | sed -n '1~2p' > SED_begin_4

wc SED_begin_4
#     116     116    1107 SED_begin_4
# looks like 116 jpgs will be recovered !!


############ CALCULATE THE EXTENT OF EACH JPG

# if this code runs smoothly, uncomment the CPU intensive
# dd command and run it again to dissect out the jpgs.

rm -f recovered*jpg;

old_close_offset=0;
n_begin=0;
for begin_offset in `cat SED_begin_4`; do
    (( n_begin += 1 ));

    # Now find the second closing marker. The first
    # closing marker belongs to the embedded thumbnail
    n_found=0;
    n_close=0;
    for close_offset in `cat SED_close_3`; do
        (( n_close += 1 ));
        if [ $close_offset -gt $begin_offset ]; then
            (( n_found += 1 ));
            if [ $n_found -eq 2 ]; then break; fi;
        fi;
    done;

    size_of_blk=$(( 1024 * 4 ));
    size_of_jpg=$(( $close_offset - $begin_offset ));
    size_of_gap=$(( $begin_offset - $old_close_offset ));
    begin_error=$(( begin_offset - ( size_of_blk * ( begin_offset / size_of_blk ) ) ));
    skip=$(( begin_offset / size_of_blk ));
    # add 1 to round up
    chunk=$(( 1 + ( size_of_jpg / size_of_blk ) ));

    fn=`printf "recovered_%04d.jpg" $n_begin`;

    printf "%12s " $fn;
    printf "%5d %5d " $n_begin $n_close;
    printf "%10d %10d " $size_of_gap $size_of_jpg;
    printf "%5d " $begin_error;
    printf "%10d %10d " $begin_offset $close_offset;
    printf "(%4d %6d %4d)\n" $size_of_blk $skip $chunk;

    old_close_offset=$close_offset;

    #dd bs=${size_of_blk}c skip=${skip}c count=${chunk}c if=CF_short | head -c $size_of_jpg > $fn;
done
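A possible sanity check once the dd line has been uncommented and run
(a sketch; it assumes the recovered_*.jpg names produced by the loop above):

for f in recovered_*.jpg; do
    # show the first and last two bytes of each recovered file;
    # every line should read  start=ffd8 end=ffd9
    first=`head -c 2 "$f" | hexdump -e '2/1 "%02x"'`;
    last=`tail -c 2 "$f" | hexdump -e '2/1 "%02x"'`;
    printf "%20s start=%s end=%s\n" "$f" $first $last;
done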
