Re: Batch file question - average size of file in directory
On 2007-01-02 10:20, Kurt Buff <[EMAIL PROTECTED]> wrote:
> All,
>
> I don't even have a clue how to start this one, so am looking for a
> little help.
>
> I've got a directory with a large number of gzipped files in it (over
> 110k) along with a few thousand uncompressed files. I'd like to find
> the average uncompressed size of the gzipped files, and ignore the
> uncompressed files.
>
> How on earth would I go about doing that with the default shell (no
> bash or other shells installed), or in perl, or something like that.
> I'm no scripter of any great expertise, and am just stumbling over
> this trying to find an approach.

You can probably use awk(1) or perl(1) to post-process the output of
gzip(1). The gzip(1) utility, when run with the -cd options, will
uncompress the compressed files and send the uncompressed data to
standard output, without actually affecting the on-disk copy of the
compressed data. It is then easy to pipe the uncompressed data to
wc(1) to count the bytes of the uncompressed data:

    for fname in *.Z *.z *.gz; do
        if test -f "${fname}"; then
            gzip -cd "${fname}" | wc -c
        fi
    done

This will print the byte size of the uncompressed output of gzip, for
all the files which are currently compressed. Something like the
following could be its output:

    220381
    3280920

This can be piped into awk(1) for further processing, with something
like this:

    for fname in *.Z *.gz; do
        if test -f "$fname"; then
            gzip -cd "$fname" | wc -c
        fi
    done | \
    awk 'BEGIN {
            min = -1; max = 0; total = 0;
         }
         {
            total += $1;
            if ($1 > max) {
                max = $1;
            }
            if (min == -1 || $1 < min) {
                min = $1;
            }
         }
         END {
            if (NR > 0) {
                printf "min/avg/max file size = %d/%d/%d\n",
                    min, total / NR, max;
            }
         }'

With the same files as above, the output of this would be:

    min/avg/max file size = 220381/1750650/3280920

With a slightly modified awk(1) script, you can even print a running
min/average/max count, following each line. Modified lines are marked
with a pipe character (`|') in their leftmost column below. The '|'
characters are *not* part of the script itself.

    for fname in *.Z *.gz; do
        if test -f "$fname"; then
            gzip -cd "$fname" | wc -c
        fi
    done | \
    awk 'BEGIN {
            min = -1; max = 0; total = 0;
|           printf "%10s %10s %10s %10s\n",
|               "SIZE", "MIN", "AVERAGE", "MAX";
         }
         {
            total += $1;
            if ($1 > max) {
                max = $1;
            }
            if (min == -1 || $1 < min) {
                min = $1;
            }
|           printf "%10d %10d %10d %10d\n",
|               $1, min, total/NR, max;
         }
         END {
            if (NR > 0) {
|               printf "%10s %10d %10d %10d\n",
|                   "TOTAL", min, total / NR, max;
            }
         }'

When run with the same set of two compressed files this will print:

          SIZE        MIN    AVERAGE        MAX
        220381     220381     220381     220381
       3280920     220381    1750650    3280920
         TOTAL     220381    1750650    3280920

Please note though that with a sufficiently large set of files, awk(1)
may fail to count the total number of bytes correctly. If this is the
case, it should be easy to write an equivalent Perl or Python script,
to take advantage of their big-number support.
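For what it's worth, a minimal Python sketch of that alternative could
look like the following. This is an illustration only, and it assumes
plain .gz files: Python's gzip module does not read .Z archives, which
would still need gzip -cd. Python integers are arbitrary-precision, so
the running total cannot overflow no matter how many files are summed:

    #!/usr/bin/env python
    # Stream-decompress each .gz file and count its bytes -- the same
    # idea as gzip -cd | wc -c, leaving the on-disk copies untouched.
    import glob
    import gzip

    sizes = []
    for fname in glob.glob('*.gz'):
        n = 0
        f = gzip.open(fname, 'rb')
        while True:
            chunk = f.read(65536)      # 64 KiB at a time, not whole file
            if not chunk:
                break
            n += len(chunk)
        f.close()
        sizes.append(n)

    if sizes:
        print("min/avg/max file size = %d/%d/%d"
              % (min(sizes), sum(sizes) // len(sizes), max(sizes)))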
Re: Batch file question - average size of file in directory
Message: 17
Date: Tue, 2 Jan 2007 19:50:01 -0800
From: James Long <[EMAIL PROTECTED]>

> Message: 28
> Date: Tue, 2 Jan 2007 10:20:08 -0800
> From: Kurt Buff <[EMAIL PROTECTED]>
>
> > I don't even have a clue how to start this one, so am looking for a
> > little help.
> >
> > I've got a directory with a large number of gzipped files in it
> > (over 110k) along with a few thousand uncompressed files.

If it were me I'd mv those into a bunch of subdirectories; things get
really slow with more than 500 or so files per directory .. anyway ..

> > I'd like to find the average uncompressed size of the gzipped
> > files, and ignore the uncompressed files.
> >
> > How on earth would I go about doing that with the default shell (no
> > bash or other shells installed), or in perl, or something like
> > that. I'm no scripter of any great expertise, and am just stumbling
> > over this trying to find an approach.
> >
> > Many thanks for any help,
> >
> > Kurt
>
> Hi, Kurt.

And hi, James,

> Can I make some assumptions that simplify things?
>
> No kinky filenames, just [a-zA-Z0-9.]. My approach specifically
> doesn't like colons or spaces, I bet.
>
> Also, you say gzipped, so I'm assuming it's ONLY gzip, no bzip2, etc.
>
> Here's a first draft that might give you some ideas. It will output:
>
> foo.gz : 3456
> bar.gz : 1048576
> (etc.)
>
> find . -type f | while read fname; do
>     file $fname | grep -q compressed &&
>     echo "$fname : $(zcat $fname | wc -c)"
> done

% file cat7/tuning.7.gz
cat7/tuning.7.gz: gzip compressed data, from Unix

Good check, though grep "gzip compressed" excludes bzip2 etc. But you
REALLY don't want to zcat 110 thousand files just to wc 'em, unless
it's a benchmark :) .. may I suggest a slight speedup, template:

% gunzip -l cat7/tuning.7.gz
  compressed  uncompr.  ratio  uncompressed_name
       13642     38421  64.5%  cat7/tuning.7

> If you really need a script that will do the math for you, then pipe
> the output of this into bc:
>
> #!/bin/sh
>
> find . -type f | {
>     n=0
>     echo "scale=2"
>     echo -n "("
>     while read fname; do
-         if file $fname | grep -q compressed
+         if file $fname | grep -q "gzip compressed"
>         then
-             echo -n "$(zcat $fname | wc -c)+"
+             echo -n "$(gunzip -l $fname | grep -v comp | awk '{print $2}')+"
>             n=$(($n+1))
>         fi
>     done
>     echo "0) / $n"
> }
>
> That should give you the average decompressed size of the gzip'ped
> files in the current directory.

HTH, Ian
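The reason gunzip -l is so much faster is that it never decompresses
anything: per RFC 1952, the last four bytes of a gzip file store the
uncompressed length modulo 2^32, little-endian. A Python sketch of
reading that trailer field directly, with the same caveats as
gunzip -l (the value wraps for files of 4 GiB or more, and is wrong
for multi-member archives):

    #!/usr/bin/env python
    # Average the gzip ISIZE trailer fields -- no decompression at all.
    import os
    import struct

    def gzip_isize(path):
        f = open(path, 'rb')
        f.seek(-4, 2)                  # whence=2: seek from end of file
        isize = struct.unpack('<I', f.read(4))[0]
        f.close()
        return isize

    sizes = [gzip_isize(p) for p in os.listdir('.') if p.endswith('.gz')]
    if sizes:
        print("average uncompressed size: %d" % (sum(sizes) // len(sizes)))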
Re: Batch file question - average size of file in directory
On 1/2/07, James Long <[EMAIL PROTECTED]> wrote:

<snip my problem description>

> Hi, Kurt.
>
> Can I make some assumptions that simplify things?
>
> No kinky filenames, just [a-zA-Z0-9.]. My approach specifically
> doesn't like colons or spaces, I bet.
>
> Also, you say gzipped, so I'm assuming it's ONLY gzip, no bzip2, etc.

Right, no other compression types - just .gz. Here's a small snippet
of the directory listing:

-rw-r-  1 kurt  kurt  108208 Dec 21 06:15 dummy-zKLQEWrDDOZh
-rw-r-  1 kurt  kurt   24989 Dec 28 17:29 dummy-zfzaEjlURTU1
-rw-r-  1 kurt  kurt   30596 Jan  2 19:37 stuff-0+-OvVrXcEoq.gz
-rw-r-  1 kurt  kurt    2055 Dec 22 20:25 stuff-0+19OXqwpEdH.gz
-rw-r-  1 kurt  kurt   13781 Dec 30 03:53 stuff-0+1bMFK2XvlQ.gz
-rw-r-  1 kurt  kurt   11485 Dec 20 04:40 stuff-0+5jriDIt0jc.gz

> Here's a first draft that might give you some ideas. It will output:
>
> foo.gz : 3456
> bar.gz : 1048576
> (etc.)
>
> find . -type f | while read fname; do
>     file $fname | grep -q compressed &&
>     echo "$fname : $(zcat $fname | wc -c)"
> done
>
> If you really need a script that will do the math for you, then pipe
> the output of this into bc:
>
> #!/bin/sh
>
> find . -type f | {
>     n=0
>     echo "scale=2"
>     echo -n "("
>     while read fname; do
>         if file $fname | grep -q compressed
>         then
>             echo -n "$(zcat $fname | wc -c)+"
>             n=$(($n+1))
>         fi
>     done
>     echo "0) / $n"
> }
>
> That should give you the average decompressed size of the gzip'ped
> files in the current directory.

Hmmm... That's the same basic approach that Giorgos took, to
uncompress the file and count bytes with wc. I'm liking the 'zcat -l'
construct, as it looks more flexible, but then I have to parse the
output, probably with grep and cut.

Time to put on my thinking cap - I'll get back to the list on this.

Kurt
Re: Batch file question - average size of file in directory
On 1/3/07, Ian Smith <[EMAIL PROTECTED]> wrote:
> > Message: 17
> > Date: Tue, 2 Jan 2007 19:50:01 -0800
> > From: James Long <[EMAIL PROTECTED]>
> >
> > > Message: 28
> > > Date: Tue, 2 Jan 2007 10:20:08 -0800
> > > From: Kurt Buff <[EMAIL PROTECTED]>
> > >
> > > > I don't even have a clue how to start this one, so am looking
> > > > for a little help.
> > > >
> > > > I've got a directory with a large number of gzipped files in it
> > > > (over 110k) along with a few thousand uncompressed files.
>
> If it were me I'd mv those into a bunch of subdirectories; things get
> really slow with more than 500 or so files per directory .. anyway ..

I just store them for a while - delete them after two weeks if
they're not needed again. The overhead isn't enough to worry about at
this point.

> > > > I'd like to find the average uncompressed size of the gzipped
> > > > files, and ignore the uncompressed files.
> > > >
> > > > How on earth would I go about doing that with the default shell
> > > > (no bash or other shells installed), or in perl, or something
> > > > like that. I'm no scripter of any great expertise, and am just
> > > > stumbling over this trying to find an approach.
> > > >
> > > > Many thanks for any help,
> > > >
> > > > Kurt
> >
> > Hi, Kurt.
>
> And hi, James,
>
> > Can I make some assumptions that simplify things?
> >
> > No kinky filenames, just [a-zA-Z0-9.]. My approach specifically
> > doesn't like colons or spaces, I bet.
> >
> > Also, you say gzipped, so I'm assuming it's ONLY gzip, no bzip2,
> > etc.
> >
> > Here's a first draft that might give you some ideas. It will
> > output:
> >
> > foo.gz : 3456
> > bar.gz : 1048576
> > (etc.)
> >
> > find . -type f | while read fname; do
> >     file $fname | grep -q compressed &&
> >     echo "$fname : $(zcat $fname | wc -c)"
> > done
>
> % file cat7/tuning.7.gz
> cat7/tuning.7.gz: gzip compressed data, from Unix
>
> Good check, though grep "gzip compressed" excludes bzip2 etc. But you
> REALLY don't want to zcat 110 thousand files just to wc 'em, unless
> it's a benchmark :) .. may I suggest a slight speedup, template:
>
> % gunzip -l cat7/tuning.7.gz
>   compressed  uncompr.  ratio  uncompressed_name
>        13642     38421  64.5%  cat7/tuning.7
>
> > If you really need a script that will do the math for you, then
> > pipe the output of this into bc:
> >
> > #!/bin/sh
> >
> > find . -type f | {
> >     n=0
> >     echo "scale=2"
> >     echo -n "("
> >     while read fname; do
> -         if file $fname | grep -q compressed
> +         if file $fname | grep -q "gzip compressed"
> >         then
> -             echo -n "$(zcat $fname | wc -c)+"
> +             echo -n "$(gunzip -l $fname | grep -v comp | awk '{print $2}')+"
> >             n=$(($n+1))
> >         fi
> >     done
> >     echo "0) / $n"
> > }
> >
> > That should give you the average decompressed size of the gzip'ped
> > files in the current directory.
>
> HTH, Ian

Ah - yes, I think that's much better. I should have thought of awk.

At some point, I'd like to do a bit more processing of file sizes,
such as trying to find out the number of IP packets each file would
take during an SMTP transaction, so that I could categorize overhead
a bit, but for now the average uncompressed file size is good enough.

Thanks again for your help!

Kurt
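As a starting point for that packet-count idea, a rough lower bound is
one TCP segment per MSS-sized chunk of the message body. A
hypothetical sketch follows; the 1460-byte MSS is an assumption
(typical for Ethernet), and it ignores the TCP handshake, ACKs, SMTP
commands, and any transfer encoding of the body:

    # Rough lower bound on data packets for one message body.
    def min_data_packets(size_bytes, mss=1460):
        return -(-size_bytes // mss)   # ceiling division

    print(min_data_packets(3280920))   # -> 2248 segments for ~3.2 MB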
Re: Batch file question - average size of file in directory
On 1/2/07, Giorgos Keramidas <[EMAIL PROTECTED]> wrote:
> On 2007-01-02 10:20, Kurt Buff <[EMAIL PROTECTED]> wrote:
>
> You can probably use awk(1) or perl(1) to post-process the output of
> gzip(1). The gzip(1) utility, when run with the -cd options, will
> uncompress the compressed files and send the uncompressed data to
> standard output, without actually affecting the on-disk copy of the
> compressed data. It is then easy to pipe the uncompressed data to
> wc(1) to count the bytes of the uncompressed data:
>
>     for fname in *.Z *.z *.gz; do
>         if test -f "${fname}"; then
>             gzip -cd "${fname}" | wc -c
>         fi
>     done
>
> This will print the byte size of the uncompressed output of gzip, for
> all the files which are currently compressed. Something like the
> following could be its output:

I put together this one-liner after perusing 'man zcat':

    find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l > out.txt

It puts out multiple instances of stuff like this:

    compressed  uncompr.  ratio  uncompressed_name
          1508      3470  57.0%  stuff-7f+BIOFX1-qX
          1660      3576  54.0%  stuff-bsFK-yGcWyCm
          9113     17065  46.7%  stuff-os1MKlKGu8ky
           ...       ...    ...
      10214796  17845081  42.7%  (totals)
    compressed  uncompr.  ratio  uncompressed_name
          7790     14732  47.2%  stuff-Z3UO7-uvMANd
          1806      3705  51.7%  stuff-9ADk-DSBFQGQ
          9020     16638  45.8%  stuff-Caqfgao-Tc5F
          7508     14361  47.8%  stuff-kVUWa8ua4zxc

I'm thinking that piping the output like so:

    find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l | \
        grep -v compress | grep -v totals

will do to suppress extraneous header/footer info.

> This can be piped into awk(1) for further processing, with something
> like this:
>
>     for fname in *.Z *.gz; do
>         if test -f "$fname"; then
>             gzip -cd "$fname" | wc -c
>         fi
>     done | \
>     awk 'BEGIN {
>             min = -1; max = 0; total = 0;
>          }
>          {
>             total += $1;
>             if ($1 > max) {
>                 max = $1;
>             }
>             if (min == -1 || $1 < min) {
>                 min = $1;
>             }
>          }
>          END {
>             if (NR > 0) {
>                 printf "min/avg/max file size = %d/%d/%d\n",
>                     min, total / NR, max;
>             }
>          }'
>
> With the same files as above, the output of this would be:
>
>     min/avg/max file size = 220381/1750650/3280920
>
> With a slightly modified awk(1) script, you can even print a running
> min/average/max count, following each line. [...]
>
> Please note though that with a sufficiently large set of files,
> awk(1) may fail to count the total number of bytes correctly. If
> this is the case, it should be easy to write an equivalent Perl or
> Python script, to take advantage of their big-number support.

I'll try to parse and understand this, and see if I can modify it to
suit the output I'm currently generating.

Many thanks for the help!

Kurt
Re: Batch file question - average size of file in directory
On 2007-01-03 10:42, Kurt Buff <[EMAIL PROTECTED]> wrote:
> On 1/2/07, James Long <[EMAIL PROTECTED]> wrote:
> <snip my problem description>
> > Hi, Kurt.
> >
> > Can I make some assumptions that simplify things?
> >
> > No kinky filenames, just [a-zA-Z0-9.]. My approach specifically
> > doesn't like colons or spaces, I bet.
> >
> > Also, you say gzipped, so I'm assuming it's ONLY gzip, no bzip2,
> > etc.
>
> Right, no other compression types - just .gz. Here's a small snippet
> of the directory listing:
>
> -rw-r-  1 kurt  kurt  108208 Dec 21 06:15 dummy-zKLQEWrDDOZh
> -rw-r-  1 kurt  kurt   24989 Dec 28 17:29 dummy-zfzaEjlURTU1
> -rw-r-  1 kurt  kurt   30596 Jan  2 19:37 stuff-0+-OvVrXcEoq.gz
> -rw-r-  1 kurt  kurt    2055 Dec 22 20:25 stuff-0+19OXqwpEdH.gz
> -rw-r-  1 kurt  kurt   13781 Dec 30 03:53 stuff-0+1bMFK2XvlQ.gz
> -rw-r-  1 kurt  kurt   11485 Dec 20 04:40 stuff-0+5jriDIt0jc.gz
>
> > Here's a first draft [...]
>
> Hmmm... That's the same basic approach that Giorgos took, to
> uncompress the file and count bytes with wc. I'm liking the
> 'zcat -l' construct, as it looks more flexible, but then I have to
> parse the output, probably with grep and cut.

Excellent. I didn't know about the -l option of gzip(1) until today :)

You can easily extract the uncompressed size, because it's always in
column 2 and it contains only numeric digits:

    gzip -l *.gz *.Z *.z | awk '{print $2}' | grep '[[:digit:]]\+'

Then you can feed the resulting stream of uncompressed sizes to the
awk script I sent before :)
Re: Batch file question - average size of file in directory
On 2007-01-03 10:28, Kurt Buff <[EMAIL PROTECTED]> wrote:
> I put together this one-liner after perusing 'man zcat':
>
>     find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l > out.txt
>
> It puts out multiple instances of stuff like this:
>
>     compressed  uncompr.  ratio  uncompressed_name
>           1508      3470  57.0%  stuff-7f+BIOFX1-qX
>           1660      3576  54.0%  stuff-bsFK-yGcWyCm
>           9113     17065  46.7%  stuff-os1MKlKGu8ky
>            ...       ...    ...
>       10214796  17845081  42.7%  (totals)
>     compressed  uncompr.  ratio  uncompressed_name
>           7790     14732  47.2%  stuff-Z3UO7-uvMANd
>           1806      3705  51.7%  stuff-9ADk-DSBFQGQ
>           9020     16638  45.8%  stuff-Caqfgao-Tc5F
>           7508     14361  47.8%  stuff-kVUWa8ua4zxc
>
> I'm thinking that piping the output like so:
>
>     find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l | \
>         grep -v compress | grep -v totals
>
> will do to suppress extraneous header/footer info.

Sure. This is also better than grabbing the second column
unconditionally, which I suggested before :)
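The same header/totals filtering can be done when post-processing in a
scripting language. A minimal Python sketch, under the same
assumptions as before (plain .gz files, no embedded newlines in
filenames):

    #!/usr/bin/env python
    # Keep column 2 (uncompressed size) of gzip -l output, skipping
    # the header line and the trailing (totals) line instead of
    # grabbing the column unconditionally.
    import glob
    import subprocess

    names = glob.glob('*.gz')
    sizes = []
    if names:
        out = subprocess.check_output(['gzip', '-l'] + names)
        for line in out.decode().splitlines():
            fields = line.split()
            if not fields or fields[0] == 'compressed' \
                    or fields[-1] == '(totals)':
                continue
            sizes.append(int(fields[1]))

    if sizes:
        print("min/avg/max = %d/%d/%d"
              % (min(sizes), sum(sizes) // len(sizes), max(sizes)))

(With 110k files the argument list would exceed the kernel's limit, so
in practice the glob would need to be fed to gzip -l in batches, much
like the xargs pipeline above.)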
Re: Batch file question - average size of file in directory
Date: Thu, 4 Jan 2007 04:46:43 +1100 (EST)
From: Ian Smith <[EMAIL PROTECTED]>
Subject: Re: Batch file question - average size of file in directory
To: freebsd-questions@freebsd.org
Cc: James Long <[EMAIL PROTECTED]>, Kurt Buff <[EMAIL PROTECTED]>
Message-ID: <[EMAIL PROTECTED]>
Content-Type: TEXT/PLAIN; charset=US-ASCII

> ... you REALLY don't want to zcat 110 thousand files just to wc 'em,
> unless it's a benchmark :)

Quite right! Well played.
Re: Batch file question - average size of file in directory
On Wed, 3 Jan 2007, Kurt Buff wrote:
> On 1/3/07, Ian Smith <[EMAIL PROTECTED]> wrote:
> > > From: James Long <[EMAIL PROTECTED]>
> > > > From: Kurt Buff <[EMAIL PROTECTED]>
> > > > [..]
> > > > I've got a directory with a large number of gzipped files in it
> > > > (over 110k) along with a few thousand uncompressed files.
> >
> > If it were me I'd mv those into a bunch of subdirectories; things
> > get really slow with more than 500 or so files per directory ..
> > anyway ..
>
> I just store them for a while - delete them after two weeks if
> they're not needed again. The overhead isn't enough to worry about
> at this point.

Fair enough. We once had a security webcam gadget ftp'ing images into
a directory every minute, 1440/day, but a php script listing the files
for display was timing out just on the 'ls' when over ~2000 files on a
2.4G P4, prompting better (in that case, directory per day)
organisation.

[..]

> > while read fname; do
> > -    if file $fname | grep -q compressed
> > +    if file $fname | grep -q "gzip compressed"
> >      then
> > -        echo -n "$(zcat $fname | wc -c)+"
> > +        echo -n "$(gunzip -l $fname | grep -v comp | awk '{print $2}')+"

That was off the top of my (then tired) head, and will of course barf
if 'comp' appears anywhere in a filename; it should be 'grep -v
^comp'.

> Ah - yes, I think that's much better. I should have thought of awk.

That's the extent of my awk-foo, see Giorgos' post for fancier stuff :)
And thanks to James for the base script to bother playing with ..

Cheers, Ian
Re: Batch file question - average size of file in directory
Message: 28
Date: Tue, 2 Jan 2007 10:20:08 -0800
From: Kurt Buff <[EMAIL PROTECTED]>
Subject: Batch file question - average size of file in directory
To: [EMAIL PROTECTED]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

> All,
>
> I don't even have a clue how to start this one, so am looking for a
> little help.
>
> I've got a directory with a large number of gzipped files in it (over
> 110k) along with a few thousand uncompressed files. I'd like to find
> the average uncompressed size of the gzipped files, and ignore the
> uncompressed files.
>
> How on earth would I go about doing that with the default shell (no
> bash or other shells installed), or in perl, or something like that.
> I'm no scripter of any great expertise, and am just stumbling over
> this trying to find an approach.
>
> Many thanks for any help,
>
> Kurt

Hi, Kurt.

Can I make some assumptions that simplify things?

No kinky filenames, just [a-zA-Z0-9.]. My approach specifically
doesn't like colons or spaces, I bet.

Also, you say gzipped, so I'm assuming it's ONLY gzip, no bzip2, etc.

Here's a first draft that might give you some ideas. It will output:

foo.gz : 3456
bar.gz : 1048576
(etc.)

    find . -type f | while read fname; do
        file $fname | grep -q compressed &&
        echo "$fname : $(zcat $fname | wc -c)"
    done

If you really need a script that will do the math for you, then pipe
the output of this into bc:

    #!/bin/sh

    find . -type f | {
        n=0
        echo "scale=2"
        echo -n "("
        while read fname; do
            if file $fname | grep -q compressed
            then
                echo -n "$(zcat $fname | wc -c)+"
                n=$(($n+1))
            fi
        done
        echo "0) / $n"
    }

That should give you the average decompressed size of the gzip'ped
files in the current directory.