On 1/2/07, Giorgos Keramidas <[EMAIL PROTECTED]> wrote:
On 2007-01-02 10:20, Kurt Buff <[EMAIL PROTECTED]> wrote:
You can probably use awk(1) or perl(1) to post-process the output of
gzip(1).
The gzip(1) utility, when run with the -cd options, uncompresses the
compressed files and sends the uncompressed data to standard output,
without actually affecting the on-disk copy of the compressed data.
It is then easy to pipe the uncompressed data to wc(1) to count the
bytes of the uncompressed data:
    for fname in *.Z *.z *.gz; do
        if test -f "${fname}"; then
            gzip -cd "${fname}" | wc -c
        fi
    done
This will print the byte-size of the uncompressed output of gzip, for
all the files which are currently compressed. With two compressed
files, something like the following could be its output:

    220381
    3280920
I put together this one-liner after perusing 'man zcat':
find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l >> out.txt
It puts out multiple instances of stuff like this:
compressed uncompr. ratio uncompressed_name
      1508     3470 57.0% stuff-7f+BIOFX1-qX
      1660     3576 54.0% stuff-bsFK-yGcWyCm
      9113    17065 46.7% stuff-os1MKlKGu8ky
       ...
  10214796 17845081 42.7% (totals)
compressed uncompr. ratio uncompressed_name
      7790    14732 47.2% stuff-Z3UO7-uvMANd
      1806     3705 51.7% stuff-9ADk-DSBFQGQ
      9020    16638 45.8% stuff-Caqfgao-Tc5F
      7508    14361 47.8% stuff-kVUWa8ua4zxc
I'm thinking that piping the output like so:

    find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l |
        grep -v compress | grep -v totals

will suppress the extraneous header/footer lines.
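For example, once the header and totals lines are filtered out, the
per-file uncompressed sizes in column 2 can be summed directly in
awk(1). Here is a self-contained sketch using gzip -l (the same
listing zcat -l produces where zcat forwards its options to gzip);
the scratch directory and sample files are just placeholders:

```shell
# Build two small sample .gz files in a scratch directory (placeholders
# standing in for the real compressed mail files).
dir=$(mktemp -d)
printf 'hello world\n' > "$dir/a"
printf 'some more sample text\n' > "$dir/b"
gzip "$dir/a" "$dir/b"

# Sum column 2 (the uncompressed size), skipping the repeated header
# lines and the "(totals)" footer that each gzip invocation emits.
total=$(find "$dir" -name "*.gz" -print | xargs gzip -l |
    awk '/compressed|totals/ { next } { sum += $2 } END { print sum }')
echo "$total"    # total uncompressed bytes across all files

rm -rf "$dir"
```

Filtering inside awk itself avoids the extra grep processes, and the
sum survives even when xargs splits the file list across several gzip
invocations, each with its own header and totals lines.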
This can be piped into awk(1) for further processing, with something
like this:
    for fname in *.Z *.gz; do
        if test -f "$fname"; then
            gzip -cd "$fname" | wc -c
        fi
    done | \
    awk 'BEGIN {
        min = -1; max = 0; total = 0;
    }
    {
        total += $1;
        if ($1 > max) {
            max = $1;
        }
        if (min == -1 || $1 < min) {
            min = $1;
        }
    }
    END {
        if (NR > 0) {
            printf "min/avg/max file size = %d/%d/%d\n",
                min, total / NR, max;
        }
    }'
With the same files as above, the output of this would be:
min/avg/max file size = 220381/1750650/3280920
With a slightly modified awk(1) script, you can even print a running
min/average/max count after each line. Modified lines are marked with
a pipe character (`|') in their leftmost column below; the '|'
characters are *not* part of the script itself.
    for fname in *.Z *.gz; do
        if test -f "$fname"; then
            gzip -cd "$fname" | wc -c
        fi
    done | \
    awk 'BEGIN {
        min = -1; max = 0; total = 0;
|       printf "%10s %10s %10s %10s\n",
|           "SIZE", "MIN", "AVERAGE", "MAX";
    }
    {
        total += $1;
        if ($1 > max) {
            max = $1;
        }
        if (min == -1 || $1 < min) {
            min = $1;
        }
|       printf "%10d %10d %10d %10d\n",
|           $1, min, total/NR, max;
    }
    END {
        if (NR > 0) {
|           printf "%10s %10d %10d %10d\n",
|               "TOTAL", min, total / NR, max;
        }
    }'
When run with the same set of two compressed files this will print:
      SIZE        MIN    AVERAGE        MAX
    220381     220381     220381     220381
   3280920     220381    1750650    3280920
     TOTAL     220381    1750650    3280920
Please note though that with a sufficiently large set of files, awk(1)
may fail to count the total number of bytes correctly: awk stores
numbers as double-precision floating point, so totals above 2^53 lose
precision. If this is the case, it should be easy to write an
equivalent Perl or Python script, to take advantage of their
big-number support.
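As a sketch of that idea (assuming python3 is installed; the scratch
directory and sample files are placeholders), the same byte counts can
be piped into a short Python one-liner instead of awk, since Python
integers are arbitrary-precision and cannot overflow:

```shell
# Two small sample .gz files standing in for the real compressed data.
dir=$(mktemp -d)
printf 'abc\n' > "$dir/a"
printf 'defgh\n' > "$dir/b"
gzip "$dir/a" "$dir/b"

# Same loop as before, but the min/avg/max arithmetic is done in
# Python, whose integers never lose precision however large the total.
stats=$(for fname in "$dir"/*.gz; do
    gzip -cd "$fname" | wc -c
done | python3 -c '
import sys
sizes = [int(line) for line in sys.stdin]
print(min(sizes), sum(sizes) // len(sizes), max(sizes))')
echo "$stats"    # min, average, and max uncompressed size

rm -rf "$dir"
```

The shell side is unchanged; only the consumer at the end of the pipe
is swapped out, so the same trick works with the running-stats variant
above as well.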
I'll try to parse and understand this, and see if I can modify it to
suit the output I'm currently generating.
Many thanks for the help!
Kurt
_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"