Re: Batch file question - average size of file in directory

2007-01-03 Thread Giorgos Keramidas
On 2007-01-02 10:20, Kurt Buff [EMAIL PROTECTED] wrote:
 All,

 I don't even have a clue how to start this one, so am looking for a
 little help.

 I've got a directory with a large number of gzipped files in it (over
 110k) along with a few thousand uncompressed files.

 I'd like to find the average uncompressed size of the gzipped files,
 and ignore the uncompressed files.

 How on earth would I go about doing that with the default shell (no
 bash or other shells installed), or in perl, or something like that.
 I'm no scripter of any great expertise, and am just stumbling over
 this trying to find an approach.

You can probably use awk(1) or perl(1) to post-process the output of
gzip(1).

The gzip(1) utility, when run with the -cd options will uncompress the
compressed files and send the uncompressed data to standard output,
without actually affecting the on-disk copy of the compressed data.

It is easy then to pipe the uncompressed data to wc(1) to count the
'bytes' of the uncompressed data:

for fname in *.Z *.z *.gz; do
if test -f ${fname}; then
gzip -cd ${fname} | wc -c
fi
done

This will print the byte-size of the uncompressed output of gzip, for
all the files which are currently compressed.  Something like the
following could be its output:

  220381
 3280920

This can be piped into awk(1) for further processing, with something
like this:

for fname in *.Z *.gz; do
    if test -f $fname; then
        gzip -cd $fname | wc -c
    fi
done | \
awk 'BEGIN {
    min = -1; max = 0; total = 0;
}
{
    total += $1;
    if ($1 > max) {
        max = $1;
    }
    if (min == -1 || $1 < min) {
        min = $1;
    }
}
END {
    if (NR > 0) {
        printf "min/avg/max file size = %d/%d/%d\n",
            min, total / NR, max;
    }
}'

With the same files as above, the output of this would be:

min/avg/max file size = 220381/1750650/3280920

With a slightly modified awk(1) script, you can even print a running
min/average/max count after each line.  Modified lines are marked with
a pipe character (`|') in their leftmost column below; the '|'
characters are *not* part of the script itself.

for fname in *.Z *.gz; do
    if test -f $fname; then
        gzip -cd $fname | wc -c
    fi
done | \
awk 'BEGIN {
    min = -1; max = 0; total = 0;
|   printf "%10s %10s %10s %10s\n",
|       "SIZE", "MIN", "AVERAGE", "MAX";
}
{
    total += $1;
    if ($1 > max) {
        max = $1;
    }
    if (min == -1 || $1 < min) {
        min = $1;
    }
|   printf "%10d %10d %10d %10d\n",
|       $1, min, total/NR, max;
}
END {
    if (NR > 0) {
|       printf "%10s %10d %10d %10d\n",
|           "TOTAL", min, total / NR, max;
    }
}'

When run with the same set of two compressed files this will print:

      SIZE        MIN    AVERAGE        MAX
    220381     220381     220381     220381
   3280920     220381    1750650    3280920
     TOTAL     220381    1750650    3280920

Please note though that with a sufficiently large set of files, awk(1)
may fail to count the total number of bytes correctly.  If this is the
case, it should be easy to write an equivalent Perl or Python script,
to take advantage of their big-number support.
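
For instance, sticking with base-system tools, here is a rough, untested
sketch of the same average computed with bc(1), which works with arbitrary
precision, so the running total cannot overflow:

for fname in *.Z *.z *.gz; do
    test -f "${fname}" && gzip -cd "${fname}" | wc -c
done | {
    # emit a small bc(1) program: accumulate the sizes into t,
    # then divide by the number of files we counted
    n=0
    echo "scale=2"
    echo "t = 0"
    while read bytes; do
        echo "t = t + ${bytes}"
        n=$((n + 1))
    done
    if [ ${n} -gt 0 ]; then
        echo "t / ${n}"
    fi
} | bc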



Re: Batch file question - average size of file in directory

2007-01-03 Thread Ian Smith
  Message: 17
  Date: Tue, 2 Jan 2007 19:50:01 -0800
  From: James Long [EMAIL PROTECTED]

   Message: 28
   Date: Tue, 2 Jan 2007 10:20:08 -0800
   From: Kurt Buff [EMAIL PROTECTED]

   I don't even have a clue how to start this one, so am looking for a little 
   help.
   
   I've got a directory with a large number of gzipped files in it (over
   110k) along with a few thousand uncompressed files.

If it were me I'd mv those into a bunch of subdirectories; things get
really slow with more than 500 or so files per directory .. anyway .. 

   I'd like to find the average uncompressed size of the gzipped files,
   and ignore the uncompressed files.
   
   How on earth would I go about doing that with the default shell (no
   bash or other shells installed), or in perl, or something like that.
   I'm no scripter of any great expertise, and am just stumbling over
   this trying to find an approach.
   
   Many thanks for any help,
   
   Kurt
  
  Hi, Kurt.

And hi, James,

  Can I make some assumptions that simplify things?  No kinky filenames, 
  just [a-zA-Z0-9.].  My approach specifically doesn't like colons or 
  spaces, I bet.  Also, you say gzipped, so I'm assuming it's ONLY gzip, 
  no bzip2, etc.
 
  Here's a first draft that might give you some ideas.  It will output:
  
  foo.gz : 3456
  bar.gz : 1048576
  (etc.)
  
  find . -type f | while read fname; do
    file $fname | grep -q compressed && echo "$fname : $(zcat $fname | wc -c)"
  done

 % file cat7/tuning.7.gz
 cat7/tuning.7.gz: gzip compressed data, from Unix

Good check, though grep "gzip compressed" excludes bzip2 etc.

But you REALLY don't want to zcat 110 thousand files just to wc 'em,
unless it's a benchmark :) .. may I suggest a slight speedup, template:

 % gunzip -l cat7/tuning.7.gz
 compressed  uncompr. ratio uncompressed_name
 13642 38421  64.5% cat7/tuning.7

  If you really need a script that will do the math for you, then
  pipe the output of this into bc:
  
  #!/bin/sh

  find . -type f | {

  n=0
  echo "scale=2"
  echo -n "("
  while read fname; do
-   if file $fname | grep -q compressed
+   if file $fname | grep -q "gzip compressed"
    then
-     echo -n "$(zcat $fname | wc -c)+"
+     echo -n "$(gunzip -l $fname | grep -v comp | awk '{print $2}')+"
      n=$(($n+1))
    fi
  done
  echo "0) / $n"

  }
  
  That should give you the average decompressed size of the gzip'ped
  files in the current directory.
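
For reference, the whole thing with those two changes folded in would look
something like this (untested):

#!/bin/sh

# average uncompressed size of the gzip'ped files under the current
# directory, letting gunzip -l report the sizes instead of zcat | wc -c
find . -type f | {

n=0
echo "scale=2"
echo -n "("
while read fname; do
  if file $fname | grep -q "gzip compressed"
  then
    echo -n "$(gunzip -l $fname | grep -v comp | awk '{print $2}')+"
    n=$(($n+1))
  fi
done
echo "0) / $n"

}

and pipe its output into bc as before.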

HTH, Ian



Re: Batch file question - average size of file in directory

2007-01-03 Thread Kurt Buff

On 1/2/07, James Long [EMAIL PROTECTED] wrote:
snip my problem description

Hi, Kurt.

Can I make some assumptions that simplify things?  No kinky filenames,
just [a-zA-Z0-9.].  My approach specifically doesn't like colons or
spaces, I bet.  Also, you say gzipped, so I'm assuming it's ONLY gzip,
no bzip2, etc.


Right, no other compression types - just .gz.

Here's a small snippet of the directory listing:

-rw-r-  1 kurt  kurt   108208 Dec 21 06:15 dummy-zKLQEWrDDOZh
-rw-r-  1 kurt  kurt24989 Dec 28 17:29 dummy-zfzaEjlURTU1
-rw-r-  1 kurt  kurt30596 Jan  2 19:37 stuff-0+-OvVrXcEoq.gz
-rw-r-  1 kurt  kurt 2055 Dec 22 20:25 stuff-0+19OXqwpEdH.gz
-rw-r-  1 kurt  kurt13781 Dec 30 03:53 stuff-0+1bMFK2XvlQ.gz
-rw-r-  1 kurt  kurt11485 Dec 20 04:40 stuff-0+5jriDIt0jc.gz



Here's a first draft that might give you some ideas.  It will output:

foo.gz : 3456
bar.gz : 1048576
(etc.)

find . -type f | while read fname; do
  file $fname | grep -q compressed && echo "$fname : $(zcat $fname | wc -c)"
done


If you really need a script that will do the math for you, then
pipe the output of this into bc:

#!/bin/sh

find . -type f | {

n=0
echo "scale=2"
echo -n "("
while read fname; do
  if file $fname | grep -q compressed
  then
    echo -n "$(zcat $fname | wc -c)+"
    n=$(($n+1))
  fi
done
echo "0) / $n"

}

That should give you the average decompressed size of the gzip'ped
files in the current directory.



Hmmm

That's the same basic approach that Giorgos took, to uncompress the
file and count bytes with wc. I'm liking the 'zcat -l' construct, as
it looks more flexible, but then I have to parse the output, probably
with grep and cut.
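
(Something like this might be a starting point -- untested, and assuming
zcat -l's usual column layout:)

# hypothetical sketch: drop zcat -l's header and (totals) lines, strip
# leading blanks, squeeze the column padding, and keep the second
# (uncompressed-size) column
find . -name "*.gz" | xargs zcat -l | grep -v compress | grep -v totals | \
    sed 's/^ *//' | tr -s ' ' | cut -d ' ' -f 2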

Time to put on my thinking cap - I'll get back to the list on this.

Kurt


Re: Batch file question - average size of file in directory

2007-01-03 Thread Kurt Buff

On 1/3/07, Ian Smith [EMAIL PROTECTED] wrote:

  Message: 17
  Date: Tue, 2 Jan 2007 19:50:01 -0800
  From: James Long [EMAIL PROTECTED]

   Message: 28
   Date: Tue, 2 Jan 2007 10:20:08 -0800
   From: Kurt Buff [EMAIL PROTECTED]

   I don't even have a clue how to start this one, so am looking for a little 
help.
  
   I've got a directory with a large number of gzipped files in it (over
   110k) along with a few thousand uncompressed files.

If it were me I'd mv those into a bunch of subdirectories; things get
really slow with more than 500 or so files per directory .. anyway ..


I just store them for a while - delete them after two weeks if they're
not needed again. The overhead isn't enough to worry about at this
point.


   I'd like to find the average uncompressed size of the gzipped files,
   and ignore the uncompressed files.
  
   How on earth would I go about doing that with the default shell (no
   bash or other shells installed), or in perl, or something like that.
   I'm no scripter of any great expertise, and am just stumbling over
   this trying to find an approach.
  
   Many thanks for any help,
  
   Kurt
 
  Hi, Kurt.

And hi, James,

  Can I make some assumptions that simplify things?  No kinky filenames,
  just [a-zA-Z0-9.].  My approach specifically doesn't like colons or
  spaces, I bet.  Also, you say gzipped, so I'm assuming it's ONLY gzip,
  no bzip2, etc.
 
  Here's a first draft that might give you some ideas.  It will output:
 
  foo.gz : 3456
  bar.gz : 1048576
  (etc.)
 
  find . -type f | while read fname; do
    file $fname | grep -q compressed && echo "$fname : $(zcat $fname | wc -c)"
  done

 % file cat7/tuning.7.gz
 cat7/tuning.7.gz: gzip compressed data, from Unix

Good check, though grep "gzip compressed" excludes bzip2 etc.

But you REALLY don't want to zcat 110 thousand files just to wc 'em,
unless it's a benchmark :) .. may I suggest a slight speedup, template:

 % gunzip -l cat7/tuning.7.gz
 compressed  uncompr. ratio uncompressed_name
 13642 38421  64.5% cat7/tuning.7

  If you really need a script that will do the math for you, then
  pipe the output of this into bc:
 
  #!/bin/sh

  find . -type f | {

  n=0
  echo "scale=2"
  echo -n "("
  while read fname; do
-   if file $fname | grep -q compressed
+   if file $fname | grep -q "gzip compressed"
    then
-     echo -n "$(zcat $fname | wc -c)+"
+     echo -n "$(gunzip -l $fname | grep -v comp | awk '{print $2}')+"
      n=$(($n+1))
    fi
  done
  echo "0) / $n"

  }
 
  That should give you the average decompressed size of the gzip'ped
  files in the current directory.

HTH, Ian



Ah - yes, I think that's much better. I should have thought of awk.

At some point, I'd like to do a bit more processing of file sizes,
such as trying to find out the number of IP packets each file would
take during an SMTP transaction, so that I could categorize overhead a
bit, but for now the average uncompressed file size is good enough.
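
(A very rough first cut at that -- untested, ignoring SMTP/MIME encoding
and TCP/IP header overhead, and assuming roughly 1460 data bytes per
packet -- could just divide the sizes the awk script already sees:)

# hypothetical: read one uncompressed size per line and print the size
# plus an estimated packet count at ~1460 payload bytes per packet
awk '{ printf "%d bytes, ~%d packets\n", $1, int(($1 + 1459) / 1460) }'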

Thanks again for your help!

Kurt


Re: Batch file question - average size of file in directory

2007-01-03 Thread Kurt Buff

On 1/2/07, Giorgos Keramidas [EMAIL PROTECTED] wrote:

On 2007-01-02 10:20, Kurt Buff [EMAIL PROTECTED] wrote:
You can probably use awk(1) or perl(1) to post-process the output of
gzip(1).

The gzip(1) utility, when run with the -cd options will uncompress the
compressed files and send the uncompressed data to standard output,
without actually affecting the on-disk copy of the compressed data.

It is easy then to pipe the uncompressed data to wc(1) to count the
'bytes' of the uncompressed data:

for fname in *.Z *.z *.gz; do
if test -f ${fname}; then
gzip -cd ${fname} | wc -c
fi
done

This will print the byte-size of the uncompressed output of gzip, for
all the files which are currently compressed.  Something like the
following could be its output:


I put together this one-liner after perusing 'man zcat':

find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l > out.txt

It puts out multiple instances of stuff like this:

compressed  uncompr. ratio uncompressed_name
1508  3470  57.0% stuff-7f+BIOFX1-qX
1660  3576  54.0% stuff-bsFK-yGcWyCm
9113 17065  46.7% stuff-os1MKlKGu8ky
...
...
...
10214796  17845081  42.7% (totals)
compressed  uncompr. ratio uncompressed_name
7790 14732  47.2% stuff-Z3UO7-uvMANd
1806  3705  51.7% stuff-9ADk-DSBFQGQ
9020 16638  45.8% stuff-Caqfgao-Tc5F
7508 14361  47.8% stuff-kVUWa8ua4zxc

I'm thinking that piping the output like so:

find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l |
grep -v compress | grep -v totals

will do to suppress the extraneous header/footer info.



This can be piped into awk(1) for further processing, with something
like this:

for fname in *.Z *.gz; do
    if test -f $fname; then
        gzip -cd $fname | wc -c
    fi
done | \
awk 'BEGIN {
    min = -1; max = 0; total = 0;
}
{
    total += $1;
    if ($1 > max) {
        max = $1;
    }
    if (min == -1 || $1 < min) {
        min = $1;
    }
}
END {
    if (NR > 0) {
        printf "min/avg/max file size = %d/%d/%d\n",
            min, total / NR, max;
    }
}'

With the same files as above, the output of this would be:

min/avg/max file size = 220381/1750650/3280920

With a slightly modified awk(1) script, you can even print a running
min/average/max count after each line.  Modified lines are marked with
a pipe character (`|') in their leftmost column below; the '|'
characters are *not* part of the script itself.

for fname in *.Z *.gz; do
    if test -f $fname; then
        gzip -cd $fname | wc -c
    fi
done | \
awk 'BEGIN {
    min = -1; max = 0; total = 0;
|   printf "%10s %10s %10s %10s\n",
|       "SIZE", "MIN", "AVERAGE", "MAX";
}
{
    total += $1;
    if ($1 > max) {
        max = $1;
    }
    if (min == -1 || $1 < min) {
        min = $1;
    }
|   printf "%10d %10d %10d %10d\n",
|       $1, min, total/NR, max;
}
END {
    if (NR > 0) {
|       printf "%10s %10d %10d %10d\n",
|           "TOTAL", min, total / NR, max;
    }
}'

When run with the same set of two compressed files this will print:

      SIZE        MIN    AVERAGE        MAX
    220381     220381     220381     220381
   3280920     220381    1750650    3280920
     TOTAL     220381    1750650    3280920

Please note though that with a sufficiently large set of files, awk(1)
may fail to count the total number of bytes correctly.  If this is the
case, it should be easy to write an equivalent Perl or Python script,
to take advantage of their big-number support.


I'll try to parse and understand this, and see if I can modify it to
suit the output I'm currently generating.

Many thanks for the help!

Kurt


Re: Batch file question - average size of file in directory

2007-01-03 Thread Giorgos Keramidas
On 2007-01-03 10:42, Kurt Buff [EMAIL PROTECTED] wrote:
 On 1/2/07, James Long [EMAIL PROTECTED] wrote:
 snip my problem description
 Hi, Kurt.
 
 Can I make some assumptions that simplify things?  No kinky filenames,
 just [a-zA-Z0-9.].  My approach specifically doesn't like colons or
 spaces, I bet.  Also, you say gzipped, so I'm assuming it's ONLY gzip,
 no bzip2, etc.

 Right, no other compression types - just .gz.

 Here's a small snippet of the directory listing:

 -rw-r-  1 kurt  kurt   108208 Dec 21 06:15 dummy-zKLQEWrDDOZh
 -rw-r-  1 kurt  kurt24989 Dec 28 17:29 dummy-zfzaEjlURTU1
 -rw-r-  1 kurt  kurt30596 Jan  2 19:37 stuff-0+-OvVrXcEoq.gz
 -rw-r-  1 kurt  kurt 2055 Dec 22 20:25 stuff-0+19OXqwpEdH.gz
 -rw-r-  1 kurt  kurt13781 Dec 30 03:53 stuff-0+1bMFK2XvlQ.gz
 -rw-r-  1 kurt  kurt11485 Dec 20 04:40 stuff-0+5jriDIt0jc.gz

 Here's a first draft [...]

 Hmmm

 That's the same basic approach that Giorgos took, to uncompress the
 file and count bytes with wc. I'm liking the 'zcat -l' construct, as
 it looks more flexible, but then I have to parse the output, probably
 with grep and cut.

Excellent.  I didn't know about the -l option of gzip(1) until today :)

You can easily extract the uncompressed size, because it's always in
column 2 and it contains only numeric digits:

gzip -l *.gz *.Z *.z | awk '{print $2}' | grep '[[:digit:]]\+'

Then you can feed the resulting stream of uncompressed sizes to the awk
script I sent before :)
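
For example, something along these lines (untested; the extra grep -v drops
the "(totals)" line which gzip -l prints when it is given more than one file):

gzip -l *.gz *.Z *.z | grep -v totals | awk '{print $2}' | \
    grep '[[:digit:]]' | \
    awk 'BEGIN {
        min = -1; max = 0; total = 0;
    }
    {
        total += $1;
        if ($1 > max) max = $1;
        if (min == -1 || $1 < min) min = $1;
    }
    END {
        if (NR > 0)
            printf "min/avg/max file size = %d/%d/%d\n",
                min, total / NR, max;
    }'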



Re: Batch file question - average size of file in directory

2007-01-03 Thread Giorgos Keramidas
On 2007-01-03 10:28, Kurt Buff [EMAIL PROTECTED] wrote:
 I put together this one-liner after perusing 'man zcat':
 
 find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l > out.txt
 
 It puts out multiple instances of stuff like this:
 
 compressed  uncompr. ratio uncompressed_name
 1508  3470  57.0% stuff-7f+BIOFX1-qX
 1660  3576  54.0% stuff-bsFK-yGcWyCm
 9113 17065  46.7% stuff-os1MKlKGu8ky
 ...
 ...
 ...
 10214796  17845081  42.7% (totals)
 compressed  uncompr. ratio uncompressed_name
 7790 14732  47.2% stuff-Z3UO7-uvMANd
 1806  3705  51.7% stuff-9ADk-DSBFQGQ
 9020 16638  45.8% stuff-Caqfgao-Tc5F
 7508 14361  47.8% stuff-kVUWa8ua4zxc
 
 I'm thinking that piping the output like so:
 
 find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l |
 grep -v compress | grep -v totals
 
 will do to suppress the extraneous header/footer info.

Sure.  This is also better than grabbing the second column
unconditionally, which I suggested before :)
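
Putting the two together, something like this should print the average
directly (untested):

find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l | \
    grep -v compress | grep -v totals | \
    awk '{ total += $2 }
        END { if (NR > 0) printf "average uncompressed size = %d\n", total / NR }'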


Re: Batch file question - average size of file in directory

2007-01-03 Thread James Long
 Date: Thu, 4 Jan 2007 04:46:43 +1100 (EST)
 From: Ian Smith [EMAIL PROTECTED]
 Subject: Re: Batch file question - average size of file in directory
 To: freebsd-questions@freebsd.org
 Cc: James Long [EMAIL PROTECTED], Kurt Buff [EMAIL PROTECTED]
 Message-ID:
   [EMAIL PROTECTED]
 Content-Type: TEXT/PLAIN; charset=US-ASCII
 
 ... you REALLY don't want to zcat 110 thousand files just to wc 'em,
 unless it's a benchmark :)

Quite right!  Well played.



Re: Batch file question - average size of file in directory

2007-01-03 Thread Ian Smith
On Wed, 3 Jan 2007, Kurt Buff wrote:

  On 1/3/07, Ian Smith [EMAIL PROTECTED] wrote:
 From: James Long [EMAIL PROTECTED]
  From: Kurt Buff [EMAIL PROTECTED]

[..]

  I've got a directory with a large number of gzipped files in it (over
  110k) along with a few thousand uncompressed files.
  
   If it were me I'd mv those into a bunch of subdirectories; things get
   really slow with more than 500 or so files per directory .. anyway ..
  
  I just store them for a while - delete them after two weeks if they're
  not needed again. The overhead isn't enough to worry about at this
  point.

Fair enough.  We once had a security webcam gadget ftp'ing images into a
directory every minute, 1440/day, but a php script listing the files for
display was timing out just on the 'ls' when over ~2000 files on a 2.4G
P4, prompting better (in that case, directory per day) organisation. 

[..]

 while read fname; do
   -if file $fname | grep -q compressed
   +if file $fname | grep -q "gzip compressed"
   then
   -  echo -n "$(zcat $fname | wc -c)+"
   +  echo -n "$(gunzip -l $fname | grep -v comp | awk '{print $2}')+"

That was off the top of my (then tired) head, and will of course barf if
'comp' appears anywhere in a filename; it should be 'grep -v ^comp'.

  Ah - yes, I think that's much better. I should have thought of awk.

That's the extent of my awk-foo, see Giorgos' post for fancier stuff :)

And thanks to James for the base script to bother playing with ..

Cheers, Ian



Re: Batch file question - average size of file in directory

2007-01-02 Thread James Long
 Message: 28
 Date: Tue, 2 Jan 2007 10:20:08 -0800
 From: Kurt Buff [EMAIL PROTECTED]
 Subject: Batch file question - average size of file in directory
 To: [EMAIL PROTECTED]
 Message-ID:
   [EMAIL PROTECTED]
 Content-Type: text/plain; charset=ISO-8859-1; format=flowed
 
 All,
 
 I don't even have a clue how to start this one, so am looking for a little 
 help.
 
 I've got a directory with a large number of gzipped files in it (over
 110k) along with a few thousand uncompressed files.
 
 I'd like to find the average uncompressed size of the gzipped files,
 and ignore the uncompressed files.
 
 How on earth would I go about doing that with the default shell (no
 bash or other shells installed), or in perl, or something like that.
 I'm no scripter of any great expertise, and am just stumbling over
 this trying to find an approach.
 
 Many thanks for any help,
 
 Kurt

Hi, Kurt.

Can I make some assumptions that simplify things?  No kinky filenames, 
just [a-zA-Z0-9.].  My approach specifically doesn't like colons or 
spaces, I bet.  Also, you say gzipped, so I'm assuming it's ONLY gzip, 
no bzip2, etc.

Here's a first draft that might give you some ideas.  It will output:

foo.gz : 3456
bar.gz : 1048576
(etc.)

find . -type f | while read fname; do
  file $fname | grep -q compressed && echo "$fname : $(zcat $fname | wc -c)"
done


If you really need a script that will do the math for you, then
pipe the output of this into bc:

#!/bin/sh

find . -type f | {

n=0
echo "scale=2"
echo -n "("
while read fname; do
  if file $fname | grep -q compressed
  then
    echo -n "$(zcat $fname | wc -c)+"
    n=$(($n+1))
  fi
done
echo "0) / $n"

}

That should give you the average decompressed size of the gzip'ped
files in the current directory.
