Re: Explanation of magic formula (Was: script for selective sigs)

Ralph Shumaker Sat, 23 Feb 2008 01:50:25 -0800

SJS wrote:

begin  quoting Ralph Shumaker as of Fri, Feb 22, 2008 at 08:23:48PM -0800:

James G. Sack (jim) wrote:

SJS wrote:

[snip]

I can't argue, and in fact, confess to similar experiences and
inclinations. I'm thinking that if I (or you) didn't do this sort of
thing so frequently, or hadn't passed a certain threshold of cumulative
usage with these tools, that each little task like you mention could
become more daunting. Maybe we have to ask Ralph?
I've never seen the guts of a sed program (or awk for that matter). But I learned just what I needed to know of sed in order to go thru a LOT of text and make substitutions. I even used tr to get rid of all the line feeds. (I think I replaced them with something else as markers, and then had sed replace all surplus line feed markers.)
Altho, I must admit that anytime I see some magic command line formula in the kplug lists, I do try to figure it out. Sometimes I do, sometimes not. (SS' magic formula above is beyond me.)


Whenever I post code that doesn't make sense, go ahead and request an
explanation. Please.

[ chop - rearrange ]

For example, I wanted to quickly get an idea of how many duplicate files
(with different names) I had in a directory, to see if it was worthwhile
trying to identify the duplicates.  Since I don't know of a command that
will do this out of the box, I evolved an answer without much effort:

% cksum * | awk '{print $1}' | sort | uniq -c | sed -e 's/ *1 .*//' \
         | sort -u | awk '{print $2}'


Well, cksum prints out the checksum, size, and filename of its argument(s).
So "cksum *" prints out the checksums, sizes, and filenames of all the
files in the current directory.

Then the "awk '{print $1}'" prints out the first "field" for each line,
where fields are separated by one or more whitespace characters. This
gives me a list of just the checksums for each of the files, one
checksum per line.

Then the sort command sorts all of the checksums, so that files that
have the same checksum (and thus presumably the same content) will be
adjacent.

Then the uniq command deletes adjacent duplicate lines. The -c option
tells uniq to count how many times each line occurred, and to print that
count out before it prints out the (now unique) line.
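
A minimal sketch of that step on its own, using invented four-digit values in place of real checksums (they are assumptions for illustration, already in sorted order so that duplicates sit next to each other):

```shell
# Made-up "checksums", pre-sorted so equal values are adjacent.
counts=$(printf '%s\n' 1111 2222 2222 3333 | uniq -c)

# Each output line is a count followed by the value it belongs to.
# The amount of padding in front of the count varies between uniq versions.
printf '%s\n' "$counts"
```

The value 2222 should come back with a count of 2 and the other two with a count of 1.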

The sed command takes a "-e expression" argument, where the expression
is just like what you'd give vim after hitting <esc>: -- _s_ubstitute
any number of spaces, followed by "1", another space, followed by
anything, and replace it with nothing.  Basically, any checksum that
only occurred once (and thus represents a file that has no duplicates)
will have a count of 1, and this sed command will turn it into a blank
line.
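
To watch that substitution in isolation, here is a small sketch fed with two invented lines shaped like "uniq -c" output (the checksum values are made up):

```shell
# Two invented lines: a count, then a checksum.
blanked=$(printf '      1 1111111111\n      2 2222222222\n' \
          | sed -e 's/ *1 .*//')

# The count-of-1 line is reduced to an empty line; the count-of-2 line
# contains no "1 " and so passes through unchanged.
printf '%s\n' "$blanked"
```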

Then we sort again, which collects all of the blank lines together. The
-u option to sort does basically the same thing as uniq -- identical
lines are collapsed into just one.  Since the only lines that are going
to be identical are the blank lines, this has the effect of getting
rid of all but one of the blank lines.

The final awk command prints out the second field. By this time in the
pipeline, the first field is the count, and the second field is the
checksum, so this will just print out the checksum without the counts.
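
Two caveats seem worth noting about the formula as written. The sed pattern is not anchored to the start of the line, so a count ending in 1 (11, 21, ...) would also be mangled, and one empty line survives to the end whenever at least one file had no duplicate. A possible shorter variant, offered as a sketch rather than as what was posted, leans on the -d option of uniq, which prints only lines that occur more than once. The function name dup_sums is a hypothetical helper, not part of the original.

```shell
# dup_sums DIR: print each checksum shared by two or more files in DIR.
# (dup_sums is a hypothetical name chosen for this sketch.)
dup_sums() {
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$1" && cksum * | awk '{print $1}' | sort | uniq -d )
}
```

Running "dup_sums ." in the directory of interest should list one checksum per group of identical files, with no counts to strip afterwards.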

A useful thing to do is to make a scratch directory and copy some
files into it, making sure to copy some files twice to different names.
Then run successively longer subsets of the "magic formula" -- start
with "cksum *", and add one more pipe segment each time.
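
One way that experiment might look, as a sketch: the file names and contents below are invented, and mktemp -d is assumed to be available for making the scratch directory.

```shell
# Hypothetical scratch setup: three files, two with identical content.
scratch=$(mktemp -d)
printf 'alpha\n' > "$scratch/a.txt"
printf 'alpha\n' > "$scratch/copy_of_a.txt"
printf 'beta\n'  > "$scratch/b.txt"

# Run each stage in a subshell so the caller's working directory is untouched.
(
    cd "$scratch" || exit 1
    echo '--- cksum *'
    cksum *
    echo '--- add: awk {print $1}'
    cksum * | awk '{print $1}'
    echo '--- add: sort'
    cksum * | awk '{print $1}' | sort
    echo '--- add: uniq -c'
    cksum * | awk '{print $1}' | sort | uniq -c
)
```

The remaining segments (sed, sort -u, and the final awk) can be added the same way, one at a time.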


I may have to do that to help burn this into the engrams.

--
Hope that helps
And not confuses.
Stewart Stremler

No confusion at all. I understand this stuff ~just~ enuf to be able to follow the explanations (usually), which is much better than even just a few years ago. (Hopefully, Carl will explain his enhanced version.) (Who knows, maybe I can figure it out.)


Thanks for spelling it out.

(Often I will find out just by following the thread, reading the responses, maybe about 30 to 40% of the time.)




--
Ralph

--------------------

An enlightened dictatorship is, without question, the most efficient form of government.

--Andrew Lentvorski

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg
