Matthew Jarvis wrote:

> Question (remember - I'm stoopid): where is this script getting the IP 
> addresses from? I don't understand the first part of:
> 
>   $ sub='s/.*[^0-9]([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+).*/\1/p'
>   $ sed -rn "$sub" logfile1 logfile2 logfile3

Answer (remember - we're all stoopid about different things): I had
assumed (maybe incorrectly) that you were starting with Apache log
files or something.  My Apache error log has lines that look like
this.

  [Mon Feb 25 17:37:23 2002] [error] [client 149.225.92.162] Directory index forbidden by rule: /home/kbob/web/jogger-egg/docs/kbob/dist/

The IP address is in the middle of a bunch of other text.

Regular expressions.  You went through the CS curriculum, I think, so
you probably met them in a formal languages class.  They're a big part
of Unix culture and you have to use them.

Sed is ancient, first written around 1974.  It uses an ancient dialect
of regular expressions, which is hard to read.  I'm a little
surprised that the GNU project didn't plug in Perl-compatible regexps
(PCRE) when they reimplemented it.  But the man page didn't mention
them.

Anyway, sed is the stream editor.  It reads input files or standard
input and writes to standard output.  I used the shell variable $sub to
hold a sed substitution command.  Normally you'd just put the sed
command(s) in line, but this one was big and ugly enough that it
obfuscated the rest of the story.

The substitute command says this.
        For every line of text in the file(s), look for this
        pattern. If you find it, replace it with something else, and
        write the resulting line of text to standard output.

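The general syntax is:
        s/pattern/replacement/p
        s means substitution command.
        p means print the result, if a substitution was made.

The -n option tells sed not to print every line automatically, so
only the lines that matched come out.  The -r option selects extended
regular expressions, so +, ( and ) work without backslashes.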
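
The p flag is easiest to see in a tiny example.  This is only a
sketch, assuming GNU sed and a made-up two-line input (nothing to do
with the log files): run with -n, only the line where the substitution
succeeds is printed.

```shell
# Hypothetical input: two lines, only one of which contains "an".
demo_input='apple
banana'

# -n turns off automatic printing; the trailing p prints a line only
# when the substitution actually happened on it.
printed=$(printf '%s\n' "$demo_input" | sed -n 's/an/AN/p')
echo "$printed"    # prints: bANana
```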

The pattern part of the command is this:
        .*[^0-9]([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+).*

Some common idioms:
        [0-9]  matches a digit.
        [0-9]+ matches one or more digits.
        [^0-9] matches a non-digit.
        .      matches any character
        .*     matches zero or more of any character (i.e., any string)
        \.     matches a period character
        (...)  save the text that matches ... in a temporary variable.
               the temp variables are called \1, \2, \3 ...
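
A quick way to try these idioms, sketched with GNU sed's -r option and
invented input strings.  In the first command the two pairs of
parentheses are saved as \1 and \2, and the replacement swaps them.

```shell
# Invented sample text: a word, a dot, and a number.
sample='version.42'

# ([a-z]+) saves the letters in \1, \. matches the literal dot,
# ([0-9]+) saves the digits in \2.  The replacement swaps the two.
swapped=$(printf '%s\n' "$sample" | sed -r 's/([a-z]+)\.([0-9]+)/\2.\1/')
echo "$swapped"        # prints: 42.version

# [^0-9]+ matches the leading run of non-digits, which we delete.
digits_only=$(printf '%s\n' 'abc123' | sed -r 's/[^0-9]+//')
echo "$digits_only"    # prints: 123
```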

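So, in English, the pattern says,
        Match anything ending with a nondigit.  Then save into \1 the
        sequence digits, dot, digits, dot, digits, dot, digits.  (In
        other words, nn.nn.nn.nn)  Then match everything else in the
        line.

The nondigit is there because .* is greedy: without it, .* would
swallow the leading digits of the first number, too.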

The whole command says this.
        s/.*[^0-9](address).*/\1/p
The pattern matches the whole line - that's why it starts and ends
with .* .  We replace the whole line with just the IP address, which
the parentheses saved in \1, and print it.
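
Here is the whole command run against the sample error-log line from
earlier, as a sketch assuming GNU sed (for the -r flag).  The second
command is my own variation, not part of the original script: it drops
the [^0-9] to show what that piece is for.

```shell
# The sample Apache error-log line from above, on one line.
line='[Mon Feb 25 17:37:23 2002] [error] [client 149.225.92.162] Directory index forbidden by rule: /home/kbob/web/jogger-egg/docs/kbob/dist/'

# Single quotes keep the shell from touching the backslashes.
sub='s/.*[^0-9]([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+).*/\1/p'
ip=$(printf '%s\n' "$line" | sed -rn "$sub")
echo "$ip"         # prints: 149.225.92.162

# Variation without [^0-9]: the greedy .* grabs "14" as well,
# leaving only one digit for the first [0-9]+.
bad='s/.*([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+).*/\1/p'
clipped=$(printf '%s\n' "$line" | sed -rn "$bad")
echo "$clipped"    # prints: 9.225.92.162
```

One consequence of requiring a nondigit: a line that begins with the
address has nothing in front of it to match [^0-9], so this pattern
skips it.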

In PCRE, you can write it a little more simply.  (Just a little.)
        .*?(\d+\.\d+\.\d+\.\d+).*

The .*? idiom is a non-greedy match: it matches as little as it can,
so it stops at the first address and needs no [^0-9] in front.  Read
the perlre man page.
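
A sketch of the non-greedy version in action, assuming perl and GNU
sed are both installed.  The input line is invented and holds two
addresses, to show one practical difference: the non-greedy Perl
pattern picks the first address on the line, while the greedy sed
pattern ends up with the last one.

```shell
# Invented line containing two addresses.
two='from 10.0.0.1 to 192.168.1.20 ok'

# Perl: .*? matches as little as possible, so the first address wins.
first=$(printf '%s\n' "$two" | perl -ne 'print if s/.*?(\d+\.\d+\.\d+\.\d+).*/$1/')
echo "$first"      # prints: 10.0.0.1

# sed: the greedy .* runs as far right as it can, so the last one wins.
last=$(printf '%s\n' "$two" | sed -rn 's/.*[^0-9]([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+).*/\1/p')
echo "$last"       # prints: 192.168.1.20
```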

Perl also has an extended regular expression syntax (the /x
modifier), which nobody uses.  In that syntax, you can use whitespace
and comments inside the pattern.  That is allegedly more readable.
Here's our command in extended syntax.

        s/
            .*?                 # discard leading trash
            (
                [[:digit:]]+    # first octet
                \.
                [[:digit:]]+    # second octet
                \.
                [[:digit:]]+    # third octet
                \.
                [[:digit:]]+    # fourth octet
            )
            .*                  # discard trailing trash
        /$1/x;

You probably ought to get a copy of Jeff Friedl's regular expression
book, Mastering Regular Expressions, sometime.  Other resources:

        info sed
        man perlre

-- 
Bob Miller                              K<bob>
                                        [EMAIL PROTECTED]
_______________________________________________
EUGLUG mailing list
[email protected]
http://www.euglug.org/mailman/listinfo/euglug
