Re: Specifying multiple separators via FS or the -F command line flag - addendum

Bob Proulx Mon, 03 Dec 2007 20:32:11 -0800

cga2000 wrote:
> Here's a sample of how the multiple separators feature behaves:
> 
> [15:52:[EMAIL PROTECTED]:~]$ echo " one: two:three :four five" | awk -F "[: ]" '{print "1 "$1; print "2 "$2; print "3 "$3; print "4 "$4; print "5 "$5; print "6 "$6;print "7 "$7;print "8 "$8}'


Thanks for the small example.  (I just read your last posting and will
probably respond to it but this one was much easier.)

> 1
> 2 one
> 3
> 4 two
> 5 three
> 6
> 7 four
> 8 five
> 
> Doesn't seem very logical to me.  

Each field separator is splitting a field.  So for example -F_ on
"___" would delimit four fields.  But before we go down this path, I
know what you want and we are going to do it differently to get there.
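
As a quick check of the -F_ claim, here is a minimal sketch (the
variable name nf is only for illustration).  Three underscores are
three separators, and each one delimits a field, so awk reports four
fields, all of them empty.

```shell
# Three separators in a row delimit four (empty) fields.
nf=$(echo "___" | awk -F_ '{print NF}')
echo "$nf"   # prints 4
```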

> When awk successfully tests for space or colon, the following characters
> are assumed NOT to be separators even if they have been defined as such
> via the -F flag -- eg. the <space> that follows "one:" is mapped to the
> $3 variable.
> 
> Is this the way it's supposed to work?

The way it is supposed to work is defined here:

  http://www.opengroup.org/onlinepubs/009695399/utilities/awk.html

Search for the section "Regular Expressions" where the FS ERE is
discussed.

An extended regular expression can be used to separate fields by using
the -F ERE option or by assigning a string containing the expression
to the built-in variable FS. The default value of the FS variable
shall be a single <space>. The following describes FS behavior:

   1. If FS is a null string, the behavior is unspecified.
   2. If FS is a single character:
         a. If FS is <space>, skip leading and trailing <blank>s;
            fields shall be delimited by sets of one or more <blank>s.
         b. Otherwise, if FS is any other character c, fields shall be
            delimited by each single occurrence of c.
   3. Otherwise, the string value of FS shall be considered to be an
      extended regular expression. Each occurrence of a sequence
      matching the extended regular expression shall delimit fields.

As you can see the default splitting behavior on a single space is
done as a one-off special.  The space is different than any other
field separator.
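
The three cases can be compared side by side.  This is only a sketch
with made-up input strings and variable names, counting fields with NF
under each rule.

```shell
# Rule 2a: FS is a single space, so runs of blanks delimit fields
# and leading/trailing blanks are skipped.
default_nf=$(echo "  a   b  " | awk '{print NF}')

# Rule 2b: FS is some other single character, so every single
# occurrence delimits a field, including the ones at the edges.
colon_nf=$(echo "::a::b::" | awk -F: '{print NF}')

# Rule 3: FS is an ERE, so each matching sequence delimits a field.
# Leading and trailing matches still produce empty fields.
ere_nf=$(echo "::a::b::" | awk -F':+' '{print NF}')

echo "$default_nf $colon_nf $ere_nf"   # prints 2 7 4
```

Only the default space gets the trimming behavior.  With ":+" the runs
collapse, but the empty first and last fields remain.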

What you probably want is option 3 above where the field separator is
an extended regular expression.  Try this:

  echo " one: two:three :four five" | awk -F "[: ]+" '{print "1 "$1; print "2 "$2; print "3 "$3; print "4 "$4; print "5 "$5; print "6 "$6;print "7 "$7;print "8 "$8}'
  1 
  2 one
  3 two
  4 three
  5 four
  6 five
  7 
  8 

The -F"[: ]+" has a "+" now and will match one or more occurrences of
either character.  But there is still a difference because leading
field separators are not trimmed.  There are a couple of ways of
dealing with that but neither is particularly elegant.

  echo " one: two:three :four five" | awk -F "[: ]+" '{sub(FS,"",$0);print "1 "$1; print "2 "$2; print "3 "$3; print "4 "$4; print "5 "$5; print "6 "$6;print "7 "$7;print "8 "$8}'
  1 one
  2 two
  3 three
  4 four
  5 five
  6 
  7 
  8 

This does a substitution on the line using the FS variable as the
pattern.  sub() replaces only the first match, which here is the
leading space.  That is the same as sub(/[: ]+/,"",$0); here but using
FS ties it to -F nicely.  The $0 can be omitted in this but I like to
be explicit.
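
One caveat, shown as a sketch with hypothetical input: if a line does
not begin with a separator, the first match is the separator between
the first two fields, and removing it merges them.  Anchoring the
pattern with "^" is one possible way to limit the removal to leading
separators.  The "^" FS concatenation and the variable names are my
own illustration, not part of the example above.

```shell
# Hypothetical input that does NOT start with a separator.
line="one: two:three"

# Unanchored: the first FS match is the ": " after "one", so the
# first two fields get merged into "onetwo".
unanchored=$(echo "$line" | awk -F "[: ]+" '{sub(FS, "", $0); print $1}')

# Anchored: "^" FS builds the dynamic regex "^[: ]+", which can only
# match at the start of the line, so this line is left alone.
anchored=$(echo "$line" | awk -F "[: ]+" '{sub("^" FS, "", $0); print $1}')

# The anchored form still trims a leading separator when there is one.
trimmed=$(echo " one: two" | awk -F "[: ]+" '{sub("^" FS, "", $0); print $1}')

echo "$unanchored / $anchored / $trimmed"   # prints onetwo / one / one
```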

Hope this helps,
Bob
