[gentoo-doc-cvs] cvs commit: l-awk1.xml

Xavier Neys Thu, 28 Jul 2005 01:04:22 -0700

neysx       05/07/28 08:04:04

  Added:       xml/htdocs/doc/en/articles l-awk1.xml l-awk2.xml l-awk3.xml
  Log:
  #99260 xmlified awk articles


Revision  Changes    Path
1.1                  xml/htdocs/doc/en/articles/l-awk1.xml

file : 
http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk1.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: 
http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk1.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: l-awk1.xml
===================================================================
<?xml version='1.0' encoding="UTF-8"?>
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk1.xml,v 1.1 
2005/07/28 08:04:04 neysx Exp $ -->
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

<guide link="/doc/en/articles/l-awk1.xml">
<title>Awk by example, Part 1</title>

<author title="Author">
  <mail link="[EMAIL PROTECTED]">Daniel Robbins</mail>
</author>
<author title="Editor">
  <mail link="[EMAIL PROTECTED]">Łukasz Damentko</mail>
</author>

<abstract>
Awk is a very nice language with a very strange name. In this first article of a
three-part series, Daniel Robbins will quickly get your awk programming skills
up to speed. As the series progresses, more advanced topics will be covered,
culminating with an advanced real-world awk application demo.
</abstract>

<!-- The original version of this article was published on IBM developerWorks,
and is property of Westtech Information Services. This document is an updated
version of the original article, and contains various improvements made by the
Gentoo Linux Documentation team -->

<version>1.0</version>
<date>2005-07-15</date>

<chapter>
<title>An intro to the great language with the strange name</title>
<section>
<title>In defense of awk</title>
<body>

<note>
The original version of this article was published on IBM developerWorks, and is
property of Westtech Information Services. This document is an updated version
of the original article, and contains various improvements made by the Gentoo
Linux Documentation team.
</note>

<p>
In this series of articles, I'm going to turn you into a proficient awk coder.
I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the
GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with
the language may hear "awk" and think of a mess of code so backwards and
antiquated that it's capable of driving even the most knowledgeable UNIX guru to
the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for
the coffee machine).
</p>

<p>
Sure, awk doesn't have a great name. But it is a great language. Awk is geared
toward text processing and report generation, yet it has many well-designed
features that allow for serious programming. And, unlike some languages, awk's
syntax is familiar, and borrows some of the best parts of languages like C,
Python, and bash (although, technically, awk was created before both Python and
bash). Awk is one of those languages that, once learned, will become a key part
of your strategic coding arsenal.
</p>

</body>
</section>
<section>
<title>The first awk</title>
<body>

<p>
You should see the contents of your <path>/etc/passwd</path> file appear before
your eyes. Now, for an explanation of what awk did. When we called awk, we
specified <path>/etc/passwd</path> as our input file. When we executed awk, it
evaluated the print command for each line in <path>/etc/passwd</path>, in
order. All output is sent to stdout, and we get a result identical to catting
<path>/etc/passwd</path>.
</p>

<p>
Now, for an explanation of the { print } code block. In awk, curly braces are
used to group blocks of code together, similar to C. Inside our block of code,
we have a single print command. In awk, when a print command appears by itself,
the full contents of the current line are printed.
</p>

<pre caption="Printing the current line">
$ <i>awk '{ print $0 }' /etc/passwd</i>
<comment>(By contrast, this prints one empty line for each input line)</comment>
$ <i>awk '{ print "" }' /etc/passwd</i>
</pre>

<p>
In awk, the $0 variable represents the entire current line, so print and print
$0 do exactly the same thing.
</p>

<pre caption="Filling the screen with some text">
$ <i>awk '{ print "hiya" }' /etc/passwd</i>
</pre>

</body>
</section>
<section>
<title>Multiple fields</title>
<body>

<pre caption="Printing $1 and $3 with no separator">
$ <i>awk -F":" '{ print $1 $3 }' /etc/passwd</i>
halt7
operator11
root0
shutdown6
sync5
bin1
<comment>....etc.</comment>
</pre>

<pre caption="Printing $1 and $3 separated by a space">
$ <i>awk -F":" '{ print $1 " " $3 }' /etc/passwd</i>
</pre>

<pre caption="Labelling the fields">
$ <i>awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd</i>
username: halt          uid:7
username: operator      uid:11
username: root          uid:0
username: shutdown      uid:6
username: sync          uid:5
username: bin           uid:1
<comment>....etc.</comment>
</pre>

</body>
</section>
<section>
<title>External scripts</title>
<body>

<pre caption="Sample script">
BEGIN { FS=":" }
{ print $1 }
</pre>

<p>
The difference between these two methods has to do with how we set the field
separator. In this script, the field separator is specified within the code
itself (by setting the FS variable), while our previous example set FS by
passing the -F":" option to awk on the command line. It's generally best to set
the field separator inside the script itself, simply because it means you have
one less command line argument to remember to type. We'll cover the FS variable
in more detail later in this article.
</p>
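
<p>
As a rough sketch of the two approaches side by side, both commands below
should print the first colon-delimited field of every line. The file name
<path>first.awk</path> is hypothetical; it stands for the sample script above
saved to disk.
</p>

<pre caption="Two ways of setting the field separator (first.awk is a hypothetical file name)">
$ <i>awk -F":" '{ print $1 }' /etc/passwd</i>
$ <i>awk -f first.awk /etc/passwd</i>
</pre>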

</body>
</section>
<section>
<title>The BEGIN and END blocks</title>
<body>

<p>
Normally, awk executes each block of your script's code once for each input
line. However, there are many programming situations where you may need to
execute initialization code before awk begins processing the text from the input
file. For such situations, awk allows you to define a BEGIN block. We used a
BEGIN block in the previous example. Because the BEGIN block is evaluated before
awk starts processing the input file, it's an excellent place to initialize the
FS (field separator) variable, print a heading, or initialize other global
variables that you'll reference later in the program.
</p>

<p>
Awk also provides another special block, called the END block. Awk executes this
block after all lines in the input file have been processed. Typically, the END
block is used to perform final calculations or print summaries that should
appear at the end of the output stream.
</p>
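
<p>
As a small illustration of both special blocks, the following sketch prints a
heading before any input is read, lists each user name, and reports the number
of lines at the end. It assumes colon-delimited input such as
<path>/etc/passwd</path>; the heading text and the counter name <c>total</c>
are arbitrary choices for this example.
</p>

<pre caption="A sketch using BEGIN and END (the variable name total is arbitrary)">
BEGIN {
    FS=":"              # set the field separator before any input is read
    print "User names"  # heading, printed exactly once
}
{
    print $1            # runs once for every input line
    total = total + 1   # an unset variable counts as 0 in arithmetic
}
END {
    print total " accounts found"  # summary, printed after the last line
}
</pre>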

</body>
</section>
<section>
<title>Regular expressions and blocks</title>
<body>

<pre caption="Regular expressions and blocks">
/foo/ { print }
/[0-9]+\.[0-9]*/ { print }
</pre>

</body>
</section>
<section>
<title>Expressions and blocks</title>
<body>

<pre caption="Matching a field against a string">
$1 == "fred" { print $3 }
</pre>

<pre caption="Matching a field against a regular expression">
$5 ~ /root/ { print $3 }
</pre>



1.1                  xml/htdocs/doc/en/articles/l-awk2.xml

file : 
http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: 
http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: l-awk2.xml
===================================================================
<?xml version='1.0' encoding="UTF-8"?>
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk2.xml,v 1.1 
2005/07/28 08:04:04 neysx Exp $ -->
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

<guide link="/doc/en/articles/l-awk2.xml">
<title>Awk by example, Part 2</title>

<author title="Author">
  <mail link="[EMAIL PROTECTED]">Daniel Robbins</mail>
</author>
<author title="Editor">
  <mail link="[EMAIL PROTECTED]">Łukasz Damentko</mail>
</author>

<abstract>
In this sequel to his previous intro to awk, Daniel Robbins continues to explore
awk, a great language with a strange name. Daniel will show you how to handle
multi-line records, use looping constructs, and create and use awk arrays. By
the end of this article, you'll be well versed in a wide range of awk features,
and you'll be ready to write your own powerful awk scripts.
</abstract>

<!-- The original version of this article was published on IBM developerWorks,
and is property of Westtech Information Services. This document is an updated
version of the original article, and contains various improvements made by the
Gentoo Linux Documentation team -->

<version>1.0</version>
<date>2005-07-27</date>

<chapter>
<title>Records, loops, and arrays</title>
<section>
<title>Multi-line records</title>
<body>

<note>
The original version of this article was published on IBM developerWorks, and is
property of Westtech Information Services. This document is an updated version
of the original article, and contains various improvements made by the Gentoo
Linux Documentation team.
</note>

<p>
Awk is an excellent tool for reading in and processing structured data, such as
the system's <path>/etc/passwd</path> file. <path>/etc/passwd</path> is the UNIX
user database, and is a colon-delimited text file, containing a lot of important
information, including all existing user accounts and user IDs, among other
things. In <uri link="/doc/en/articles/l-awk1.xml">my previous article</uri>, I
showed you how awk could easily parse this file. All we had to do was to set the
FS (field separator) variable to ":".
</p>

<p>
By setting the FS variable correctly, awk can be configured to parse almost any
kind of structured data, as long as there is one record per line. However, just
setting FS won't do us any good if we want to parse a record that exists over
multiple lines. In these situations, we also need to modify the RS record
separator variable. The RS variable tells awk when the current record ends and a
new record begins.
</p>

<p>
As an example, let's look at how we'd handle the task of processing an address
list of Federal Witness Protection Program participants:
</p>

<pre caption="Sample entry from Federal Witness Protection Program participants list">
Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345

Big Tony
200 Incognito Ave.
Suburbia, WA 67890
</pre>

<p>
Ideally, we'd like awk to recognize each 3-line address as an individual record,
rather than as three separate records. It would make our code a lot simpler if
awk would recognize the first line of the address as the first field ($1), the
street address as the second field ($2), and the city, state, and zip code as
field $3. The following code will do just what we want:
</p>

<pre caption="Making one field from the address">
BEGIN {
    FS="\n"
    RS=""
}
</pre>

<p>
Above, setting FS to "\n" tells awk that each field appears on its own line. By
setting RS to "", we also tell awk that each address record is separated by a
blank line. Once awk knows how the input is formatted, it can do all the parsing
work for us, and the rest of the script is simple. Let's look at a complete
script that will parse this address list and print out each address record on a
single line, separating each field with a comma.
</p>

<pre caption="Complete script">
BEGIN {
    FS="\n"
    RS=""
}
{ print $1 ", " $2 ", " $3 }
</pre>


<p>
If this script is saved as <path>address.awk</path>, and the address data is
stored in a file called <path>address.txt</path>, you can execute this script by
typing <c>awk -f address.awk address.txt</c>. This code produces the following
output:
</p>

<pre caption="The script's output">
Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
Big Tony, 200 Incognito Ave., Suburbia, WA 67890
</pre>

</body>
</section>
<section>
<title>OFS and ORS</title>
<body>

<p>
In address.awk's print statement, you can see that awk concatenates (joins)
strings that are placed next to each other on a line. We used this feature to
insert a comma and a space (", ") between the three address fields that appeared
on the line. While this method works, it's a bit ugly looking. Rather than
inserting literal ", " strings between our fields, we can have awk do it for us
by setting a special awk variable called OFS. Take a look at this code snippet.
</p>

<pre caption="Sample code snippet">
print "Hello", "there", "Jim!"
</pre>

<p>
The commas on this line are not part of the actual literal strings. Instead,
they tell awk that "Hello", "there", and "Jim!" are separate fields, and that
the OFS variable should be printed between each string. By default, awk produces
the following output:
</p>

<pre caption="Output produced by awk">
Hello there Jim!
</pre>

<p>
This shows us that by default, OFS is set to " ", a single space. However, we
can easily redefine OFS so that awk will insert our favorite field separator.
Here's a revised version of our original <path>address.awk</path> program that
uses OFS to output those intermediate ", " strings:
</p>

<pre caption="Redefining OFS">
BEGIN {
    FS="\n"
    RS=""
    OFS=", "
}
{ print $1, $2, $3 }
</pre>

<p>
Awk also has a special variable called ORS, the "output record
separator". By setting ORS, which defaults to a newline ("\n"), we can control
the character that's automatically printed at the end of a print statement. The
default ORS value causes awk to output each new print statement on a new line.
If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or,
if we wanted records to be separated by a single space (and no newline), we
would set ORS to " ".
</p>
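
<p>
For instance, a double-spaced variant of the revised
<path>address.awk</path> might look like the sketch below. It assumes the same
blank-line-separated address list as before.
</p>

<pre caption="Double-spacing the output with ORS (a sketch)">
BEGIN {
    FS="\n"
    RS=""
    OFS=", "
    ORS="\n\n"   # every print now ends with a newline plus a blank line
}
{ print $1, $2, $3 }
</pre>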

</body>
</section>
<section>
<title>Multi-line to tabbed</title>
<body>

<p>
Let's say that we wrote a script that converted our address list to a
single-line per record, tab-delimited format for import into a spreadsheet.
After using a slightly modified version of <path>address.awk</path>, it would
become clear that our program only works for three-line addresses. If awk
encountered the following address, the fourth line would be thrown away and not
printed:
</p>

<pre caption="Sample entry">
Cousin Vinnie
Vinnie's Auto Shop
300 City Alley
Sosueme, OR 76543
</pre>
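
<p>
One possible way around this, sketched here under the assumption that every
record should simply become one tab-delimited line, is to loop over all of the
record's fields instead of naming $1, $2, and $3 explicitly. NF is awk's
built-in count of the fields in the current record; the variable names
<c>line</c> and <c>i</c> are arbitrary.
</p>

<pre caption="Printing however many fields a record has (a sketch)">
BEGIN {
    FS="\n"
    RS=""
}
{
    line = $1
    for (i = 2; i &lt;= NF; i++)   # append the remaining fields, tab-separated
        line = line "\t" $i
    print line
}
</pre>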




1.1                  xml/htdocs/doc/en/articles/l-awk3.xml

file : 
http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk3.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: 
http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk3.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: l-awk3.xml
===================================================================
<?xml version='1.0' encoding="UTF-8"?>
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk3.xml,v 1.1 
2005/07/28 08:04:04 neysx Exp $ -->
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

<guide link="/doc/en/articles/l-awk3.xml">
<title>Awk by example, Part 3</title>

<author title="Author">
  <mail link="[EMAIL PROTECTED]">Daniel Robbins</mail>
</author>
<author title="Editor">
  <mail link="[EMAIL PROTECTED]">Łukasz Damentko</mail>
</author>

<abstract>
In this sequel to his previous intro to awk, Daniel Robbins continues to explore
awk, a great language with a strange name. Daniel will show you how to handle
multi-line records, use looping constructs, and create and use awk arrays. By
the end of this article, you'll be well versed in a wide range of awk features,
and you'll be ready to write your own powerful awk scripts.
</abstract>

<!-- The original version of this article was published on IBM developerWorks,
and is property of Westtech Information Services. This document is an updated
version of the original article, and contains various improvements made by the
Gentoo Linux Documentation team -->

<version>1.0</version>
<date>2005-07-27</date>

<chapter>
<title>String functions and ... checkbooks?</title>
<section>
<title>Formatting output</title>
<body>

<p>
While awk's print statement does do the job most of the time, sometimes more is
needed. For those times, awk offers two good old friends called printf() and
sprintf(). Yes, these functions, like so many other awk parts, are identical to
their C counterparts. printf() will print a formatted string to stdout, while
sprintf() returns a formatted string that can be assigned to a variable. If
you're not familiar with printf() and sprintf(), an introductory C text will
quickly get you up to speed on these two essential printing functions. You can
view the printf() man page by typing "man 3 printf" on your Linux system.
</p>

<p>
Here's some sample awk sprintf() and printf() code. As you can see, everything
looks almost identical to C.
</p>

<pre caption="Sample awk sprintf() and printf() code">
x=1
b="foo"
printf("%s got a %d on the last test\n","Jim",83)
myout=sprintf("%s-%d",b,x)
print myout
</pre>

<p>
This code will print:
</p>

<pre caption="Code output">
Jim got a 83 on the last test
foo-1
</pre>
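
<p>
To hint at where printf() earns its keep, here is a sketch that lines up user
names and user IDs in columns. It assumes colon-delimited input in the style of
<path>/etc/passwd</path>; the column widths (12 and 6) are arbitrary.
</p>

<pre caption="Aligning columns with printf() (a sketch; the widths are arbitrary)">
BEGIN { FS=":" }
{
    # %-12s left-justifies the name in 12 columns,
    # %6d right-justifies the user ID in 6 columns
    printf("%-12s%6d\n", $1, $3)
}
</pre>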

</body>
</section>
<section>
<title>String functions</title>
<body>

<p>
Awk has a plethora of string functions, and that's a good thing. In awk, you
really need string functions, since you can't treat a string as an array of
characters as you can in other languages like C, C++, and Python. For example,
if you execute the following code:
</p>

<pre caption="Example code">
mystring="How are you doing today?"
print mystring[3]
</pre>

<p>
You'll receive an error that looks something like this:
</p>

<pre caption="Example code error">
awk: string.gawk:59: fatal: attempt to use scalar as array
</pre>

<p>
Oh, well. While not as convenient as Python's sequence types, awk's string
functions get the job done. Let's take a look at them.
</p>

<p>
First, we have the basic length() function, which returns the length of a
string. Here's how to use it:
</p>

<pre caption="length() function example">
print length(mystring)
</pre>

<p>
This code will print the value:
</p>

<pre caption="Printed value">
24
</pre>

<p>
OK, let's keep going. The next string function is called index(), and will return
the position of the first occurrence of a substring in another string, or it will
return 0 if the substring isn't found. Using mystring, we can call it this way:
</p>

<pre caption="index() function example">
print index(mystring,"you")
</pre>

<p>
Awk prints:
</p>

<pre caption="Function output">
9
</pre>

<p>
We move on to two more easy functions, tolower() and toupper(). As you might
guess, these functions will return the string with all characters converted to
lowercase or uppercase respectively. Notice that tolower() and toupper() return
the new string, and don't modify the original. This code:
</p>

<pre caption="Converting strings to lower or uppercase">
print tolower(mystring)
print toupper(mystring)
print mystring
</pre>

<p>
....will produce this output:
</p>

<pre caption="Output">
how are you doing today?
HOW ARE YOU DOING TODAY?
How are you doing today?
</pre>

<p>
So far so good, but how exactly do we select a substring or even a single
character from a string? That's where substr() comes in. Here's how to call
substr():
</p>

<pre caption="substr() function example">
mysub=substr(mystring,startpos,maxlen)
</pre>

<p>
mystring should be either a string variable or a literal string from which you'd
like to extract a substring. startpos should be set to the starting character
position (counting from 1), and maxlen should contain the maximum length of the
string you'd like to extract. Notice that I said maximum length; if fewer than
maxlen characters remain between startpos and the end of mystring, your result
will be correspondingly shorter. substr() won't modify the original string, but
returns the substring instead. Here's an example:
</p>

<pre caption="Another example">
print substr(mystring,9,3)
</pre>

<p>
Awk will print:
</p>

<pre caption="What awk prints">
you
</pre>

<p>
If you regularly program in a language that uses array indices to access parts
of a string (and who doesn't?), make a mental note that substr() is your awk
substitute. You'll need to use it to extract single characters and substrings;
because awk is a string-based language, you'll be using it often.
</p>
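
<p>
As a sketch of that substitution, the loop below prints the first three
characters of mystring one at a time. The loop bound of 3 and the variable name
<c>i</c> are arbitrary choices for this example.
</p>

<pre caption="Walking through a string one character at a time (a sketch)">
BEGIN {
    mystring="How are you doing today?"
    # substr(s, i, 1) plays the role that s[i] would play elsewhere;
    # note that awk counts character positions from 1, not 0
    for (i = 1; i &lt;= 3; i++)
        print i, substr(mystring, i, 1)
}
</pre>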

<p>
Now, we move on to some meatier functions, the first of which is called match().
match() is a lot like index(), except instead of searching for a substring like



-- 
[email protected] mailing list
