neysx       05/07/28 08:04:04

  Added:    xml/htdocs/doc/en/articles l-awk1.xml l-awk2.xml l-awk3.xml
  Log:
  #99260 xmlified awk articles
Revision  Changes    Path
1.1                  xml/htdocs/doc/en/articles/l-awk1.xml

file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk1.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk1.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: l-awk1.xml
===================================================================
<?xml version='1.0' encoding="UTF-8"?>
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk1.xml,v 1.1 2005/07/28 08:04:04 neysx Exp $ -->
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

<guide link="/doc/en/articles/l-awk1.xml">
<title>Awk by example, Part 1</title>

<author title="Author">
  <mail link="[EMAIL PROTECTED]">Daniel Robbins</mail>
</author>
<author title="Editor">
  <mail link="[EMAIL PROTECTED]">Łukasz Damentko</mail>
</author>

<abstract>
Awk is a very nice language with a very strange name. In this first article of a
three-part series, Daniel Robbins will quickly get your awk programming skills
up to speed. As the series progresses, more advanced topics will be covered,
culminating with an advanced real-world awk application demo.
</abstract>

<!-- The original version of this article was published on IBM developerWorks, and is property of Westtech Information Services. This document is an updated version of the original article, and contains various improvements made by the Gentoo Linux Documentation team -->

<version>1.0</version>
<date>2005-07-15</date>

<chapter>
<title>An intro to the great language with the strange name</title>
<section>
<title>In defense of awk</title>
<body>

<note>
The original version of this article was published on IBM developerWorks, and is
property of Westtech Information Services. This document is an updated version
of the original article, and contains various improvements made by the Gentoo
Linux Documentation team.
</note>

<p>
In this series of articles, I'm going to turn you into a proficient awk coder.
I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the
GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with
the language may hear "awk" and think of a mess of code so backwards and
antiquated that it's capable of driving even the most knowledgeable UNIX guru to
the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for
the coffee machine).
</p>

<p>
Sure, awk doesn't have a great name. But it is a great language. Awk is geared
toward text processing and report generation, yet has many well-designed
features that allow for serious programming. And, unlike some languages, awk's
syntax is familiar, and borrows some of the best parts of languages like C,
Python, and bash (although, technically, awk was created before both Python and
bash). Awk is one of those languages that, once learned, will become a key part
of your strategic coding arsenal.
</p>

</body>
</section>

<section>
<title>The first awk</title>
<body>

<p>
You should see the contents of your <path>/etc/passwd</path> file appear before
your eyes. Now, for an explanation of what awk did. When we called awk, we
specified <path>/etc/passwd</path> as our input file. When we executed awk, it
evaluated the print command for each line in <path>/etc/passwd</path>, in order.
All output is sent to stdout, and we get a result identical to catting
<path>/etc/passwd</path>.
</p>

<p>
Now, for an explanation of the { print } code block. In awk, curly braces are
used to group blocks of code together, similar to C. Inside our block of code,
we have a single print command. In awk, when a print command appears by itself,
the full contents of the current line are printed.
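</p>

<p>
As a minimal sketch of this behaviour, assuming two made-up passwd-style lines
as input, the shell snippet below pipes them through a bare print block. The
sample data and the echo_lines helper name are hypothetical and exist only for
illustration.
</p>

```shell
# Hypothetical sample input standing in for /etc/passwd (not from a real system).
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/bin/false'

# A bare print statement prints the whole current line, once per input line.
echo_lines() {
    awk '{ print }'
}

printf '%s\n' "$sample" | echo_lines
```

<p>
The output should match the input line for line, which is the cat-like
behaviour described for <path>/etc/passwd</path>.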
</p>

<pre caption="Printing the current line">
$ <i>awk '{ print $0 }' /etc/passwd</i>
<comment>(By contrast, print "" outputs one blank line per input line)</comment>
$ <i>awk '{ print "" }' /etc/passwd</i>
</pre>

<p>
In awk, the $0 variable represents the entire current line, so print and print
$0 do exactly the same thing.
</p>

<pre caption="Filling the screen with some text">
$ <i>awk '{ print "hiya" }' /etc/passwd</i>
</pre>

</body>
</section>

<section>
<title>Multiple fields</title>
<body>

<pre caption="Printing $1 and $3 with nothing in between">
$ <i>awk -F":" '{ print $1 $3 }' /etc/passwd</i>
halt7
operator11
root0
shutdown6
sync5
bin1
<comment>....etc.</comment>
</pre>

<pre caption="Separating $1 and $3 with a space">
$ <i>awk -F":" '{ print $1 " " $3 }' /etc/passwd</i>
</pre>

<pre caption="Labelling $1 and $3">
$ <i>awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd</i>
username: halt          uid:7
username: operator              uid:11
username: root          uid:0
username: shutdown              uid:6
username: sync          uid:5
username: bin           uid:1
<comment>....etc.</comment>
</pre>

</body>
</section>

<section>
<title>External scripts</title>
<body>

<pre caption="Sample script">
BEGIN {
  FS=":"
}
{ print $1 }
</pre>

<p>
The difference between these two methods has to do with how we set the field
separator. In this script, the field separator is specified within the code
itself (by setting the FS variable), while our previous example set FS by
passing the -F":" option to awk on the command line. It's generally best to set
the field separator inside the script itself, simply because it means you have
one less command line argument to remember to type. We'll cover the FS variable
in more detail later in this article.
</p>

</body>
</section>

<section>
<title>The BEGIN and END blocks</title>
<body>

<p>
Normally, awk executes each block of your script's code once for each input
line. However, there are many programming situations where you may need to
execute initialization code before awk begins processing the text from the input
file. For such situations, awk allows you to define a BEGIN block. We used a
BEGIN block in the previous example.
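</p>

<p>
A minimal sketch of such a BEGIN block, assuming a made-up colon-delimited
input and a hypothetical list_users helper: the block sets FS and prints a
heading exactly once, before any input line is read.
</p>

```shell
# Hypothetical colon-delimited sample input (made up for illustration).
sample='root:x:0
bin:x:1'

# BEGIN runs once, before the first input line is read, so FS is already
# set to ":" when awk splits the first line into fields.
list_users() {
    awk 'BEGIN { FS=":"; print "USERS" } { print $1 }'
}

printf '%s\n' "$sample" | list_users
```

<p>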
Because the BEGIN block is evaluated before awk starts processing the input
file, it's an excellent place to initialize the FS (field separator) variable,
print a heading, or initialize other global variables that you'll reference
later in the program.
</p>

<p>
Awk also provides another special block, called the END block. Awk executes this
block after all lines in the input file have been processed. Typically, the END
block is used to perform final calculations or print summaries that should
appear at the end of the output stream.
</p>

</body>
</section>

<section>
<title>Regular expressions and blocks</title>
<body>

<pre caption="Regular expressions and blocks">
/foo/ { print }
/[0-9]+\.[0-9]*/ { print }
</pre>

</body>
</section>

<section>
<title>Expressions and blocks</title>
<body>

<pre caption="Printing the third field when the first field is fred">
$1 == "fred" { print $3 }
</pre>

<pre caption="Printing the third field when the fifth field contains root">
$5 ~ /root/ { print $3 }
</pre>



1.1                  xml/htdocs/doc/en/articles/l-awk2.xml

file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: l-awk2.xml
===================================================================
<?xml version='1.0' encoding="UTF-8"?>
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk2.xml,v 1.1 2005/07/28 08:04:04 neysx Exp $ -->
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

<guide link="/doc/en/articles/l-awk2.xml">
<title>Awk by example, Part 2</title>

<author title="Author">
  <mail link="[EMAIL PROTECTED]">Daniel Robbins</mail>
</author>
<author title="Editor">
  <mail link="[EMAIL PROTECTED]">Łukasz Damentko</mail>
</author>

<abstract>
In this sequel to his previous intro to awk, Daniel Robbins continues to explore
awk, a great language with a strange name.
Daniel will show you how to handle multi-line records, use looping constructs,
and create and use awk arrays. By the end of this article, you'll be well versed
in a wide range of awk features, and you'll be ready to write your own powerful
awk scripts.
</abstract>

<!-- The original version of this article was published on IBM developerWorks, and is property of Westtech Information Services. This document is an updated version of the original article, and contains various improvements made by the Gentoo Linux Documentation team -->

<version>1.0</version>
<date>2005-07-27</date>

<chapter>
<title>Records, loops, and arrays</title>
<section>
<title>Multi-line records</title>
<body>

<note>
The original version of this article was published on IBM developerWorks, and is
property of Westtech Information Services. This document is an updated version
of the original article, and contains various improvements made by the Gentoo
Linux Documentation team.
</note>

<p>
Awk is an excellent tool for reading in and processing structured data, such as
the system's <path>/etc/passwd</path> file. <path>/etc/passwd</path> is the UNIX
user database, and is a colon-delimited text file, containing a lot of important
information, including all existing user accounts and user IDs, among other
things. In <uri link="/doc/en/articles/l-awk1.xml">my previous article</uri>, I
showed you how awk could easily parse this file. All we had to do was to set the
FS (field separator) variable to ":".
</p>

<p>
By setting the FS variable correctly, awk can be configured to parse almost any
kind of structured data, as long as there is one record per line. However, just
setting FS won't do us any good if we want to parse a record that exists over
multiple lines. In these situations, we also need to modify the RS record
separator variable. The RS variable tells awk when the current record ends and a
new record begins.
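</p>

<p>
To see RS at work in isolation, here is a small sketch that assumes a made-up
input whose records are separated by semicolons rather than newlines. The
number_records helper name is hypothetical, and NR (awk's built-in record
counter) is used only to make the record boundaries visible.
</p>

```shell
# Hypothetical input: three records separated by semicolons instead of newlines.
# RS=";" makes awk end a record at each semicolon; NR is awk's built-in
# count of records read so far.
number_records() {
    awk 'BEGIN { RS=";" } { print NR ": " $0 }'
}

printf 'alpha;beta;gamma' | number_records
```

<p>
Each semicolon-delimited chunk should come out as its own numbered record, even
though the input contains no newlines at all.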
</p>

<p>
As an example, let's look at how we'd handle the task of processing an address
list of Federal Witness Protection Program participants:
</p>

<pre caption="Sample entry from Federal Witness Protection Program participants list">
Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345

Big Tony
200 Incognito Ave.
Suburbia, WA 67890
</pre>

<p>
Ideally, we'd like awk to recognize each 3-line address as an individual record,
rather than as three separate records. It would make our code a lot simpler if
awk would recognize the first line of the address as the first field ($1), the
street address as the second field ($2), and the city, state, and zip code as
field $3. The following code will do just what we want:
</p>

<pre caption="Making one field from the address">
BEGIN {
  FS="\n"
  RS=""
}
</pre>

<p>
Above, setting FS to "\n" tells awk that each field appears on its own line. By
setting RS to "", we also tell awk that each address record is separated by a
blank line. Once awk knows how the input is formatted, it can do all the parsing
work for us, and the rest of the script is simple. Let's look at a complete
script that will parse this address list and print out each address record on a
single line, separating each field with a comma.
</p>

<pre caption="Complete script">
BEGIN {
  FS="\n"
  RS=""
}
{ print $1 ", " $2 ", " $3 }
</pre>

<p>
If this script is saved as <path>address.awk</path>, and the address data is
stored in a file called <path>address.txt</path>, you can execute this script by
typing <c>awk -f address.awk address.txt</c>. This code produces the following
output:
</p>

<pre caption="The script's output">
Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
Big Tony, 200 Incognito Ave., Suburbia, WA 67890
</pre>

</body>
</section>

<section>
<title>OFS and ORS</title>
<body>

<p>
In address.awk's print statement, you can see that awk concatenates (joins)
strings that are placed next to each other on a line.
We used this feature to insert a comma and a space (", ") between the three
address fields that appeared on the line. While this method works, it's a bit
ugly looking. Rather than inserting literal ", " strings between our fields, we
can have awk do it for us by setting a special awk variable called OFS. Take a
look at this code snippet.
</p>

<pre caption="Sample code snippet">
print "Hello", "there", "Jim!"
</pre>

<p>
The commas on this line are not part of the actual literal strings. Instead,
they tell awk that "Hello", "there", and "Jim!" are separate fields, and that
the OFS variable should be printed between each string. By default, awk produces
the following output:
</p>

<pre caption="Output produced by awk">
Hello there Jim!
</pre>

<p>
This shows us that by default, OFS is set to " ", a single space. However, we
can easily redefine OFS so that awk will insert our favorite field separator.
Here's a revised version of our original <path>address.awk</path> program that
uses OFS to output those intermediate ", " strings:
</p>

<pre caption="Redefining OFS">
BEGIN {
  FS="\n"
  RS=""
  OFS=", "
}
{ print $1, $2, $3 }
</pre>

<p>
Awk also has a special variable called ORS, the "output record separator". By
setting ORS, which defaults to a newline ("\n"), we can control the character
that's automatically printed at the end of a print statement. The default ORS
value causes awk to output each new print statement on a new line. If we wanted
to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted
records to be separated by a single space (and no newline), we would set ORS to
" ".
</p>

</body>
</section>

<section>
<title>Multi-line to tabbed</title>
<body>

<p>
Let's say that we wrote a script that converted our address list to a
single-line per record, tab-delimited format for import into a spreadsheet.
After using a slightly modified version of <path>address.awk</path>, it would
become clear that our program only works for three-line addresses.
If awk encountered the following address, the fourth line would be thrown away
and not printed:
</p>

<pre caption="Sample entry">
Cousin Vinnie
Vinnie's Auto Shop
300 City Alley
Sosueme, OR 76543
</pre>



1.1                  xml/htdocs/doc/en/articles/l-awk3.xml

file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk3.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo
plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk3.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo

Index: l-awk3.xml
===================================================================
<?xml version='1.0' encoding="UTF-8"?>
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/articles/l-awk3.xml,v 1.1 2005/07/28 08:04:04 neysx Exp $ -->
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

<guide link="/doc/en/articles/l-awk3.xml">
<title>Awk by example, Part 3</title>

<author title="Author">
  <mail link="[EMAIL PROTECTED]">Daniel Robbins</mail>
</author>
<author title="Editor">
  <mail link="[EMAIL PROTECTED]">Łukasz Damentko</mail>
</author>

<abstract>
In this sequel to his previous intro to awk, Daniel Robbins continues to explore
awk, a great language with a strange name. Daniel will show you how to handle
multi-line records, use looping constructs, and create and use awk arrays. By
the end of this article, you'll be well versed in a wide range of awk features,
and you'll be ready to write your own powerful awk scripts.
</abstract>

<!-- The original version of this article was published on IBM developerWorks, and is property of Westtech Information Services. This document is an updated version of the original article, and contains various improvements made by the Gentoo Linux Documentation team -->

<version>1.0</version>
<date>2005-07-27</date>

<chapter>
<title>String functions and ... checkbooks?</title>
<section>
<title>Formatting output</title>
<body>

<p>
While awk's print statement does do the job most of the time, sometimes more is
needed.
For those times, awk offers two good old friends called printf() and sprintf().
Yes, these functions, like so many other awk parts, are identical to their C
counterparts. printf() will print a formatted string to stdout, while sprintf()
returns a formatted string that can be assigned to a variable. If you're not
familiar with printf() and sprintf(), an introductory C text will quickly get
you up to speed on these two essential printing functions. You can view the
printf() man page by typing "man 3 printf" on your Linux system.
</p>

<p>
Here's some sample awk sprintf() and printf() code. As you can see, everything
looks almost identical to C.
</p>

<pre caption="Sample awk sprintf() and printf() code">
x=1
b="foo"
printf("%s got a %d on the last test\n","Jim",83)
myout=sprintf("%s-%d",b,x)
print myout
</pre>

<p>
This code will print:
</p>

<pre caption="Code output">
Jim got a 83 on the last test
foo-1
</pre>

</body>
</section>

<section>
<title>String functions</title>
<body>

<p>
Awk has a plethora of string functions, and that's a good thing. In awk, you
really need string functions, since you can't treat a string as an array of
characters as you can in other languages like C, C++, and Python. For example,
if you execute the following code:
</p>

<pre caption="Example code">
mystring="How are you doing today?"
print mystring[3]
</pre>

<p>
You'll receive an error that looks something like this:
</p>

<pre caption="Example code error">
awk: string.gawk:59: fatal: attempt to use scalar as array
</pre>

<p>
Oh, well. While not as convenient as Python's sequence types, awk's string
functions get the job done. Let's take a look at them.
</p>

<p>
First, we have the basic length() function, which returns the length of a
string. Here's how to use it:
</p>

<pre caption="length() function example">
print length(mystring)
</pre>

<p>
This code will print the value:
</p>

<pre caption="Printed value">
24
</pre>

<p>
OK, let's keep going.
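</p>

<p>
The sprintf() and length() calls covered so far can be tried together. The
following is only a sketch: the format_label helper and the variable names
label and n are made up for illustration, and the whole program lives in a BEGIN
block so that it needs no input file.
</p>

```shell
# sprintf() returns the formatted string instead of printing it, so the
# result can be stored in a variable and measured with length().
# The names format_label, label and n are hypothetical.
format_label() {
    awk 'BEGIN { label = sprintf("%s-%03d", "foo", 7); n = length(label); print label, n }'
}

format_label
```

<p>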
The next string function is called index(), and returns the position of the
first occurrence of a substring within another string, or 0 if the substring
isn't found. Using mystring, we can call it this way:
</p>

<pre caption="index() function example">
print index(mystring,"you")
</pre>

<p>
Awk prints:
</p>

<pre caption="Function output">
9
</pre>

<p>
We move on to two more easy functions, tolower() and toupper(). As you might
guess, these functions will return the string with all characters converted to
lowercase or uppercase respectively. Notice that tolower() and toupper() return
the new string, and don't modify the original. This code:
</p>

<pre caption="Converting strings to lower or uppercase">
print tolower(mystring)
print toupper(mystring)
print mystring
</pre>

<p>
...will produce this output:
</p>

<pre caption="Output">
how are you doing today?
HOW ARE YOU DOING TODAY?
How are you doing today?
</pre>

<p>
So far so good, but how exactly do we select a substring or even a single
character from a string? That's where substr() comes in. Here's how to call
substr():
</p>

<pre caption="substr() function example">
mysub=substr(mystring,startpos,maxlen)
</pre>

<p>
mystring should be either a string variable or a literal string from which you'd
like to extract a substring. startpos should be set to the starting character
position, and maxlen should contain the maximum length of the string you'd like
to extract. Notice that I said maximum length; if length(mystring) is shorter
than startpos+maxlen, your result will be truncated. substr() won't modify the
original string, but returns the substring instead. Here's an example:
</p>

<pre caption="Another example">
print substr(mystring,9,3)
</pre>

<p>
Awk will print:
</p>

<pre caption="What awk prints">
you
</pre>

<p>
If you regularly program in a language that uses array indices to access parts
of a string (and who doesn't?), make a mental note that substr() is your awk
substitute.
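</p>

<p>
As a rough sketch of substr() standing in for array-style indexing, the snippet
below picks out the first and last characters of a string. The first_and_last
helper and the variable s are hypothetical names; the string is passed in with
awk's standard -v option.
</p>

```shell
# substr(s, i, 1) plays the role that s[i] would play in C or Python.
# awk character positions start at 1, so the last character sits at length(s).
# The names first_and_last and s are hypothetical.
first_and_last() {
    awk -v s="$1" 'BEGIN { print substr(s, 1, 1), substr(s, length(s), 1) }'
}

first_and_last "awk"
```

<p>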
You'll need to use it to extract single characters and substrings; because awk
is a string-based language, you'll be using it often.
</p>

<p>
Now, we move on to some meatier functions, the first of which is called match().
match() is a lot like index(), except instead of searching for a substring like

-- 
[email protected] mailing list
