Hi, I am just trying to wrap my head around the Hadoop programming model, so please be gentle :)
As a non-trivial example program, I am trying to parse a standard Postfix log file to extract two sets of data: first, details about each mail sent, and second, the concurrency of connections to the server. The log file consists of two types of lines, corresponding to the two sets of information I want to extract: a couple of lines about each mail sent, and a single line each for connect and disconnect. An example looks like this:

Jun 12 23:29:02 chn-smtp postfix/smtpd[19603]: 9107EC2C1B: sasl_method=LOGIN, [EMAIL PROTECTED]
Jun 12 23:29:02 chn-smtp postfix/cleanup[19674]: 9107EC2C1B: message-id=<[EMAIL PROTECTED]>
Jun 12 23:29:06 chn-smtp postfix/qmgr[24008]: 9107EC2C1B: from=<[EMAIL PROTECTED]>, size=183009, nrcpt=1 (queue active)
Jun 12 23:29:08 chn-smtp postfix/smtp[19677]: 9107EC2C1B: to=<[EMAIL PROTECTED]>, relay=gmail-smtp-in.l.google.com[209.85.143.27]:25, delay=5.8, delays=3.6/0.03/0.32/1.9, dsn=2.0.0, status=sent (250 2.0.0 OK 1213293645 w12si1745437tib.1)
Jun 12 23:29:08 chn-smtp postfix/qmgr[24008]: 9107EC2C1B: removed
Jun 12 06:27:26 chn-smtp postfix/smtpd[3273]: connect from unknown[xx.xx.xx.xx]
Jun 12 06:27:26 chn-smtp postfix/smtpd[3273]: disconnect from unknown[xx.x.xx.xx]

As you can see, every line about a given mail carries the same msg id (9107EC2C1B, the Postfix queue id), while connect/disconnect lines do not have one.

My plan of attack is to write a map function which reads in the logs and emits one key/value pair for each mail, with the msg id as key and all the lines corresponding to that key as value. The reduce function can then read in each key/value pair and write out the details about one mail.

The problem here is that details about a single mail can span multiple days' log files (i.e. the mail sits in the queue for multiple days), so the reduce function may not have all the data required to write full information about a mail. In such a case a partial output should be written, giving the current status; the input should then be updated by the next day's map and tried again, writing full information once it is available.
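
To make the plan concrete, it could be sketched roughly as below. This is plain Python that only simulates map, shuffle and reduce in memory, not real Hadoop code. The names `map_line`, `reduce_mail` and `run` are made up for illustration, and the regex assumes the short upper-case hexadecimal queue ids seen in the sample lines. Note that in this sketch the map step emits one pair per log line and the grouping by key is left to the shuffle, which is how Hadoop hands values to a reducer.

```python
import re
from collections import defaultdict

# Assumed line shape (taken from the sample above):
#   "<timestamp> <host> postfix/<daemon>[<pid>]: <QUEUEID>: <details>"
# Connect/disconnect lines have no queue id and therefore do not match.
MAIL_LINE = re.compile(r"postfix/\w+\[\d+\]: ([0-9A-F]+): (.*)$")


def map_line(line):
    """Map step: emit (queue id, details) for a mail line, nothing otherwise."""
    m = MAIL_LINE.search(line)
    if m:
        yield m.group(1), m.group(2)


def reduce_mail(queue_id, details):
    """Reduce step: summarise one mail; 'complete' stays False until 'removed' is seen."""
    record = {"id": queue_id, "from": None, "to": [], "status": None, "complete": False}
    for d in details:
        if d.startswith("from="):
            record["from"] = d.split(",")[0][len("from="):]
        elif d.startswith("to="):
            record["to"].append(d.split(",")[0][len("to="):])
            status = re.search(r"status=(\w+)", d)
            if status:
                record["status"] = status.group(1)
        elif d == "removed":
            record["complete"] = True
    return record


def run(lines):
    """Simulate the shuffle: group mapped values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return [reduce_mail(key, values) for key, values in groups.items()]
```

Records that come back with `complete` set to False would be the partial outputs; one option, still an assumption on my side, is to keep the raw lines of those mails and feed them in again together with the next day's log.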
Is this the best way to solve this problem using MR? How should I take care of the missing entries?

I have no idea how to solve the second problem using MR, i.e. given a time-ordered list of connect/disconnect lines, I want to make a histogram of the number of simultaneous connections.

As you can see, I am just learning the Hadoop way of solving a problem and have not yet touched the coding part. I want to get my thinking correct before coding.

Thanks and regards,
raj
