Hi,

I am just trying to wrap my head around the Hadoop programming model, so
please be gentle :)

As a non-trivial example program, I am trying to parse a standard Postfix
log file to extract two sets of data: first, details about each mail sent,
and second, the concurrency of connections to the server. The log file
consists of two types of lines corresponding to the two sets of info I
want to extract: a couple of lines about each mail sent, and a single line
each for connect and disconnect. An example looks like this:

Jun 12 23:29:02 chn-smtp postfix/smtpd[19603]: 9107EC2C1B:
sasl_method=LOGIN, [EMAIL PROTECTED]

Jun 12 23:29:02 chn-smtp postfix/cleanup[19674]: 9107EC2C1B:
message-id=<[EMAIL PROTECTED]>

Jun 12 23:29:06 chn-smtp postfix/qmgr[24008]: 9107EC2C1B:
from=<[EMAIL PROTECTED]>, size=183009, nrcpt=1 (queue active)

Jun 12 23:29:08 chn-smtp postfix/smtp[19677]: 9107EC2C1B:
to=<[EMAIL PROTECTED]>,
relay=gmail-smtp-in.l.google.com[209.85.143.27]:25, delay=5.8,
delays=3.6/0.03/0.32/1.9, dsn=2.0.0, status=sent (250 2.0.0 OK
1213293645 w12si1745437tib.1)

Jun 12 23:29:08 chn-smtp postfix/qmgr[24008]: 9107EC2C1B: removed

Jun 12 06:27:26 chn-smtp postfix/smtpd[3273]: connect from unknown[xx.xx.xx.xx]

Jun 12 06:27:26 chn-smtp postfix/smtpd[3273]: disconnect from
unknown[xx.x.xx.xx]

As you can see, each line about a mail carries a unique msg id (the
Postfix queue ID, 9107EC2C1B here), while connect/disconnect lines do not
have one.
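
To make the two line types concrete, here is a rough sketch of how I
imagine telling them apart. It is only an illustration: the names
classify, LINE_RE, QUEUE_ID_RE and CONN_RE are made up by me, and I am
assuming each log entry sits on one physical line (the samples above are
wrapped only by my mail client).

```python
import re

# Assumed shape: "<Mon> <day> <time> <host> postfix/<daemon>[<pid>]: <rest>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"postfix/(?P<daemon>[\w-]+)\[(?P<pid>\d+)\]:\s+(?P<rest>.*)$"
)
# Assumption: the msg id is a run of upper-case hex digits followed by ":"
QUEUE_ID_RE = re.compile(r"^(?P<qid>[0-9A-F]{6,}):\s*(?P<detail>.*)$")
CONN_RE = re.compile(r"^(?P<event>connect|disconnect) from (?P<client>\S+)")


def classify(line):
    """Return (kind, key, payload) for one log line.

    kind is "mail", "connect", "disconnect" or "other".
    For mail lines the key is the msg id; for connection lines it is the
    smtpd process id taken from the square brackets.
    """
    m = LINE_RE.match(line.strip())
    if not m:
        return ("other", None, None)
    rest = m.group("rest")
    q = QUEUE_ID_RE.match(rest)
    if q:
        return ("mail", q.group("qid"), q.group("detail"))
    c = CONN_RE.match(rest)
    if c:
        return (c.group("event"), m.group("pid"), c.group("client"))
    return ("other", None, None)
```

With this, the "removed" line above would come out as a mail line keyed
by 9107EC2C1B, and the connect/disconnect lines as connection events
keyed by the smtpd pid.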

My plan of attack is to write a map function that reads in the logs
and emits one key-value pair for each mail, with the msg id as key and all
lines corresponding to that key as value. The reduce function can then
read in each key-value pair and write out the details about one mail.
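
Here is a minimal sketch of that plan in plain Python, just to check my
thinking. It is not Hadoop code: run_job only imitates the grouping by
key that I understand the framework does between map and reduce, and in
this sketch the map emits one pair per line rather than one per mail.
The names map_line, reduce_mail and run_job, and the fields I pull out,
are my own assumptions.

```python
import re
from collections import defaultdict

# Assumption: mail lines look like "postfix/<daemon>[<pid>]: <MSGID>: <detail>"
MAIL_RE = re.compile(r"postfix/[\w-]+\[\d+\]:\s+([0-9A-F]{6,}):\s*(.*)$")


def map_line(line):
    """Map step: emit (msg_id, detail) for mail lines, nothing otherwise."""
    m = MAIL_RE.search(line.strip())
    if m:
        yield m.group(1), m.group(2)


def reduce_mail(msg_id, details):
    """Reduce step: fold all details of one mail into a single record."""
    record = {"id": msg_id, "from": None, "to": [], "status": None,
              "removed": False}
    for detail in details:
        if detail == "removed":
            record["removed"] = True
            continue
        sender = re.search(r"\bfrom=<([^>]*)>", detail)
        if sender:
            record["from"] = sender.group(1)
        rcpt = re.search(r"\bto=<([^>]*)>", detail)
        if rcpt:
            record["to"].append(rcpt.group(1))
        status = re.search(r"\bstatus=(\w+)", detail)
        if status:
            record["status"] = status.group(1)
    return record


def run_job(lines):
    """Imitate map -> group by key -> reduce on a list of log lines."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return [reduce_mail(key, values) for key, values in sorted(groups.items())]
```

Connect/disconnect lines simply produce no output in this map, so they
would need a separate job or a second kind of key.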

The problem here is that details about a single mail can span
multiple days' log files (i.e. the mail sits in the queue for multiple
days), so the reduce function may not have all the data required to write
full information about a mail. In such a case a partial output should be
written, giving the current status, and the input should be updated by
the next day's map and tried again, writing the full information once it
is available. Is this the best way to solve this problem using MR? How do
I take care of missing entries?
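
To show what I mean by "tried again", here is a rough sketch of the
carry-over idea in plain Python (again, not Hadoop code). I am assuming
that a "removed" line marks a mail as finished, which is only my reading
of the sample above, and the names run_day and split_finished are made
up for illustration.

```python
from collections import defaultdict


def split_finished(grouped):
    """Split msg_id -> details into (finished, pending).

    Assumption: a mail is finished once a "removed" detail has been seen.
    """
    finished, pending = {}, {}
    for msg_id, details in grouped.items():
        if "removed" in details:
            finished[msg_id] = details
        else:
            pending[msg_id] = details
    return finished, pending


def run_day(todays_pairs, carried_over):
    """Merge yesterday's pending mails with today's (msg_id, detail) pairs."""
    grouped = defaultdict(list)
    for msg_id, details in carried_over.items():
        grouped[msg_id].extend(details)
    for msg_id, detail in todays_pairs:
        grouped[msg_id].append(detail)
    return split_finished(grouped)
```

The pending part of one day's output would be fed in as extra input next
to the following day's log, so each day's job writes full records for
finished mails and partial ones for the rest.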

I have no idea how to solve the second problem using MR, i.e. given a
time series of connect/disconnect lines, I want to make a histogram of
the number of simultaneous connections.
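
Outside of MR, I would compute this with a simple sweep over the events
sorted by time, as in the sketch below. I am assuming the timestamps
have already been converted to seconds and that the histogram should
record how many seconds were spent at each concurrency level; the name
concurrency_histogram is my own. What I cannot see is how to express
the global sort and the running counter as map and reduce steps.

```python
from collections import Counter


def concurrency_histogram(events):
    """events: iterable of (timestamp_seconds, "connect" or "disconnect").

    Returns a Counter mapping concurrency level -> seconds spent at it.
    """
    delta = {"connect": 1, "disconnect": -1}
    # Sort by time. The order of events sharing a timestamp does not
    # matter here, because zero-length intervals add nothing below.
    ordered = sorted((t, delta[kind]) for t, kind in events)
    hist = Counter()
    current = 0      # connections open right now
    prev_t = None    # time of the previous event
    for t, change in ordered:
        if prev_t is not None and t > prev_t:
            hist[current] += t - prev_t
        current += change
        prev_t = t
    return hist
```

For example, connects at seconds 0 and 5 with disconnects at 8 and 10
give 7 seconds at one connection and 3 seconds at two.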

As you can see, I am just learning the Hadoop way of solving a problem
and have not yet touched the coding part. I want to get my thinking
right before coding.

thanks and regards,

raj
