Hi all,
The task I'm trying to accomplish:
Given a log file containing several lines of white space delimited entries like
this:
[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not
exist
[Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does not
exist
[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not
exist
[Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not
exist
filter lines that match the string " 08:", extract the 6th, 7th and 8th words
from that line, group all lines that have the the same resulting string, do a
count on them and sort the result in descending order of counts and print it
out. So in the example above we'd end up with an output like this:
("GET /url1 HTTP/1.1]:", 2)
("GET /url2 HTTP/1.0]:", 1)
Seems pretty straightforward, so I wrote a simple perl script to achieve this
task (see the bottom of this email).
The input file is 335 MB in size and contains about 2 million log line entires
in it. The perl script does a pretty decent job and finishes in about 3 seconds.
Now the interesting part. I decided to implememt this in Haskell (being my
favorite language and all) and ended up with the following code:
--- begin haskell code ---
import Text.Regex.Posix ( (=~) )
import qualified Data.List as List
import qualified Data.Map as Map
import qualified Data.ByteString.Lazy.Char8 as LB
main = do
contents <- LB.readFile "log_file"
putStr . unlines . map ( show . (\(x, y) -> ((LB.unpack x), y)) ) .
-- create a Map grouping & counting matching tokens and sort based on the
counts
List.sortBy (\(_, x) (_, y) -> y `compare` x) . Map.toList .
Map.fromListWith (+) . filtertokens .
LB.lines $ contents
where filtertokens = foldr (\x acc -> if (f x) then ((g x) : acc) else acc) []
-- filter lines starting with " 08:"
where f = (=~ " 08:") . LB.unpack
-- extract tokens 6, 7 & 8 and create an association list like
so ("GET /url2 HTTP/1.0]:", 1)
g line = flip (,) 1 . LB.unwords . map (xs !!) $ [6, 7, 8]
where xs = LB.words line
--- end haskell code ---
This haskell implementation takes a whopping 27 seconds to complete! About 9
times slower than the perl version! I'm using ghc 6.10.4, compiling with -O2
and even went to the extent of fusing an adjacent map and filter using a foldr
like so: map f (filter g) => foldr ( if g x then f x ... ), fusing adjacents
maps etc. Still the same result.
I really hope I'm missing some obvious optimization that's making it so slow
compared to the perl version, hence this email soliciting feedback.
Thanks in advance.
P.S. For reference, here's my corresponding perl implementation:
--- start perl code ---
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
my %urls;
open(FILE, '<', $ARGV[0]);
while(<FILE>) {
if (/ 08:/) {
my @words = split;
my $key = join(" ", ($words[6], $words[7], $words[8]));
if (exists $urls{$key}) { $urls{$key}++ }
else { $urls{$key} = 1 }
}
}
for (sort { $urls{$b} <=> $urls{$a} } keys %urls) { print "($_, $urls{$_})\n" }
--- end perl code ---
_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe